"Jimmy O'Regan" <[email protected]>
writes:

> On 17 April 2012 14:51, Kevin Brubeck Unhammer <[email protected]> wrote:
>> Hi,
>>
>> I notice that soft/hidden hyphens (&#173;) can split words, e.g. in
>>
>>    Jesper­sen
>>
>> there's a soft hyphen between r and s, but it should be analysed as one
>> word. I've noticed this a lot in web pages; I guess a lot of news sites
>> and such use programs that hyphenate using that character.
>>
>> The problem is, if we don't have the soft hyphen in <alphabet>, we get
>> two lexical units; if we have it there, we get one unknown word, even if
>> "Jespersen" is in the dix.
>>
>> Is it possible to use ACX files[1] or something to say that any soft hyphen
>> can be skipped? It seems sort of similar to what ACX does at least …
>
> Not really. What ACX does is, where a character appears in
> compilation, it inserts an alternative. If the input is 'ș' and 'ş' is
> listed as an alternative, as well as inserting the transition ș:ș, it
> also inserts the transition ş:ș. It's more or less equivalent to
> adding <pardef n="ș"><e><i>ș</i></e><e><p><l>ş</l><r>ș</r></p></e></pardef>
> and then substituting it:
> :%s/ș/<pardef n="ș"\/>/g
> (but without the pitfalls :)
>
> Or, in other words, with ACX we know where to make the replacements;
> with a soft hyphen, we don't.
>
> Trying to build it into the transducer would involve various shades of
> madness, from the outright insane (insert an optional soft hyphen
> after every non-final character) to the slightly insane, extremely
> tedious (insert the optional transitions based on a hyphenation
> dictionary).
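
To make the contrast concrete, here is a toy sketch in Python. It is not
lttoolbox code; the function names and the alternatives table are made up
for illustration. It only enumerates accepted surface forms, it does not
build a transducer. With ACX-style alternatives the variants appear only
where a listed character occurs, while an optional soft hyphen after every
non-final character doubles the number of surface forms at each position.

```python
SOFT_HYPHEN = "\u00ad"

def acx_variants(word, alternatives):
    """Surface forms accepted when each listed character may also be
    written as one of its alternatives (the ACX idea, simplified)."""
    forms = [""]
    for ch in word:
        options = [ch] + alternatives.get(ch, [])
        forms = [prefix + opt for prefix in forms for opt in options]
    return forms

def soft_hyphen_variants(word):
    """Surface forms accepted if an optional soft hyphen may follow
    every non-final character of the word."""
    if not word:
        return [""]
    forms = [""]
    for ch in word[:-1]:
        forms = [prefix + ch + opt
                 for prefix in forms
                 for opt in ("", SOFT_HYPHEN)]
    return [prefix + word[-1] for prefix in forms]

# Hypothetical alternatives table: 'ş' is accepted wherever 'ș' occurs.
print(acx_variants("și", {"ș": ["ş"]}))
print(len(soft_hyphen_variants("Jespersen")))
```

A nine-letter word has eight non-final characters, so the second function
returns 2**8 surface forms for it, against two for the ACX case above.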

:-/ Another alternative would be to have a second alphabet of skippable
characters that's consulted by lt-proc while analysing, but that seems
like a lot of work for such a little thing.
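
Roughly what I have in mind, as a toy sketch in Python rather than anything
resembling the real lt-proc internals: a plain dict stands in for the
compiled dictionary, and SKIPPABLE, analyse() and the <np> tag are made up
for the example. The skippable characters are ignored for the lookup, but
the surface form in the output keeps them.

```python
# Hypothetical second "alphabet": characters ignored during lookup.
SKIPPABLE = {"\u00ad"}  # soft hyphen

def analyse(surface, lexicon, skippable=SKIPPABLE):
    """Look up `surface` in `lexicon`, ignoring skippable characters.

    The surface form is emitted unchanged, so the soft hyphen survives
    in the output; unknown words get the usual '*' mark.
    """
    key = "".join(ch for ch in surface if ch not in skippable)
    analysis = lexicon.get(key)
    if analysis is None:
        return "^" + surface + "/*" + surface + "$"
    return "^" + surface + "/" + analysis + "$"

# Toy lexicon with an illustrative analysis string.
lexicon = {"Jespersen": "Jespersen<np>"}
print(analyse("Jesper\u00adsen", lexicon))
print(analyse("Jespersen", lexicon))
```

Both calls give the same analysis, and only the first keeps the soft hyphen
in its surface form.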

I guess sed is the saner alternative, even though it can't promise to
keep these "blanks" throughout the process. Come to think of it, this is
similar to the problem of non-space superblanks in the middle of words
(Jesper<i>sen) which was discussed some time ago without finding a real
solution (apart from sedding the blank outside the word before
translating).
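
For the record, the preprocessing could look something like the sketch
below (Python here instead of sed, just to keep it self-contained). The
function names are made up, and I'm assuming the deformatter has already
wrapped the inline tag in square brackets as a superblank, i.e.
Jesper[<i>]sen. It does not handle escaped brackets, and the soft hyphens
are simply dropped, so they can't be restored afterwards.

```python
import re

SOFT_HYPHEN = "\u00ad"

# A superblank "[...]" with word characters directly on both sides.
INLINE_BLANK = re.compile(r"(\w+)(\[[^\]]*\])(?=\w)")

def strip_soft_hyphens(text):
    """Delete every soft hyphen (U+00AD) from the text."""
    return text.replace(SOFT_HYPHEN, "")

def move_blanks_out(text):
    """Move superblanks that sit inside a word to just before that word.

    Repeats until nothing changes, so a word containing several inline
    blanks is handled too.
    """
    previous = None
    while previous != text:
        previous = text
        text = INLINE_BLANK.sub(r"\2\1", text)
    return text

print(strip_soft_hyphens("Jesper\u00adsen"))
print(move_blanks_out("Jesper[<i>]sen"))
```

Moving the blank in front of the word means the formatting no longer starts
where it originally did, which is the price of keeping the word in one
piece.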



-Kevin

