On 17 April 2012 14:51, Kevin Brubeck Unhammer <[email protected]> wrote:
> Hi,
>
> I notice that soft/hidden hyphens (­) can split words, e.g. in
>
> Jespersen
>
> there's a soft hyphen between n and t, but it should be analysed as one
> word. I've noticed this a lot in web pages, I guess a lot of news sites
> and such use programs that hyphenate using that character.
>
> The problem is, if we don't have the soft hyphen in <alphabet>, we get
> two lexical units; if we have it there, we get one unknown word, even if
> "Jespersen" is in the dix.
>
> Is it possible to use ACX files[1] or something to say that any soft hyphen
> can be skipped? It seems sort of similar to what ACX does at least …

Not really. What ACX does is, where a character appears in compilation, it inserts an alternative. If the input is 'ș' and 'ş' is listed as an alternative, as well as inserting the transition ș:ș, it also inserts the transition ş:ș. It's more or less equivalent to adding

<pardef n="ș"><e><i>ș</i></e><e><p><l>ş</l><r>ș</r></p></e></pardef>

and then substituting a reference to it:

:%s/ș/<par n="ș"\/>/g

(but without the pitfalls :)

Or, in other words, with ACX we know where to make the replacements; with a soft hyphen, we don't. Trying to build it into the transducer would involve various shades of madness, from the outright insane (insert an optional soft hyphen after every non-final character) to the slightly insane but extremely tedious (insert the optional transitions based on a hyphenation dictionary).

-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you
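
To make the difference concrete, here is a rough Python sketch of the two ideas above. It is only a toy model, not how lttoolbox is implemented: a word is a linear chain of (source state, input, output, target state) transitions, and all function names are made up for illustration. `add_alternatives` mimics the ACX behaviour (an extra transition only where a listed character already occurs), while `add_optional_soft_hyphens` mimics the "outright insane" option, simplified to a self-loop, so it also accepts repeated soft hyphens.

```python
SOFT_HYPHEN = "\u00ad"


def word_to_transitions(word):
    """Toy transducer for one word: state i --ch:ch--> state i + 1."""
    return [(i, ch, ch, i + 1) for i, ch in enumerate(word)]


def add_alternatives(transitions, alternatives):
    """ACX-like expansion (hypothetical helper).

    Wherever a listed character is read, also accept each alternative,
    while still emitting the original character.
    """
    extra = [
        (src, alt, out, dst)
        for (src, inp, out, dst) in transitions
        for alt in alternatives.get(inp, ())
    ]
    return transitions + extra


def add_optional_soft_hyphens(transitions, final_state):
    """The brute-force option (hypothetical helper).

    After every non-final character, add a loop that reads a soft hyphen
    and emits nothing. One loop per state, since nothing in the word
    says where a hyphen may occur.
    """
    states = sorted({dst for (_, _, _, dst) in transitions if dst != final_state})
    return transitions + [(s, SOFT_HYPHEN, "", s) for s in states]


def analyse(transitions, final_state, surface):
    """Walk the toy transducer; return the output string, or None if rejected."""
    table = {(src, inp): (out, dst) for (src, inp, out, dst) in transitions}
    state, output = 0, []
    for ch in surface:
        if (state, ch) not in table:
            return None
        out, state = table[(state, ch)]
        output.append(out)
    return "".join(output) if state == final_state else None
```

With the word "\u0219ti" (ș with comma below) and "\u015f" (ş with cedilla) listed as an alternative, `add_alternatives` adds exactly one transition, at the one place the character occurs. For "Jespersen", `add_optional_soft_hyphens` has to add a loop after every character but the last, and that is for a single word.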
