Hi,
I notice that soft/hidden hyphens (­) can split words, e.g. in
Jespersen
there's a soft hyphen between n and t, but it should be analysed as one
word. I've noticed this a lot in web pages; I guess a lot of news sites
and such use programs that hyphenate with that character.
The problem is, if we don't have the soft hyphen in <alphabet>, we get
two lexical units; if we have it there, we get one unknown word, even if
"Jespersen" is in the dix.
Is it possible to use ACX files[1] or something to say that any soft hyphen
can be skipped? It seems sort of similar to what ACX does at least …
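
To make "skipped" concrete, here is a rough Python sketch of the behaviour
I'm after, done as a preprocessing step before the text reaches the analyser.
This is only an illustration, not something Apertium or ACX does as far as I
know; the function name strip_soft_hyphens and the position of the soft hyphen
in the sample are made up for the example.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, invisible in most renderers


def strip_soft_hyphens(text: str) -> str:
    """Drop every soft hyphen so a hyphenated word becomes one token again."""
    return text.replace(SOFT_HYPHEN, "")


if __name__ == "__main__":
    # Hypothetical input: the name with a soft hyphen somewhere inside it.
    sample = "Jesper" + SOFT_HYPHEN + "sen"
    print(strip_soft_hyphens(sample))
```

The downside of doing it this way is that the hyphenation hints are gone from
the output as well, which is why I'd rather have the analyser ignore them.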
[1] http://wiki.apertium.org/wiki/Acx
-Kevin
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff