[Apertium-stuff] soft hyphens and tokenisation

Kevin Brubeck Unhammer Tue, 17 Apr 2012 06:52:42 -0700

Hi,

I notice that soft/hidden hyphens (&#173;) can split words, e.g. in


    Jespersen

there's a soft hyphen between n and t, but it should be analysed as one
word. I've noticed this a lot in web pages; I guess a lot of news sites
and such use programs that hyphenate using that character.

The problem is, if we don't have the soft hyphen in <alphabet>, we get
two lexical units; if we have it there, we get one unknown word, even if
"Jespersen" is in the dix.

Is it possible to use ACX files[1] or something to say that any soft hyphen
can be skipped? It seems sort of similar to what ACX does at least …
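
As a stopgap, the soft hyphens could be stripped before the text ever
reaches the analyser. A minimal sketch, assuming a separate Python
preprocessing step (the function name strip_soft_hyphens is made up for
illustration, it is not part of Apertium or lttoolbox):

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, the character written as &#173; in HTML


def strip_soft_hyphens(text: str) -> str:
    """Return text with every soft hyphen removed, so that a word
    split by one is rejoined into a single token."""
    return text.replace(SOFT_HYPHEN, "")


# A word with a soft hyphen inside it comes out as one plain word.
sample = "Jesper" + SOFT_HYPHEN + "sen"
cleaned = strip_soft_hyphens(sample)
```

The drawback is that the hyphenation hints are lost from the output, so
this only works around the tokenisation problem rather than solving it
in the analyser itself.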


[1]  http://wiki.apertium.org/wiki/Acx


-Kevin





_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
