Re: [Apertium-stuff] soft hyphens and tokenisation

Kevin Brubeck Unhammer Tue, 17 Apr 2012 07:01:04 -0700

Kevin Brubeck Unhammer <[email protected]> writes:

> Hi,
>
> I notice that soft/hidden hyphens (&#173;) can split words, e.g. in
>
>     Jespersen
>
> there's a soft hyphen between n and t, but it should be analysed as one


Whoops, I meant between r and s!

> word. I've noticed this a lot in web pages, I guess a lot of news sites
> and such use programs that hyphenate using that character.
>
> The problem is, if we don't have the soft hyphen in <alphabet>, we get
> two lexical units; if we have it there, we get one unknown word, even if
> "Jespersen" is in the dix.
>
> Is it possible to use ACX files[1] or something to say that any soft hyphen
> can be skipped? It seems sort of similar to what ACX does at least …
>
>
> [1]  http://wiki.apertium.org/wiki/Acx
>
>
> -Kevin
>
>
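
Until there is a proper answer, one possible workaround is to drop the soft
hyphens before the text reaches the analyser. The sketch below is only an
illustration under that assumption: strip_soft_hyphens is a hypothetical
helper, not an existing Apertium feature, and it ignores how the removed
characters would be restored in the output.

```python
# Hypothetical pre-processing step, not part of Apertium itself.
SOFT_HYPHEN = "\u00ad"  # U+00AD, the character that &#173; decodes to


def strip_soft_hyphens(text: str) -> str:
    """Remove every soft hyphen so a word split by one is seen as one token."""
    return text.replace(SOFT_HYPHEN, "")


if __name__ == "__main__":
    # "Jespersen" with a soft hyphen between r and s, as in the example above.
    sample = "Jesper" + SOFT_HYPHEN + "sen"
    print(strip_soft_hyphens(sample))
```

The cleaned text could then be piped through the usual pipeline, at the cost
of losing the hyphenation hints in the translated output.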


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
