Re: [Apertium-stuff] soft hyphens and tokenisation

Jimmy O'Regan Tue, 17 Apr 2012 09:13:06 -0700

On 17 April 2012 14:51, Kevin Brubeck Unhammer <[email protected]> wrote:
> Hi,
>
> I notice that soft/hidden hyphens (&#173;) can split words, e.g. in
>
>    Jespersen
>
> there's a soft hyphen between n and t, but it should be analysed as one
> word. I've noticed this a lot in web pages; I guess a lot of news sites
> and such use programs that hyphenate using that character.
>
> The problem is, if we don't have the soft hyphen in <alphabet>, we get
> two lexical units; if we have it there, we get one unknown word, even if
> "Jespersen" is in the dix.
>
> Is it possible to use ACX files[1] or something to say that any soft hyphen
> can be skipped? It seems sort of similar to what ACX does at least …


Not really. What ACX does is this: wherever a listed character appears
in the dictionary at compilation, it also inserts the alternatives. If
the dictionary has 'ș' and 'ş' is listed as an alternative, then as well
as inserting the transition ș:ș, it also inserts the transition ş:ș.
It's more or less equivalent to adding

  <pardef n="ș"><e><i>ș</i></e><e><p><l>ş</l><r>ș</r></p></e></pardef>

and then substituting a reference to it for every ș in the entries:

  :%s/ș/<par n="ș"\/>/g

(but without the pitfalls :)

Or, in other words, with ACX we know where to make the replacements;
with a soft hyphen, we don't.
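
To make "we know where to make the replacements" concrete, here is a minimal Python sketch of that kind of expansion. The ALTERNATIVES table and the surface_variants function are hypothetical names invented for illustration; they are not part of lttoolbox, and the real compiler adds transitions to the transducer rather than enumerating strings.

```python
from itertools import product

S_COMMA = "\u0219"    # ș, the character used in the dictionary
S_CEDILLA = "\u015f"  # ş, the alternative accepted in the input

# Hypothetical stand-in for an ACX file:
# dictionary character -> extra input characters accepted in its place.
ALTERNATIVES = {S_COMMA: [S_CEDILLA]}


def surface_variants(lexical_form, alternatives=ALTERNATIVES):
    """Return every input spelling that should analyse as lexical_form.

    Each position is either the dictionary character itself or one of its
    listed alternatives, so the places where a choice exists are known in
    advance: they are exactly the positions holding a listed character.
    """
    choices = [[ch] + alternatives.get(ch, []) for ch in lexical_form]
    return ["".join(combo) for combo in product(*choices)]
```

For a form starting with ș followed by "i", this gives two accepted spellings: the original one and the one with ş in first position. Nothing comparable exists for a soft hyphen, because there is no character in the entry to hang the alternative on.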

Trying to build it into the transducer would involve various shades of
madness, from the outright insane (insert an optional soft hyphen
after every non-final character) to the slightly insane, extremely
tedious (insert the optional transitions based on a hyphenation
dictionary).
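
As a rough illustration of the "outright insane" option, the Python sketch below enumerates the spellings a single entry would have to accept if an optional soft hyphen were allowed after every non-final character. The function name soft_hyphen_variants is made up for this example. It counts accepted surface forms only; a compiled transducer would share states and pay for this in extra transitions rather than in an explicit list.

```python
from itertools import product

SOFT_HYPHEN = "\u00ad"  # U+00AD, the character written as &#173; in HTML


def soft_hyphen_variants(word):
    """Return all spellings of word with an optional soft hyphen
    after each non-final character."""
    if not word:
        return [""]
    gaps = len(word) - 1
    variants = []
    # Each gap between two characters either gets a soft hyphen or nothing.
    for mask in product(("", SOFT_HYPHEN), repeat=gaps):
        # Pad with "" so the final character is never followed by a hyphen.
        separators = mask + ("",)
        variants.append("".join(ch + sep for ch, sep in zip(word, separators)))
    return variants


if __name__ == "__main__":
    forms = soft_hyphen_variants("Jespersen")
    print(len(forms))  # 2 ** 8, one factor of two per gap in a 9-letter word
```

Every one of those forms should map back to the same dictionary entry, and that holds for each entry in the dix.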

-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you
