On 2013-05-08 18:08, Daniel Naber wrote:
> On 07.05.2013 23:33, Marcin Miłkowski wrote:
>
>> Well, for me it seems to be the same issue still, as I haven't been
>> given any reason to believe that hunspell expansion would not give me
>> a compounding mechanism for our speller (beyond the size of the word
>> list).
>
> I see no reason other than the size of the list. As every noun can
> basically be combined with every other noun, you'll have 30,000^2
> combinations if there are 30,000 nouns. And as there are not only
> compounds made up of two words, you'd have another 30,000^3 words if you
> consider all three-part compounds.
Yes, I did some calculations this morning and I can see your point.
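A rough sketch of that calculation, taking Daniel's figure of 30,000 nouns as an example value and assuming every noun can combine freely with every other (the function name is purely illustrative):

```python
def compound_candidates(nouns, max_parts):
    """Count naive n-part noun compounds for n = 2..max_parts."""
    return {parts: nouns ** parts for parts in range(2, max_parts + 1)}


# 30,000 is the example figure from the discussion, not a measured count.
counts = compound_candidates(30_000, 3)
for parts, count in counts.items():
    print(f"{parts}-part compounds: {count:,}")
print(f"total: {sum(counts.values()):,}")
```

Two-part compounds alone come to 900 million entries, and three-part compounds to 27 trillion, so a fully expanded word list is out of the question.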
> But the way hunspell works can probably be mapped to an FSA. The
> hunspell compound tags of the words say:
> * this is only a compound beginning, not a stand-alone word ("Arbeits"
> in German)
> * this is only a compound part, but not at the beginning (basically
> any noun but spelled lowercase, and a lot of other words)
> * this is a noun that can be used both stand-alone and as a
> compound beginning (most nouns in German)
This is very easy to implement if we have some tags in the automaton,
and we could have a method that works the opposite way to the
replaceRunOnWords() method. These tags could also be used by
replaceRunOnWords() to avoid inserting spaces after compound beginnings
(I guess there might also be compound suffixes).
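To make the idea concrete, here is a minimal sketch of tag-driven compound checking. It uses a plain dictionary instead of an automaton, and the tag names, the toy lexicon, and the functions accepts() and space_allowed() are all hypothetical; the real hunspell flags may have slightly different meanings.

```python
STANDALONE = "standalone"          # may be used as a word on its own
COMPOUND_BEGIN = "compound_begin"  # may start a compound
COMPOUND_PART = "compound_part"    # may appear in a compound, but not first

# Hypothetical toy lexicon; in practice the tags would be stored in the automaton.
LEXICON = {
    "Arbeits": {COMPOUND_BEGIN},
    "Haus": {STANDALONE, COMPOUND_BEGIN},
    "Tür": {STANDALONE, COMPOUND_BEGIN},
    "tür": {COMPOUND_PART},
    "amt": {COMPOUND_PART},
}


def tags(word):
    return LEXICON.get(word, set())


def is_compound(word, first=True):
    """True if word splits into a beginning followed by one or more parts."""
    needed = COMPOUND_BEGIN if first else COMPOUND_PART
    for i in range(1, len(word) + 1):
        head, tail = word[:i], word[i:]
        if needed not in tags(head):
            continue
        if not tail:
            if not first:  # a compound needs at least two pieces
                return True
        elif is_compound(tail, first=False):
            return True
    return False


def accepts(word):
    """Accept stand-alone words and compounds built from tagged pieces."""
    return STANDALONE in tags(word) or is_compound(word)


def space_allowed(left, right):
    """Run-on splitting: only suggest a space between two stand-alone words."""
    return STANDALONE in tags(left) and STANDALONE in tags(right)
```

With this lexicon, accepts("Arbeitsamt") and accepts("Haustür") hold, while "Arbeits" on its own is rejected and space_allowed("Arbeits", "amt") is false, so no "Arbeits amt" suggestion would be produced.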
>
> Actually the tags' meaning might be slightly different (didn't look
> them up now), but all of this can be, I think, expressed by interpreting
> an FSA that's built accordingly and without the need to generate a word
> list. A black list of "invalid" words is needed anyway.
Yes, and this may be embedded in the automaton with an additional
"forbid" flag.
>
> I don't have time to dig into this now, but could write test cases etc.
I'm afraid I'm a little too busy lately, but given the community effort
on diacritics and multi-character replacements, I think we could make it
work together.
But first, we need to have a good specification on our wiki.
Regards,
Marcin
_______________________________________________
Languagetool-devel mailing list
https://lists.sourceforge.net/lists/listinfo/languagetool-devel