Kevin Brubeck Unhammer <[email protected]> writes:

> Francis Tyers <[email protected]> writes:
>
>> On Sun, 28 Feb 2010 at 21:40 +0200, Harri Pitkänen wrote:
>>> On Sunday 28 February 2010, Francis Tyers wrote:
>>> > > I don't know Icelandic at all and therefore can't tell whether some of
>>> > > the words are accepted or rejected incorrectly.
>>> >
>>> > Nice, it looks good. Some of the capitalised words should be recognised
>>> > correctly, at least 'Bretlandi' and 'Norðmenn'.
>>>
>>> I tried to fix the checking of capitalized words but started to run into
>>> problems. It seems that the library API works in somewhat surprising (at
>>> least to me) ways when you enter a word that starts with a capital letter
>>> and ends with garbage.
>>>
>>> The implementation is here
>>> http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/src/morphology/LttoolboxAnalyzer.cpp?revision=3182&view=markup
>>>
>>> and test cases here
>>> http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/python/ApertiumIcelandicTest.py?revision=3183&view=markup
>>>
>>> I was able to get all test cases except the one with TODO in method name
>>> implemented. How would you suggest fixing the code so that all tests would
>>> pass? Of course a patch would be most welcome :)
>>
>> Hmm, strangely enough, when I try an unknown word I get similar strange
>> output:
>>
>> $ ./test mor.bin
>> ^Reykjanghfghesi$ -->
>> ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
>
> Seems to be a bug with partly-matching regexes in the biltrans
> functions.
>
> Testing the different functions, I get:
>
> biltransWithQueue:
> ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
> qSize: 0
> biltransWithoutQueue:
> ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
> biltrans:
> ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
> biltransfull: ^$
>
> But, if I comment out the two regex entries
>
> <e> <par n="persons"/></e>
> <e> <par n="organisations"/></e>
>
> at the end of apertium-is-en.is.dix, I get
>
> biltransWithQueue: @Reykjanghfghesi qSize: 0
> biltransWithoutQueue: @Reykjanghfghesi
> biltrans: @Reykjanghfghesi
> biltransfull: @Reykjanghfghesi
>
> Similarly on the command line with lt-proc -b (while regular lt-proc -a
> returns unknown, as it should – the persons/organisations regexes don't
> fully match either).
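For anyone following along, the symptom quoted above (a prefix of the word gets analysed and the garbage tail is silently dropped) can be mimicked with a toy lookup. This is only a sketch of the general idea, not how lttoolbox works internally: the dictionary, `longest_prefix_analyses`, `lookup_buggy` and `lookup_strict` are all made-up names, and the real transducer traversal (and the regex entries involved) is more complicated.

```python
# Toy lexicon (hypothetical); the analyses are copied from the output quoted above.
LEXICON = {
    "Reykja": [
        "Reykja<vblex><actv><inf>",
        "Reykja<vblex><actv><pri><p3><pl>",
        "Reykur<n><m><pl><gen><ind>",
    ],
}


def longest_prefix_analyses(word, lexicon):
    """Return (analyses, consumed) for the longest lexicon key that prefixes word."""
    for end in range(len(word), 0, -1):
        hit = lexicon.get(word[:end])
        if hit is not None:
            return hit, end
    return [], 0


def lookup_buggy(word, lexicon=LEXICON):
    """Accept any matching prefix and ignore whatever input is left over."""
    analyses, _ = longest_prefix_analyses(word, lexicon)
    if analyses:
        return "^" + "/".join(analyses) + "$"
    return "@" + word


def lookup_strict(word, lexicon=LEXICON):
    """Accept a match only if the whole word was consumed; otherwise mark it unknown."""
    analyses, consumed = longest_prefix_analyses(word, lexicon)
    if analyses and consumed == len(word):
        return "^" + "/".join(analyses) + "$"
    return "@" + word


if __name__ == "__main__":
    print(lookup_buggy("Reykjanghfghesi"))   # analyses for "Reykja", tail dropped
    print(lookup_strict("Reykjanghfghesi"))  # marked unknown with "@"
```

The only difference between the two functions is the check that the match consumed the entire input, which is the behaviour one would expect from a lookup that should return unknown for words ending in garbage.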
I put up a patch at
http://bugs.apertium.org/cgi-bin/bugzilla/show_bug.cgi?id=131
which solves this for both lt-proc -b and biltransWithQueue. Please test.
I haven't tried it with the other biltrans* functions (I can't see that
they're actually used in the rest of Apertium, so I'm not sure what
they're there for).

It also fixes a problem where superfluous characters after tags would
pass as matches in lt-proc -b (this bug was not present in
biltransWithQueue). It's still possible to carry over _tags_ after the
analysis, of course.

I guess it's not strange that this bug was here, since normally you
never have words without tags in a bidix, but when using these functions
on a monodix it of course becomes a problem. (And, although it's not
recommended, if people really do want to have non-tagged lemmas in a
bidix, lttoolbox should at least not give analyses for lemmas that are
_not_ in the bidix.)

best regards,
Kevin Brubeck Unhammer

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
