On Fri, 02 Sep 2011 at 13:24 +0000, Francis Tyers wrote:
> On Fri, 02 Sep 2011 at 11:13 +0200, Kevin Brubeck Unhammer wrote:
> > Kevin Brubeck Unhammer <[email protected]> writes:
> >
> > > Francis Tyers <[email protected]> writes:
> > >
> > >> On Sun, 28 Feb 2010 at 21:40 +0200, Harri Pitkänen wrote:
> > >>> On Sunday 28 February 2010, Francis Tyers wrote:
> > >>> > > I don't know Icelandic at all and therefore can't tell whether
> > >>> > > some of the words are accepted or rejected incorrectly.
> > >>> >
> > >>> > Nice, it looks good. Some of the capitalised words should be
> > >>> > recognised correctly, at least 'Bretlandi' and 'Norðmenn'.
> > >>> >
> > >>> I tried to fix the checking of capitalized words but started to run
> > >>> into problems. It seems that the library API works in somewhat
> > >>> surprising (at least to me) ways when you enter a word that starts
> > >>> with a capital letter and ends with garbage.
> > >>>
> > >>> The implementation is here
> > >>> http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/src/morphology/LttoolboxAnalyzer.cpp?revision=3182&view=markup
> > >>>
> > >>> and test cases here
> > >>> http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/python/ApertiumIcelandicTest.py?revision=3183&view=markup
> > >>>
> > >>> I was able to get all test cases except the one with TODO in method
> > >>> name implemented. How would you suggest fixing the code so that all
> > >>> tests would pass? Of course a patch would be most welcome :)
> > >>
> > >> Hmm, strangely enough, when I try an unknown word I get similar strange
> > >> output:
> > >>
> > >> $ ./test mor.bin
> > >> ^Reykjanghfghesi$ -->
> > >> ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
> > >
> > > Seems to be a bug with partly-matching regexes in the biltrans
> > > functions.
> > >
> > > Testing the different functions, I get:
> > >
> > > biltransWithQueue:
> > > ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
> > > qSize: 0
> > > biltransWithoutQueue:
> > > ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
> > > biltrans:
> > > ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
> > > biltransfull: ^$
> > >
> > > But, if I comment out the two regex entries
> > >
> > > <e> <par n="persons"/></e>
> > > <e> <par n="organisations"/></e>
> > >
> > > at the end of apertium-is-en.is.dix, I get
> > >
> > > biltransWithQueue: @Reykjanghfghesi qSize: 0
> > > biltransWithoutQueue: @Reykjanghfghesi
> > > biltrans: @Reykjanghfghesi
> > > biltransfull: @Reykjanghfghesi
> > >
> > > Similarly on the command line with lt-proc -b (while regular lt-proc -a
> > > returns unknown, as it should – the persons/organisations regexes don't
> > > fully match either).
> >
> > I put a patch up at
> > http://bugs.apertium.org/cgi-bin/bugzilla/show_bug.cgi?id=131 which
> > solves this for both lt-proc -b and biltransWithQueue. Please test.
> >
> > I haven't tried with the other biltrans* functions (I can't see that
> > they're actually used in the rest of Apertium, so I'm not sure what
> > they're there for).
> >
> > It also fixes a problem where superfluous characters after tags would
> > pass as matches in lt-proc -b (this bug was not present in
> > biltransWithQueue). It's still possible to carry over _tags_ after the
> > analysis of course.
> >
> >
> > I guess it's not strange that this bug was here, since normally you
> > never have words without tags in bidix, but when using these functions
> > on a monodix it of course becomes a problem. (And, although it's not
> > recommended, if people really do want to have non-tagged lemmas in
> > bidix, lttoolbox should at least not give analyses for lemmas that are
> > _not_ in the bidix.)
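To make the failure mode concrete, here is a toy Python model of the difference between accepting a partial match and requiring the whole word to match. It is only an illustration under stated assumptions: the dict TOY_DICT and the functions lookup_prefix and lookup_full are hypothetical names, not lttoolbox code, and the real transducer traversal is more involved. Unknown words come back with an "@" prefix, mirroring the output quoted above.

```python
# Toy model only: a plain dict stands in for the compiled dictionary.
# Nothing here is taken from lttoolbox; all names are hypothetical.
TOY_DICT = {
    "Reykja": [
        "Reykja<vblex><actv><inf>",
        "Reykja<vblex><actv><pri><p3><pl>",
        "Reykur<n><m><pl><gen><ind>",
    ],
}


def format_analyses(analyses):
    """Join analyses in the ^a/b/c$ stream style seen in the thread."""
    return "^" + "/".join(analyses) + "$"


def lookup_prefix(word):
    """Buggy behaviour: accept the longest entry that is a prefix of word."""
    for end in range(len(word), 0, -1):
        analyses = TOY_DICT.get(word[:end])
        if analyses:
            return format_analyses(analyses)
    return "@" + word


def lookup_full(word):
    """Intended behaviour: only accept an entry that covers the whole word."""
    analyses = TOY_DICT.get(word)
    if analyses:
        return format_analyses(analyses)
    return "@" + word
```

In this toy model, lookup_prefix("Reykjanghfghesi") returns the analyses of "Reykja" even though the trailing characters were never matched, while lookup_full("Reykjanghfghesi") reports the word as unknown.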
> >
> >
> > best regards,
> > Kevin Brubeck Unhammer
>
> Looks good to me, and to Jim. We suggest commit and close. I'm going to
> do one final test, running a corpus with lt-proc -b before and after the
> patch and see if there are any differences. I'll report back soon.
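A before/after check of this kind can also be scripted. The Python sketch below is one possible way to compare two saved outputs line by line; compare_outputs is a hypothetical helper name, and it assumes both files are UTF-8 text produced by running lt-proc -b over the same corpus with the old and the patched build.

```python
from itertools import zip_longest


def compare_outputs(old_path, new_path):
    """Compare two text files line by line.

    Returns (lines_old, lines_new, first_diff), where first_diff is the
    1-based number of the first line that differs, or None if the files
    have identical lines.
    """
    lines_old = 0
    lines_new = 0
    first_diff = None
    with open(old_path, encoding="utf-8") as old, \
            open(new_path, encoding="utf-8") as new:
        # zip_longest pads the shorter file with None, so a length
        # mismatch is reported as a difference as well.
        for number, (a, b) in enumerate(zip_longest(old, new), start=1):
            if a is not None:
                lines_old += 1
            if b is not None:
                lines_new += 1
            if first_diff is None and a != b:
                first_diff = number
    return lines_old, lines_new, first_diff
```

Equal line counts together with a first_diff of None correspond to cmp printing nothing. Note that this sketch compares decoded lines rather than raw bytes, and it counts a final line without a trailing newline, which wc -l does not.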

$ wc -l /tmp/ca-BILTRANS.*
  376857 /tmp/ca-BILTRANS.new
  376857 /tmp/ca-BILTRANS.old
  753714 total
$ cmp /tmp/ca-BILTRANS.old /tmp/ca-BILTRANS.new

No changes in ca->en over 376857 lines of the Catalan Wikipedia.

Fran

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
