On Fri, 02 Sep 2011 at 13:24 +0000, Francis Tyers wrote:
> On Fri, 02 Sep 2011 at 11:13 +0200, Kevin Brubeck Unhammer wrote:
> > Kevin Brubeck Unhammer <[email protected]> writes:
> > 
> > > Francis Tyers <[email protected]> writes:
> > >
> > >> On Sun, 28 Feb 2010 at 21:40 +0200, Harri Pitkänen wrote:
> > >>> On Sunday 28 February 2010, Francis Tyers wrote:
> > >>> > > I don't know Icelandic at all and therefore can't tell whether some
> > >>> > > of the words are accepted or rejected incorrectly.
> > >>> > 
> > >>> > Nice, it looks good. Some of the capitalised words should be
> > >>> > recognised correctly, at least 'Bretlandi' and 'Norðmenn'.
> > >>> 
> > >>> I tried to fix the checking of capitalized words but started to run into
> > >>> problems. It seems that the library API works in somewhat surprising (at
> > >>> least to me) ways when you enter a word that starts with a capital letter
> > >>> and ends with garbage.
> > >>> 
> > >>> The implementation is here
> > >>> http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/src/morphology/LttoolboxAnalyzer.cpp?revision=3182&view=markup
> > >>> 
> > >>> and test cases here
> > >>> http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/python/ApertiumIcelandicTest.py?revision=3183&view=markup
> > >>> 
> > >>> I was able to get all test cases implemented except the one with TODO in
> > >>> its method name. How would you suggest fixing the code so that all tests
> > >>> would pass? Of course a patch would be most welcome :)
> > >>
> > >> Hmm, strangely enough, when I try an unknown word I get similar strange
> > >> output:
> > >>
> > >> $ ./test mor.bin 
> > >> ^Reykjanghfghesi$ -->
> > >> ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
> > >
> > > Seems to be a bug with partly-matching regexes in the biltrans
> > > functions.
> > >
> > > Testing the different functions, I get:
> > >
> > >     biltransWithQueue: 
> > > ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
> > >  qSize: 0
> > >     biltransWithoutQueue: 
> > > ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
> > >     biltrans: 
> > > ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
> > >     biltransfull: ^$
> > >
> > > But, if I comment out the two regex entries
> > >
> > >     <e>                      <par n="persons"/></e>
> > >     <e>                      <par n="organisations"/></e>
> > >
> > > at the end of apertium-is-en.is.dix, I get
> > >
> > >     biltransWithQueue: @Reykjanghfghesi qSize: 0
> > >     biltransWithoutQueue: @Reykjanghfghesi
> > >     biltrans: @Reykjanghfghesi
> > >     biltransfull: @Reykjanghfghesi
> > >
> > > Similarly on the command line with lt-proc -b (while regular lt-proc -a
> > > returns unknown, as it should – the persons/organisations regexes don't
> > > fully match either).
> > 
> > I put a patch up at
> > http://bugs.apertium.org/cgi-bin/bugzilla/show_bug.cgi?id=131 which
> > solves this for both lt-proc -b, as well as biltransWithQueue. Please
> > test.
> > 
> > I haven't tried with the other biltrans* functions (I can't see that
> > they're actually used in the rest of Apertium, so I'm not sure what
> > they're there for).
> > 
> > It also fixes a problem where superfluous characters after tags would
> > pass as matches in lt-proc -b (this bug was not present in
> > biltransWithQueue). It's still possible to carry over _tags_ after the
> > analysis of course.
> > 
> > 
> > I guess it's not strange that this bug was here, since normally you
> > never have words without tags in bidix, but when using these functions
> > on a monodix it of course becomes a problem. (And, although it's not
> > recommended, if people really do want to have non-tagged lemmas in
> > bidix, lttoolbox should at least not give analyses for lemmas that are
> > _not_ in the bidix.)
> > 
> > 
> > best regards,
> > Kevin Brubeck Unhammer
> 
> Looks good to me, and to Jim. We suggest commit and close. I'm going to
> do one final test, running a corpus with lt-proc -b before and after the
> patch to see if there are any differences. I'll report back soon.

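For anyone skimming the quoted thread, here is a toy sketch of the behaviour
being fixed. It is not lttoolbox code: the lexicon, the function names
lookup_prefix and lookup_full, and the shortened analysis list are all made up
for illustration. It only contrasts accepting a partial match with requiring
the whole word to match, and uses the '@' prefix seen in the biltrans output
above for unknown words.

```python
# Toy lexicon: surface form -> analyses. Hypothetical data, not the real
# transducer; only two of the analyses from the thread are included.
LEXICON = {
    "Reykja": ["Reykja<vblex><actv><inf>", "Reykur<n><m><pl><gen><ind>"],
}


def format_analyses(analyses):
    """Join analyses in the ^a/b$ style shown in the thread."""
    return "^" + "/".join(analyses) + "$"


def lookup_prefix(word):
    """Buggy variant: accept the longest entry that is a prefix of the word,
    silently dropping whatever input is left over."""
    for end in range(len(word), 0, -1):
        if word[:end] in LEXICON:
            return format_analyses(LEXICON[word[:end]])
    return "@" + word


def lookup_full(word):
    """Stricter variant: accept only if the whole word is in the lexicon."""
    if word in LEXICON:
        return format_analyses(LEXICON[word])
    return "@" + word


if __name__ == "__main__":
    print(lookup_prefix("Reykjanghfghesi"))
    print(lookup_full("Reykjanghfghesi"))
```

With the prefix variant the garbage tail is ignored and analyses for "Reykja"
come back; with the full-match variant the word is reported as unknown.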
$ wc -l /tmp/ca-BILTRANS.*
   376857 /tmp/ca-BILTRANS.new
   376857 /tmp/ca-BILTRANS.old
   753714 total

$ cmp /tmp/ca-BILTRANS.old /tmp/ca-BILTRANS.new

No changes in ca->en over 376857 lines of the Catalan Wikipedia.
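
If the outputs had differed, cmp would only give the first differing byte. A
small sketch like the following could list the first few differing lines
instead. The function names first_differences and compare_files are my own,
and UTF-8 input is an assumption; the paths would be the two files above.

```python
from itertools import zip_longest


def first_differences(old_lines, new_lines, limit=5):
    """Return up to `limit` (line_number, old, new) tuples where the two
    line sequences differ. A missing line is reported as None."""
    diffs = []
    pairs = zip_longest(old_lines, new_lines)
    for number, (old, new) in enumerate(pairs, start=1):
        if old != new:
            diffs.append((number, old, new))
            if len(diffs) >= limit:
                break
    return diffs


def compare_files(old_path, new_path, limit=5):
    """Compare two text files line by line (UTF-8 assumed)."""
    with open(old_path, encoding="utf-8") as old, \
            open(new_path, encoding="utf-8") as new:
        return first_differences(old, new, limit)
```

An empty list from compare_files("/tmp/ca-BILTRANS.old",
"/tmp/ca-BILTRANS.new") would correspond to the silent cmp above.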

Fran


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
