Kevin Brubeck Unhammer <[email protected]> writes:

> Francis Tyers <[email protected]> writes:
>
>> On Sun, 28 Feb 2010 at 21:40 +0200, Harri Pitkänen wrote:
>>> On Sunday 28 February 2010, Francis Tyers wrote:
>>> > > I don't know Icelandic at all and therefore can't tell whether some of
>>> > > the  words are accepted or rejected incorrectly.
>>> > 
>>> > Nice, it looks good. Some of the capitalised words should be recognised
>>> > as correct, at least 'Bretlandi' and 'Norðmenn'.
>>> 
>>> I tried to fix the checking of capitalized words but started to run into
>>> problems. It seems that the library API works in somewhat surprising (at
>>> least to me) ways when you enter a word that starts with a capital letter
>>> and ends with garbage.
>>> 
>>> The implementation is here
>>> http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/src/morphology/LttoolboxAnalyzer.cpp?revision=3182&view=markup
>>> 
>>> and test cases here
>>> http://voikko.svn.sourceforge.net/viewvc/voikko/trunk/libvoikko/python/ApertiumIcelandicTest.py?revision=3183&view=markup
>>> 
>>> I was able to get all test cases except the one with TODO in its method
>>> name implemented. How would you suggest fixing the code so that all tests
>>> would pass? Of course a patch would be most welcome :)
>>
>> Hmm, strangely enough, when I try an unknown word I get similar strange
>> output:
>>
>> $ ./test mor.bin 
>> ^Reykjanghfghesi$ -->
>> ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
>
> Seems to be a bug with partly-matching regexes in the biltrans
> functions.
>
> Testing the different functions, I get:
>
>     biltransWithQueue: ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$ qSize: 0
>     biltransWithoutQueue: ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
>     biltrans: ^Reykja<vblex><actv><inf>/Reykja<vblex><actv><pri><p3><pl>/Reykur<n><m><pl><gen><ind>$
>     biltransfull: ^$
>
> But, if I comment out the two regex entries
>
>     <e>                      <par n="persons"/></e>
>     <e>                      <par n="organisations"/></e>
>
> at the end of apertium-is-en.is.dix, I get
>
>     biltransWithQueue: @Reykjanghfghesi qSize: 0
>     biltransWithoutQueue: @Reykjanghfghesi
>     biltrans: @Reykjanghfghesi
>     biltransfull: @Reykjanghfghesi
>
> Similarly on the command line with lt-proc -b (while regular lt-proc -a
> returns unknown, as it should, since the persons/organisations regexes
> don't fully match either).

I put a patch up at
http://bugs.apertium.org/cgi-bin/bugzilla/show_bug.cgi?id=131 which
solves this for both lt-proc -b and biltransWithQueue. Please
test.
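
To make the partial-match problem concrete, here is a minimal Python sketch
of the two behaviours. It is not the patch and not lttoolbox code: TOY_DIX,
lookup_prefix and lookup_full are hypothetical names, and the sketch assumes
that the analyses quoted above came from an entry "Reykja" matching only the
start of the input.

```python
# Toy stand-in for a compiled dictionary: surface form -> analyses.
# The analyses are copied from the output quoted in this thread; the
# dict-based lookup itself is an assumption made for illustration only.
TOY_DIX = {
    "Reykja": [
        "Reykja<vblex><actv><inf>",
        "Reykja<vblex><actv><pri><p3><pl>",
        "Reykur<n><m><pl><gen><ind>",
    ],
}


def lookup_prefix(word, dix=TOY_DIX):
    """Buggy behaviour: accept the longest entry that is a prefix of word."""
    for end in range(len(word), 0, -1):
        if word[:end] in dix:
            return "^" + "/".join(dix[word[:end]]) + "$"
    return "@" + word  # unknown-word marker, as in the output above


def lookup_full(word, dix=TOY_DIX):
    """Intended behaviour: accept only if the whole word is in the dictionary."""
    if word in dix:
        return "^" + "/".join(dix[word]) + "$"
    return "@" + word


if __name__ == "__main__":
    print(lookup_prefix("Reykjanghfghesi"))  # analyses of "Reykja" leak through
    print(lookup_full("Reykjanghfghesi"))    # @Reykjanghfghesi
```

With the prefix lookup, the trailing garbage is silently dropped; with the
full lookup, the word is reported as unknown, which is the result one would
expect here.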

I haven't tried with the other biltrans* functions (I can't see that
they're actually used in the rest of Apertium, so I'm not sure what
they're there for).

It also fixes a problem where superfluous characters after tags would
pass as matches in lt-proc -b (this bug was not present in
biltransWithQueue). It's still possible to carry over _tags_ after the
analysis of course.
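
As a rough illustration of that rule (again a sketch under assumptions, not
the code in the patch): whatever follows the matched entry should be either
empty or made up only of tags. The names ONLY_TAGS and leftover_ok are
hypothetical, and the tag syntax is assumed to be the usual <...> form.

```python
import re

# Zero or more tags such as <pl><gen>, and nothing else.
ONLY_TAGS = re.compile(r"(?:<[^<>]+>)*")


def leftover_ok(rest):
    """Return True if the text after the matched entry is empty or only tags."""
    return ONLY_TAGS.fullmatch(rest) is not None


if __name__ == "__main__":
    print(leftover_ok("<pl><gen>"))   # True: tags may be carried over
    print(leftover_ok("nghfghesi"))   # False: stray characters
    print(leftover_ok("<pl>x"))       # False: characters after a tag
```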


I guess it's not strange that this bug was here, since normally you
never have words without tags in bidix, but when using these functions
on a monodix it of course becomes a problem. (And, although it's not
recommended, if people really do want to have non-tagged lemmas in
bidix, lttoolbox should at least not give analyses for lemmas that are
_not_ in the bidix.)


best regards,
Kevin Brubeck Unhammer


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
