mansur <6688000-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> Some examples of Apertium's tagger messing with lines.
>
> Original:
> Китаплар да, кешеләр
> дә кайтты.
>
> Аңа ярдәм
> итәргә кирәк.
>
> Output lines where partial merging occurred:
> ^Китаплар да/Китап<n><pl><nom>+да<cnjcoo>$^,/,<cm>$ ^кешеләр
> дә/кеше<n><pl><nom>+да<cnjcoo>$
> ^кайтты/кайт<v><tv><ifi><p3><sg>$^./.<sent>$
>
> ^Аңа/Ул<prn><dem><dat>$ ^ярдәм итәргә/ярдәм ит<v><tv><inf>$
> ^кирәк/кирәк<n><sg><nom>+и<cop><aor><p3><sg>$^./.<sent>$
>
> It is very difficult to find such cases in the big corpus.
>
> Best!
> Mansur
OK, so this isn't actually two lines getting merged into one (that's why the wc -l is the same), but a multiword where the latter part is moved before the linebreak so it can actually be part of the analysis, i.e. кешеләр дә on two lines gets the analysis

^кешеләр дә/кеше<n><pl><nom>+да<cnjcoo>$

where the linebreak is output *after* the analysis.

Do you not want the multiword analysis here? In that case, putting some noise like .@#@ at the end of lines should work, assuming you have no multiwords with those characters. (When doing translation, though, the period at least should get an analysis, since unanalysed noise can get moved around or deleted by transfer rules.)

The NUL solution also works, but it seems the tools expect the NUL to come after a superblank like [][\n], so

$ sed 's/proc /proc -z /g' modes/tat-mansur.mode >modes/tat-mansur-z.mode
$ cat /tmp/test \
    | tr -d '\0' \
    | apertium-deshtml -n \
    | sed 's/\[$/[][/; s/^]/]\x00/' \
    | sh modes/tat-mansur-z.mode \
    | tr -d '\0' \
    | apertium-rehtml-noent
^Китаплар да/Китап<n><pl><nom>+да<cnjcoo>$^,/,<cm>$ ^кешеләр/кеше<n><pl><nom>$
^дә/да<cnjcoo>$ ^кайтты/кайт<v><tv><ifi><p3><sg>$^./.<sent>$

^Аңа/Ул<prn><dem><dat>$ ^ярдәм/ярдәм<n><sg><nom>$
^итәргә/ит<v><tv><inf>$ ^кирәк/кирәк<n><sg><nom>+и<cop><aor><p3><sg>$^./.<sent>$

Maybe it'd make sense to have that as an option to apertium-destxt or similar? So "apertium -f lines -d . tat-mansur" would add the -z flags and run with a NUL on each line, making the tools treat each line separately, as if you'd just typed 'echo "$line"|apertium -d . tat-mansur' for every line.
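Until something like that exists, the "as if you'd typed it for every line" behaviour can be had the slow way with a plain shell loop. A minimal sketch: per_line is a made-up helper name, not an Apertium tool, and it starts the whole pipeline once per line, so it is only practical for small files or for checking what the NUL pipeline should give.

```shell
#!/bin/sh
# per_line (hypothetical helper): run the given command once for each
# line of stdin, so no analysis can ever span a linebreak.
per_line() {
    # IFS= and -r keep leading blanks and backslashes intact;
    # the || test also handles a final line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line" | "$@"
    done
}

# Intended use, assuming the mode from this thread is installed:
#   per_line apertium -d . tat-mansur < /tmp/test
```

Since each line goes through its own process, a multiword like кешеләр дә split over two lines can't be joined, which is the same result the -z pipeline above gives, just much slower.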
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff