It turns out that what disappears is the last token in the Apertium sense, whether it is a word or punctuation: just the last unit on the line, such as ^./.<sent>$ or ^word/lemma<pos><tag1><tag2>$
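A rough way to check this per line, as a sketch only: the function names below (tagger_stage, nul_roundtrip, last_tokens) are made up for illustration, and tagger_stage is a stand-in that just copies its input, to be replaced with the real lt-proc -z / cg-proc -z stages. It prints the last ^...$ unit of every output line, or _ where a line has none, so two runs can be compared line by line.

```shell
#!/bin/sh
# Hypothetical stand-in for the lt-proc -z / cg-proc -z stages. It only copies
# stdin to stdout so that the sketch runs without Apertium installed.
tagger_stage() {
    cat
}

# Delete stray NULs, turn newlines into NULs, run the stage, turn NULs back.
nul_roundtrip() {
    tr -d '\0' | tr '\n' '\0' | tagger_stage | tr '\0' '\n'
}

# Print the last ^...$ unit found on each line, or "_" if the line has none.
# Multiword units such as "^рам да/рам<n>$" are kept whole.
last_tokens() {
    awk '{
        tok = "_"
        s = $0
        while (match(s, /\^[^$]*\$/)) {
            tok = substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)
        }
        print tok
    }'
}
```

Running `nul_roundtrip < corpus.txt | last_tokens` once with the tr-based pipeline and once with the pipeline that gives correct output, then diffing the two results, should show exactly which lines lose their final unit.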

On Wed, 7 Nov 2018 at 19:02, mansur <6688...@gmail.com> wrote:

> Hello!
>
> It doesn't work for me:
>
> ><px3sp><nom>+да<cnjcoo>$ ^бит/бит<mod_ass>$
> _
> ^ул/бул<v><tv><imp><p2><sg>$^,/,<cm>$ ^театраль/театраль<adj>$
> ^жест/*жест$ ^ясап/яса<v><tv><gna_perf>$
> ^-/-<guio>$ ^Синнән/Син<prn><pers><p2><sg><abl>$
> ^сорап/сора<v><tv><prc_perf>$ ^торырмын/тор<vaux><fut><p1><sg>$
> ^Барлык/Барлык<det><qnt>$ ^иптәшләрдән/иптәш<n><pl><abl>$
> ^кул/кул<n><sg><nom>+и<cop><aor><p3><sg>$ ^куйды/куй<v><tv><ifi><p3><sg>$
> ^рам да/рам<n><sg><nom>+да<cnjcoo>$ ^тикшерү/тикшерү<n><sg><attr>$
> ^органнарына/орган<n><pl><px3sp><dat>$
> _
> _
> _
>
> Problems are where we see _ symbol. In the end 3 new lines. And almost
> each line loses last character or even words (it should be "рам да тикшерү
> органнарына тапшыра").
>
> By the way, rules:
> tr '\n' '\0' |
> apertium-destxt -n |
> lt-proc -z -w 'apertium-tat/tat.automorf.bin' |
> cg-proc -z 'apertium-tat/tat.rlx.bin' |
> cg-proc -z -w -1 'apertium-tat/dev/mansur.bin' |
> tr '\0' '\n' |
> apertium-retxt |
>
> Replacing these 'tr' commands with previous recommendations from Fran
> gives correct output.
>
> On Tue, 6 Nov 2018 at 22:45, Francis Tyers <fty...@prompsit.com> wrote:
>
>> On 2018-11-06 20:36, Kevin Brubeck Unhammer wrote:
>> > Francis Tyers <fty...@prompsit.com> wrote:
>> >
>> >> Yes it does. It will put a sentence boundary after every word, meaning
>> >> that you won't get reliable tagger output. Apertium as far as I know
>> >> has no way to treat sentences as a sequence of lines. This is because
>> >> of how the format handling works.
>> >>
>> >> I think it would really be an excellent feature though. Perhaps a
>> >> GitHub issue? I do however think it would involve messing with quite a
>> >> bit of the pipeline.
>> >
>> > However, we *should* treat NUL as hard separators – if we don't,
>> > apertium-apy (and thus www.apertium.org) will risk sending output meant
>> > for person1 to person2. (I have an inkling there might still be bugs in
>> > apertium-transfer related to this.)
>> >
>> > Anyway, if we at least handle NUL's correctly in lt-proc and cg-proc,
>> > you could turn linebreak's into NUL's (first deleting any existing
>> > NUL's in the corpus) and tag with the -z option to lt-/cg-proc:
>> >
>> > cat corpus.txt \
>> > | tr -d '\0' \
>> > | tr '\n' '\0' \
>> > | apertium-deshtml -n \
>> > | lt-proc -z -w 'apertium-tat/tat.automorf.bin' \
>> > | cg-proc -z 'apertium-tat/tat.rlx.bin' \
>> > | cg-proc -z -w -1 'apertium-tat/dev/mansur.bin' \
>> > | tr '\0' '\n' \
>> > | apertium-rehtml-noent
>> >
>> > … finally turning NUL's back into newlines.
>> >
>> > With apertium-nob, this doesn't seem to run slower than without -z, and
>> > doesn't merge lines in my test corpus.
>> >
>>
>> Ooh, this is great, we should probably put this on the wiki!
>>
>> F.

_______________________________________________ Apertium-stuff mailing list Apertium-stuff@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/apertium-stuff