Hello! It was quite difficult to find the few cases where lines get merged in a file containing tens of millions of lines overall. It seems that this time the cause is not apertium's tagger, as it was last time. These cases happened because the lines contained some non-printable characters that my other scripts removed...
It seems that Fran's recommendation worked: apertium doesn't merge lines that have that line ending. I modified it slightly for convenience. Before passing the text to the tagger, I append a marker to each line:

    sed -r 's/$/ __@@__@@__ @\.@#\.#/'

After tagging is done, I remove it like this:

    sed -r 's/ *__@@__@@__.*$//'

What I'd like to ask you is: could this marker affect how the tagger tags the other words in a sentence? Could it change the POS that the tagger would otherwise have guessed for them?

Thank you!
Mansur

On Mon, 5 Nov 2018 at 12:54, Kevin Brubeck Unhammer <unham...@fsfe.org> wrote:

> mansur <6688...@gmail.com> wrote:
>
> > Hello!
> >
> > 1) I tried all the solutions recommended here to avoid merging lines, but
> > nothing helped... The only thing I didn't try yet is apertium-apy, but
> > Kevin said this way is at least 4 times slower.
>
> With the tat-mansur mode (git pull && make) I get the same amount of
> lines for the txt files in dev:
>
> $ cat dev/*.txt|wc -l
> 3866
> $ cat dev/*.txt |apertium -d . tat-mansur |wc -l
> 3866
>
> Can you try to figure out where in your test corpus this happens, and
> give a minimal example?
>
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
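
One way to look at the question in the message above empirically is to tag the same small sample twice, once plain and once with the marker added and stripped, and diff the two results. The following is only a sketch under assumptions: `tagger` is a stand-in (plain `cat`) so the script runs without Apertium, the sample text is hypothetical, and the two `sed` commands are the ones quoted in the message.

```shell
#!/bin/sh
# Stand-in for the real tagger so the sketch runs without Apertium.
# Assumption: replace the body with the real command,
# e.g. apertium -d . tat-mansur
tagger() { cat; }

# The two sed commands from the message (GNU sed, -r for extended regex).
add_marker()   { sed -r 's/$/ __@@__@@__ @\.@#\.#/'; }
strip_marker() { sed -r 's/ *__@@__@@__.*$//'; }

# Hypothetical sample input; use a small slice of the real corpus instead.
sample=$(mktemp)
plain=$(mktemp)
marked=$(mktemp)
printf 'first line\nsecond line\nthird line\n' > "$sample"

# Run 1: tag the text as it is.
tagger < "$sample" > "$plain"

# Run 2: add the marker, tag, then strip the marker again.
add_marker < "$sample" | tagger | strip_marker > "$marked"

# The marker is meant to keep the line count stable.
lines_in=$(wc -l < "$sample" | tr -d ' ')
lines_out=$(wc -l < "$marked" | tr -d ' ')
echo "lines in: $lines_in, lines out: $lines_out"

# Any difference between the two runs is a place where the marker
# (or a line merge in the plain run) changed the output.
if diff "$plain" "$marked" > /dev/null; then
    result="identical"
else
    result="different"
fi
echo "tagging with and without marker: $result"

rm -f "$sample" "$plain" "$marked"
```

With the real tagger, the sample should be one where the plain run does not merge lines; otherwise the diff will also show the merges and not only tagging differences caused by the marker.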