Hello!

It was quite difficult to find the few cases where lines get merged
in a file containing tens of millions of lines overall. It seems
that this time the cause is not apertium's tagger, as it was the previous
time. These cases happened because there were some non-printable characters
that my other scripts removed...
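
A minimal sketch of how such lines might be located before any cleanup
script touches them. It assumes the offending characters are ASCII control
characters, which the message does not confirm; count_ctrl_lines and
list_ctrl_lines are hypothetical helper names. Note that a tab also counts
as a control character here.

```shell
# Count the input lines that contain at least one ASCII control character.
# LC_ALL=C makes grep work on bytes, so [[:cntrl:]] means 0x00-0x1F and 0x7F.
count_ctrl_lines() {
  LC_ALL=C grep -c '[[:cntrl:]]'
}

# Print the same lines with their line numbers, for manual inspection.
list_ctrl_lines() {
  LC_ALL=C grep -n '[[:cntrl:]]'
}
```

For example, `list_ctrl_lines < corpus.txt | head` (corpus.txt being a
placeholder name) would show the first few suspicious lines.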

It seems that Fran's recommendation worked, and apertium doesn't merge
lines that have that line ending. I modified it slightly for convenience:

Before passing the text to the tagger, I append a marker to every line:
sed -r 's/$/ __@@__@@__ @\.@#\.#/'

And remove it this way after tagging is done:
sed -r 's/ *__@@__@@__.*$//'
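
The two commands above can be wrapped into a small pipeline, sketched here
under a few assumptions: add_marker and strip_marker are hypothetical
function names, corpus.txt and tagged.txt are placeholder file names, and
the tagger invocation in the comment reuses the tat-mansur mode quoted
further down in this thread.

```shell
# Append the marker to the end of every line (same sed command as above).
add_marker() {
  sed -r 's/$/ __@@__@@__ @\.@#\.#/'
}

# Remove the marker and everything after it once tagging is done.
strip_marker() {
  sed -r 's/ *__@@__@@__.*$//'
}

# Hypothetical usage:
#   add_marker < corpus.txt | apertium -d . tat-mansur | strip_marker > tagged.txt
#   wc -l corpus.txt tagged.txt   # the two line counts should match
```

Comparing the line counts before and after is the same check Kevin uses
below with `wc -l`.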

What I want to ask you is: does this appended part somehow affect how the
tagger tags the other words in a sentence? Could it change the POS that
the tagger chooses for other words?
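
One way to check this empirically, sketched under assumptions: tag a small
sample both with and without the marker and diff the results.
compare_tagging is a hypothetical helper and sample.txt is a placeholder.
The sample should be one where lines are not merged without the marker,
otherwise the diff will also show the merging itself.

```shell
# compare_tagging FILE CMD [ARGS...]
# Runs CMD on FILE twice, once unchanged and once with the marker appended
# to every line and stripped again afterwards, then diffs the two outputs.
# Returns diff's exit status: 0 means the marker changed nothing.
compare_tagging() {
  input=$1
  shift
  plain=$(mktemp)
  marked=$(mktemp)
  "$@" < "$input" > "$plain"
  sed -r 's/$/ __@@__@@__ @\.@#\.#/' < "$input" \
    | "$@" \
    | sed -r 's/ *__@@__@@__.*$//' > "$marked"
  diff "$plain" "$marked"
  status=$?
  rm -f "$plain" "$marked"
  return $status
}

# Hypothetical usage:
#   compare_tagging sample.txt apertium -d . tat-mansur
```

Any line that shows up in the diff is one where the marker changed the
tagger's output for the rest of the sentence.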

Thank you!
Mansur

On Mon, 5 Nov 2018 at 12:54, Kevin Brubeck Unhammer <
unham...@fsfe.org> wrote:

> mansur <6688...@gmail.com> wrote:
>
> > Hello!
> >
> > 1) I tried all the solutions recommended here to avoid merging lines, but
> > nothing helped... The only thing I didn't try yet is apertium-apy, but
> > Kevin said this way is at least 4 times slower.
>
> With the tat-mansur mode (git pull && make) I get the same number of
> lines for the txt files in dev:
>
> $ cat dev/*.txt|wc -l
> 3866
> $ cat dev/*.txt |apertium -d . tat-mansur |wc -l
> 3866
>
> Can you try to figure out where in your test corpus this happens, and
> give a minimal example?
>
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>