Hi,
in parse-de-bitpar.perl the code sequence
    while(<STDIN>)
    {
        foreach (split)
        {
            s/\(/\*LRB\*/g;
            s/\)/\*RRB\*/g;
            print TMP $_."\n";
        }
        print TMP "\n";
    }
adds a newline after each single word. Is this required? To me it looks
like bitpar parses sentences on a single line just fine. I'm asking
because this behavior causes trouble with my data down the line:
Annotating my English (source language) corpus with bitpar (while
keeping the French target corpus plain) adds empty lines to the
annotated English source. This puts the source and target files out of sync.
The root cause seems to be that parse-de-bitpar.perl internally adds a
newline after each word before feeding the text to bitpar. In addition,
iconv may eliminate certain characters, which leaves empty lines that are
eventually interpreted as sentence breaks.
An (admittedly very ugly) segment like:
" you have been invited to community , collection1 by user1 ” , “
message from ” , and “ please use the following url to access the
community .
gets parsed without any obvious error by bitpar when I feed it directly,
or even after first being filtered through iconv. However, within
parse-de-bitpar.perl it first gets converted into:
"
you
have
been
invited
to
community
,
collection1
by
user1
,
message
[...]
Bitpar parses this into 5 sentences:
(TOP (X/domV (NP/base (CD \"))(SBAR/0 (-NONE-(0))(S/fin
(NP-SBJ/n3s/base+\#?NPSBJ? (PRP/n3s you) [...]
No parse for: ","
No parse for: "message from"
No parse for: ", and"
(TOP (S/fin/. (NP-SBJ/n3s/base+\#?NPSBJ? (NN please))(VP/n3s_?NPSBJ?
(VVP/nst use)(NP/base (DT/the the)(JJ following)[...]
parse-de-bitpar.perl turns each "No parse for" message into an empty line.
Since 1 sentence gets unfolded into 5 lines, the annotated English source
and the unannotated French target get out of sync.
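To illustrate what I think is happening, here is a minimal sketch, not the actual code of parse-de-bitpar.perl. It assumes that a blank line in the one-token-per-line input is read as a sentence break, as described above. The helper names (sentence_to_block, count_blocks, drop_inner_blank_lines) are hypothetical, and the token X is only a stand-in for a character that a conversion step such as iconv might remove.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the loop quoted above: one token per line,
# parentheses escaped, and a blank line closing the sentence.
sub sentence_to_block {
    my ($sentence) = @_;
    my $block = '';
    foreach my $token (split ' ', $sentence) {
        $token =~ s/\(/*LRB*/g;
        $token =~ s/\)/*RRB*/g;
        $block .= "$token\n";
    }
    return $block . "\n";
}

# Hypothetical helper: count runs of non-empty lines, i.e. the number of
# "sentences" a reader sees if it treats blank lines as sentence breaks.
sub count_blocks {
    my ($text) = @_;
    my ($count, $inside) = (0, 0);
    foreach my $line (split /\n/, $text) {
        if ($line =~ /\S/) {
            $count++ unless $inside;
            $inside = 1;
        } else {
            $inside = 0;
        }
    }
    return $count;
}

# Possible guard (an assumption, not existing behavior): remove blank lines
# inside a block so that one input sentence stays one block.
sub drop_inner_blank_lines {
    my ($block) = @_;
    my @kept = grep { /\S/ } split /\n/, $block;
    return join('', map { "$_\n" } @kept) . "\n";
}

# X stands in for a character that gets removed after tokenization.
my $block = sentence_to_block('by user1 X , X message from');
(my $damaged = $block) =~ s/^X$//mg;    # the token vanishes, its line stays
my $repaired = drop_inner_blank_lines($damaged);

printf "intact: %d, damaged: %d, repaired: %d\n",
    count_blocks($block), count_blocks($damaged), count_blocks($repaired);
```

In this sketch the single input sentence comes out as three blocks once the two X tokens are emptied, and as one block again after the guard. The guard only restores the line count. The removed tokens are still lost, so whether this is acceptable depends on the data.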
Any comments are welcome.
Best regards,
Christof
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support