Hi,

In parse-de-bitpar.perl, the code sequence

while(<STDIN>)
{
     foreach (split)
     {
         s/\(/\*LRB\*/g;
         s/\)/\*RRB\*/g;
         print TMP $_."\n";
     }
     print TMP "\n";
}

adds a newline after each single word. Is this required? To me it looks 
like bitpar parses sentences on a single line just fine. I'm asking 
because this behavior causes trouble with my data down the line:

Annotating my English (source language) corpus with bitpar (while 
keeping the French target corpus plain) adds empty lines to the 
annotated English source. This puts the source and target files out of 
sync.

The root cause seems to be that parse-de-bitpar.perl internally adds a 
newline after each word before feeding the text to bitpar. In addition, 
iconv may drop certain characters entirely. A token that consisted only 
of such characters becomes an empty line, which is then interpreted as 
a sentence break.
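
To illustrate what I mean, here is a minimal sketch of a possible 
workaround, not a tested patch. It drops tokens that would end up empty 
before they are written out, so that an empty line can only appear at 
the end of a sentence. The helper name sentence_to_bitpar_input is made 
up for this mail, and the character filter is only my assumption of what 
iconv removes (everything outside Latin-1); the real conversion in the 
script may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the sample sentence below contains literal curly quotes

# Hypothetical helper (not part of parse-de-bitpar.perl): turn one
# tokenized sentence into one-token-per-line text, followed by a single
# empty line as the sentence separator.
sub sentence_to_bitpar_input {
    my ($line) = @_;
    my @tokens;
    foreach my $tok (split ' ', $line) {
        # ASSUMPTION: iconv drops every character outside Latin-1.
        $tok =~ s/[^\x{00}-\x{FF}]//g;
        # Skip tokens that are now empty; they would otherwise be
        # written as an empty line, i.e. a spurious sentence break.
        next if $tok eq '';
        # Same bracket escaping as in the original loop.
        $tok =~ s/\(/*LRB*/g;
        $tok =~ s/\)/*RRB*/g;
        push @tokens, $tok;
    }
    return join('', map { "$_\n" } @tokens) . "\n";
}

binmode(STDOUT, ':encoding(UTF-8)');
my $sample = '" you have been invited to community , collection1 by user1 ” , “ message from ”';
print sentence_to_bitpar_input($sample);
```

In the script itself, the helper would be called once per input line 
inside the while(<STDIN>) loop, with the result printed to TMP. Note 
that silently dropping tokens changes the sentence, so replacing them 
with a placeholder token might be the better choice.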

An (admittedly very ugly) segment like:

" you have been invited to community , collection1 by user1 ” , “ 
message from ” , and “ please use the following url to access the 
community .

gets parsed by bitpar without any obvious error when I feed it directly, 
or even after filtering it through iconv first. However, within 
parse-de-bitpar.perl it first gets converted into:

"
you
have
been
invited
to
community
,
collection1
by
user1

,

message
[...]


Bitpar then parses this as 5 sentences:

(TOP (X/domV (NP/base (CD \"))(SBAR/0  (-NONE-(0))(S/fin 
(NP-SBJ/n3s/base+\#?NPSBJ? (PRP/n3s you) [...]
No parse for: ","
No parse for: "message from"
No parse for: ", and"
(TOP (S/fin/. (NP-SBJ/n3s/base+\#?NPSBJ? (NN please))(VP/n3s_?NPSBJ? 
(VVP/nst use)(NP/base (DT/the the)(JJ following)[...]


parse-de-bitpar.perl turns each "No parse for" message into an empty 
line. Since one sentence is thus unfolded into 5 lines, the annotated 
English source and the unannotated French target get out of sync.
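
For anyone who wants to check their own data for the same problem, a 
simple line count comparison is enough to detect it. This is just a 
sketch; the helper name count_lines and the file names in the usage 
comment are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the lines in a file.
sub count_lines {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my $n = 0;
    $n++ while <$fh>;
    close($fh);
    return $n;
}

# Usage (placeholder file names): perl check_sync.pl parsed.en corpus.fr
if (@ARGV == 2) {
    my ($src, $tgt) = map { count_lines($_) } @ARGV;
    my $msg = ($src == $tgt)
        ? "in sync: $src lines each"
        : "out of sync: $src source lines vs. $tgt target lines";
    print "$msg\n";
}
```

Equal line counts do not prove that the files are aligned, but 
different counts show that they are not.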

Any comments are welcome.

Best regards,
Christof



_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
