Hi, in parse-de-bitpar.perl the code sequence

    while(<STDIN>) {
      foreach (split) {
        s/\(/\*LRB\*/g;
        s/\)/\*RRB\*/g;
        print TMP $_."\n";
      }
      print TMP "\n";
    }

adds a newline after each single word. Is this required? To me it looks like bitpar parses sentences on a single line just fine.

I'm asking because this behavior causes trouble with my data down the line: annotating my English (source language) corpus with bitpar, while keeping the French target corpus plain, adds empty lines to the annotated English source. This brings the source and target files out of sync.

The root cause seems to be that parse-de-bitpar.perl internally adds a newline after each word before feeding the text to bitpar. In addition, iconv may eliminate certain characters, which leaves empty lines that are eventually interpreted as sentence breaks.

An (admittedly very ugly) segment like

    " you have been invited to community , collection1 by user1 ” , “ message from ” , and “ please use the following url to access the community .

gets parsed without any obvious error by bitpar when I feed it directly, or even after it has first been filtered through iconv. Within parse-de-bitpar.perl, however, it first gets converted into

    " you have been invited to community , collection1 by user1 , message [...]

which bitpar parses into 5 sentences:

    (TOP (X/domV (NP/base (CD \"))(SBAR/0 (-NONE-(0))(S/fin (NP-SBJ/n3s/base+\#?NPSBJ? (PRP/n3s you) [...]
    No parse for: ","
    No parse for: "message from"
    No parse for: ", and"
    (TOP (S/fin/. (NP-SBJ/n3s/base+\#?NPSBJ? (NN please))(VP/n3s_?NPSBJ? (VVP/nst use)(NP/base (DT/the the)(JJ following)[...]

parse-de-bitpar.perl changes the "No parse for" lines into empty lines. Since 1 sentence gets unfolded into 5 lines, the English source and the unannotated target get out of sync.

Any comments are welcome.

Best regards,
Christof

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
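
To make the suspected mechanism concrete, here is a minimal Python sketch; it is not code from parse-de-bitpar.perl. It assumes that bitpar reads one token per line and treats an empty line as a sentence boundary, and that the iconv step targets ISO-8859-1 and silently drops characters it cannot convert. Both assumptions are inferred from the behavior described above, and all function names are hypothetical.

```python
# Illustration only: reproduces the suspected sentence-splitting effect.
# Assumptions (not confirmed by the script itself): an empty line marks a
# sentence boundary, and unconvertible characters are dropped silently.

SEGMENT = ('" you have been invited to community , collection1 by user1 ” , '
           '“ message from ” , and “ please use the following url to access '
           'the community .')


def escape_brackets(token):
    """Same bracket substitution as the Perl loop."""
    return token.replace("(", "*LRB*").replace(")", "*RRB*")


def to_token_lines(sentence):
    """Mimic the Perl loop: one token per line, then one empty line."""
    lines = [escape_brackets(tok) for tok in sentence.split()]
    lines.append("")
    return lines


def drop_unconvertible(text, encoding="iso-8859-1"):
    """Stand-in for an iconv step that discards unconvertible characters."""
    return text.encode(encoding, errors="ignore").decode(encoding)


def split_sentences(lines):
    """Group token lines into sentences, using empty lines as boundaries."""
    sentences, current = [], []
    for line in lines:
        if line:
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def to_token_lines_safe(sentence, encoding="iso-8859-1"):
    """Convert each token first and skip tokens that end up empty."""
    lines = []
    for tok in sentence.split():
        tok = drop_unconvertible(tok, encoding)
        if not tok:
            continue  # would otherwise become a spurious sentence break
        lines.append(escape_brackets(tok))
    lines.append("")
    return lines


# Converting after the per-token split turns each curly quote into an empty line.
naive = [drop_unconvertible(line) for line in to_token_lines(SEGMENT)]
print(len(split_sentences(naive)))                         # 5
print(len(split_sentences(to_token_lines_safe(SEGMENT))))  # 1
```

If these assumptions hold, skipping tokens that become empty (or substituting a placeholder token for them) before the text reaches bitpar would keep one output line per input line, so source and target would stay aligned.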