Hi, I see that tokenizer.perl now offers an option for excluding URLs and other expressions from tokenization: "-protect FILE ... specify file with patters to be protected in tokenisation." Unfortunately, there is no explanation of how this file should be formatted. I tried several ways of writing regular expressions for URLs, but the URLs still come out tokenized. Could you provide an example?
My second question concerns the -a option for aggressive hyphen splitting. Does the detokenizer offer a corresponding option to rejoin the separated hyphens?

cheers
Lefteris

--
MSc. Inf. Eleftherios Avramidis
DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
Tel. +49-30 238 95-1806
Fax. +49-30 238 95-1810
