Matt Post created JOSHUA-307:
--------------------------------
Summary: Java-based tokenization and normalization
Key: JOSHUA-307
URL: https://issues.apache.org/jira/browse/JOSHUA-307
Project: Joshua
Issue Type: Improvement
Reporter: Matt Post
Priority: Minor
Fix For: 6.2
Currently, Joshua expects input data to be lowercased, normalized, and tokenized
before it is passed in, consistent with the way the training data was prepared.
This requires calling Perl scripts on the input data. It would be nice if these
Perl scripts (located under $JOSHUA/scripts/preparation) were rewritten in Java
(under org.apache.joshua.util) so that Joshua could do this normalization
itself. This would be particularly useful for the language packs.
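
To make the request concrete, here is a rough sketch of what an in-process
version of the three steps might look like. It is not a port of the Perl
scripts under $JOSHUA/scripts/preparation and does not reproduce their rules.
The class name SimplePreprocessor and its method names are hypothetical, and
the tokenization rule (split off every punctuation or symbol character) is an
assumption chosen for brevity.

```java
import java.text.Normalizer;
import java.util.Locale;

/**
 * Hypothetical sketch only: not part of Joshua, and not equivalent to the
 * existing Perl preparation scripts.
 */
public final class SimplePreprocessor {

  private SimplePreprocessor() {}

  /** Applies Unicode NFKC normalization, then collapses runs of whitespace. */
  public static String normalize(String line) {
    String s = Normalizer.normalize(line, Normalizer.Form.NFKC);
    return s.replaceAll("\\s+", " ").trim();
  }

  /**
   * Assumed rule: surround every punctuation or symbol character with
   * spaces, then collapse whitespace. This is cruder than a real tokenizer
   * (for example, it splits "don't" into "don ' t").
   */
  public static String tokenize(String line) {
    String s = line.replaceAll("([\\p{P}\\p{S}])", " $1 ");
    return s.replaceAll("\\s+", " ").trim();
  }

  /** Lowercases with a fixed locale so results do not depend on the host. */
  public static String lowercase(String line) {
    return line.toLowerCase(Locale.ROOT);
  }

  /** Runs the steps in order: normalize, tokenize, lowercase. */
  public static String prepare(String line) {
    return lowercase(tokenize(normalize(line)));
  }

  public static void main(String[] args) {
    // Prints: hello , world !
    System.out.println(prepare("  Hello,   World!  "));
  }
}
```

Whatever rules are implemented would need to match the preprocessing applied
to the training data, so a real implementation would have to follow the
behavior of the existing scripts rather than the simplified rules above.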