[ https://issues.apache.org/jira/browse/JOSHUA-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447105#comment-15447105 ]
Kellen Sunderland commented on JOSHUA-307:
------------------------------------------
+1. This would be great, and could go into the CLI module.
> Java-based tokenization and normalization
> -----------------------------------------
>
> Key: JOSHUA-307
> URL: https://issues.apache.org/jira/browse/JOSHUA-307
> Project: Joshua
> Issue Type: Improvement
> Reporter: Matt Post
> Priority: Minor
> Fix For: 6.2
>
>
> Currently, Joshua expects input data to be lowercased, normalized, and
> tokenized before it is passed in, in the same way the training data was
> prepared. This requires calling Perl scripts on the input data. It would be
> nice if these Perl scripts (located under $JOSHUA/scripts/preparation) were
> rewritten in Java (under org.apache.joshua.util) so that Joshua could do this
> normalization itself. This would be particularly useful for the language
> packs.
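
For illustration only, the three steps named in the description (normalize,
lowercase, tokenize) could be sketched in plain Java roughly as follows. The
class name SimplePreprocessor and its preprocess method are hypothetical and
are not part of Joshua. The rules here are deliberately crude (NFKC
normalization, locale-independent lowercasing, splitting off ASCII
punctuation) and are not assumed to match the behaviour of the existing Perl
scripts.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration; not an existing Joshua class.
final class SimplePreprocessor {

    private SimplePreprocessor() {}

    // Normalizes, lowercases, and tokenizes one line of input.
    // Assumption: these rules are a stand-in, not a port of the Perl scripts.
    static String preprocess(String line) {
        // 1. Unicode normalization: fold compatibility characters (NFKC).
        String s = Normalizer.normalize(line, Normalizer.Form.NFKC);
        // 2. Lowercase without depending on the default locale.
        s = s.toLowerCase(Locale.ROOT);
        // 3. Crude tokenization: put spaces around each ASCII punctuation mark.
        s = s.replaceAll("(\\p{Punct})", " $1 ");
        // 4. Collapse runs of whitespace and trim the ends.
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(preprocess("Hello, World!  It's   a TEST."));
    }
}
```

A real port would need to reproduce the language-specific rules of the
scripts under $JOSHUA/scripts/preparation, for example their handling of
apostrophes and abbreviations, which this sketch splits naively.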