[jira] [Commented] (JOSHUA-307) Java-based tokenization and normalization

Kellen Sunderland (JIRA) Mon, 29 Aug 2016 14:33:05 -0700

    [ https://issues.apache.org/jira/browse/JOSHUA-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447105#comment-15447105 ]


Kellen Sunderland commented on JOSHUA-307:
------------------------------------------

+1.  This would be great, and could go into the CLI module.

> Java-based tokenization and normalization
> -----------------------------------------
>
>                 Key: JOSHUA-307
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-307
>             Project: Joshua
>          Issue Type: Improvement
>            Reporter: Matt Post
>            Priority: Minor
>             Fix For: 6.2
>
>
> Currently, Joshua expects data to be lowercased, normalized, and tokenized 
> consistent with the way the training data was prepared before being passed 
> in. This requires calling Perl scripts on the input data. It would be nice if 
> these Perl scripts (located under $JOSHUA/scripts/preparation) were rewritten 
> in Java (under org.apache.joshua.util) so that Joshua could do this 
> normalization itself. This would be particularly useful for the language 
> packs.
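
For illustration only, the three steps named in the description (normalize, tokenize, lowercase) could be chained in plain Java roughly as follows. This is a minimal sketch, not a port of the Perl scripts under $JOSHUA/scripts/preparation: the class name SimplePreprocessor and its methods are hypothetical, and the rules (NFKC normalization, splitting on ASCII punctuation) are assumptions that will not match how any particular training data was prepared.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: not part of Joshua, and not equivalent to the
// existing Perl preparation scripts. The rules below are assumptions.
public class SimplePreprocessor {

    // Apply Unicode NFKC normalization, then collapse runs of whitespace.
    public static String normalize(String line) {
        String nfkc = Normalizer.normalize(line, Normalizer.Form.NFKC);
        return nfkc.replaceAll("\\s+", " ").trim();
    }

    // Surround every ASCII punctuation character with spaces, then
    // collapse whitespace. Note that this naive rule also splits
    // apostrophes and hyphens ("don't" becomes "don ' t").
    public static String tokenize(String line) {
        String spaced = line.replaceAll("(\\p{Punct})", " $1 ");
        return spaced.replaceAll("\\s+", " ").trim();
    }

    // Lowercase independently of the JVM's default locale.
    public static String lowercase(String line) {
        return line.toLowerCase(Locale.ROOT);
    }

    // Full pipeline: normalize, then tokenize, then lowercase.
    public static String prepare(String line) {
        return lowercase(tokenize(normalize(line)));
    }

    public static void main(String[] args) {
        System.out.println(prepare("Hello,  World!")); // prints "hello , world !"
    }
}
```

Whatever rules are used, they would have to reproduce the preprocessing applied to the training data, which is why a real implementation would need to follow the existing scripts rather than the simplified rules assumed here.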



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
