[jira] [Commented] (JOSHUA-145) Add truecasing

Matt Post (JIRA) Mon, 02 May 2016 14:14:05 -0700

    [ 
https://issues.apache.org/jira/browse/JOSHUA-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267521#comment-15267521
 ]


Matt Post commented on JOSHUA-145:
----------------------------------

Reclassified.

I recently added a related feature to Joshua. If you invoke the decoder with 
"-lowercase", all the input sentence tokens will be lowercased, and the grammar 
lookups will use the lowercased versions. It then adds an annotation on each 
token of the form

    lettercase = {lower, upper, all-upper}

This annotation is available to any feature function, for example. If you also 
invoke the decoder with "-project-case", it will use word-level alignments to 
project source-language case to the target language, according to the following 
logic:

- If aligned to the first word, case is only projected if it is "all-upper"
- Otherwise, project the source-language case

This projects things like all caps and the capitalization of names (even when 
they are OOVs). It's different from true-casing or re-casing. I haven't 
done a thorough comparison, but this was the method that helped put a 
relatively simple Joshua system in first place for WMT 2016 en-tr.
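
The annotation and the two projection rules above can be sketched roughly as 
follows. This is a Python illustration, not the actual Joshua (Java) code. The 
function names, the one-to-one alignment map, the reading of "first word" as 
the first source token, and the exact definitions of "upper" (initial capital) 
and "all-upper" (more than one character, all cased letters uppercase) are 
assumptions made for the example.

```python
from typing import Dict, List


def lettercase(token: str) -> str:
    """Classify a source token as 'lower', 'upper', or 'all-upper'.

    Assumed definitions: 'all-upper' needs more than one character with all
    cased letters uppercase; 'upper' means the first character is uppercase.
    """
    if len(token) > 1 and token.isupper():
        return "all-upper"
    if token[:1].isupper():
        return "upper"
    return "lower"


def project_case(source: List[str], target: List[str],
                 alignment: Dict[int, int]) -> List[str]:
    """Project source-side case onto lowercased target tokens.

    `alignment` maps a target index to a single source index (a hypothetical
    simplification; real word alignments can be many-to-many).
    """
    projected = []
    for t_idx, word in enumerate(target):
        s_idx = alignment.get(t_idx)
        if s_idx is None:
            # Unaligned target word: leave it as decoded.
            projected.append(word)
            continue
        case = lettercase(source[s_idx])
        if case == "all-upper":
            # All caps is projected even from the first source word.
            projected.append(word.upper())
        elif case == "upper" and s_idx != 0:
            # Sentence-initial capitalization is not projected.
            projected.append(word[:1].upper() + word[1:])
        else:
            projected.append(word)
    return projected


# Made-up example tokens, for illustration only.
source = ["The", "NATO", "summit", "in", "Ankara"]
target = ["ankara'daki", "nato", "zirvesi"]
alignment = {0: 4, 1: 1, 2: 2}
print(" ".join(project_case(source, target, alignment)))
```

Because the rule reads case from the aligned source token and not from a 
trained model, it applies equally to words passed through as OOVs.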

> Add truecasing
> --------------
>
>                 Key: JOSHUA-145
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-145
>             Project: Joshua
>          Issue Type: New Feature
>            Reporter: Matt Post
>            Assignee: Matt Post
>             Fix For: 6.1
>
>
> Joshua currently lowercases all data; a better approach is truecasing, where 
> the most frequent capitalization pattern is used for each token.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
