[ https://issues.apache.org/jira/browse/JOSHUA-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267521#comment-15267521 ]
Matt Post commented on JOSHUA-145:
----------------------------------

Reclassified.

I recently added a related feature to Joshua. If you invoke the decoder with -lowercase, all the input sentence tokens will be lowercased, and the grammar lookups will use the lowercased version. It then adds an annotation on each token of the form

    lettercase = {lower, upper, all-upper}

This annotation is available to any feature function, for example.

If you also invoke the decoder with "-project-case", it will use word-level alignments to project source-language case to the target language, according to the following logic:

- If aligned to the first word, case is only projected if it is "all-upper"
- Otherwise, project the source-language case

This does things like project all caps, and capitalization of names (including if they were OOVs). It's different from true-casing or re-casing. I haven't done a thorough comparison, but this was the method that helped put a relatively simple Joshua system in first place for WMT 2016 en-tr.

> Add truecasing
> --------------
>
>                 Key: JOSHUA-145
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-145
>             Project: Joshua
>          Issue Type: New Feature
>            Reporter: Matt Post
>            Assignee: Matt Post
>             Fix For: 6.1
>
>
> Joshua currently lowercases all data; a better approach is truecasing, where
> the most frequent capitalization pattern is used for each token.
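
For readers who want to see the projection rule from the comment above in concrete form, here is a minimal Python sketch. It is not Joshua's actual code. The function names (`lettercase`, `apply_case`, `project_case`), the alignment format (a dict from target index to source index), and the reading of "upper" as "initial capital" are all assumptions made for illustration. It also assumes "aligned to the first word" means aligned to the first token of the source sentence.

```python
def lettercase(token):
    """Classify a token's case (assumed meaning of the lettercase annotation)."""
    if len(token) > 1 and token.isupper():
        return "all-upper"
    if token[:1].isupper():
        return "upper"  # assumption: "upper" means initial capital only
    return "lower"


def apply_case(token, case):
    """Apply a lettercase value to a (lowercased) target token."""
    if case == "all-upper":
        return token.upper()
    if case == "upper":
        return token[:1].upper() + token[1:]
    return token


def project_case(source, target, alignment):
    """Project source-side case onto target tokens.

    source:    source tokens in their original case
    target:    lowercased output tokens
    alignment: hypothetical format, dict of target index -> source index
    """
    result = []
    for t_idx, token in enumerate(target):
        s_idx = alignment.get(t_idx)
        if s_idx is None:
            # Unaligned target tokens are left as they are.
            result.append(token)
            continue
        case = lettercase(source[s_idx])
        # Sentence-initial capitalization is positional, so for the first
        # source word only "all-upper" is projected.
        if s_idx == 0 and case != "all-upper":
            result.append(token)
        else:
            result.append(apply_case(token, case))
    return result


# Toy example with a reordered pseudo-translation.
source = ["The", "NASA", "engineer", "Ahmet"]
target = ["ahmet", "the", "nasa", "engineer"]
alignment = {0: 3, 1: 0, 2: 1, 3: 2}
print(project_case(source, target, alignment))
# -> ['Ahmet', 'the', 'NASA', 'engineer']
```

The first-word exception keeps a sentence-initial capital from following its word to a different position in the output, while an all-caps word is still carried over wherever it lands.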