[
https://issues.apache.org/jira/browse/OPENNLP-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joern Kottmann updated OPENNLP-809:
-----------------------------------
Priority: Major (was: Critical)
> Detokenize instead of splitting string with whitespaces
> -------------------------------------------------------
>
> Key: OPENNLP-809
> URL: https://issues.apache.org/jira/browse/OPENNLP-809
> Project: OpenNLP
> Issue Type: Bug
> Components: Name Finder
> Affects Versions: 1.6.0
> Reporter: Damiano
> Assignee: Joern Kottmann
>
> Hello,
> I do not understand why RegexNameFinder separates the tokens with a
> whitespace. It seems pointless to me.
> When we call `find(String[] token)`, you rebuild the string by appending a
> whitespace to the end of each token. Why?
> I ask because the original string may have been tokenized by the
> *SimpleTokenizer*, and, as you know, this tokenizer splits (for example)
> between a *word* and a *period*, so the rebuilt string gains a whitespace
> there. Example:
> Original:
> I am visiting Rome.
> Tokenized:
> I am visiting Rome*[SPLIT]*.
> Regex is applied to:
> I am visiting Rome .
> (instead of the original)
> In this version you have introduced a find() method that accepts a String
> instead of a String[], but in that case the caller passes the original
> string, not the rebuilt one, so the results are different.
> Why not apply a *detokenize* method that performs the *EXACT* inverse of
> the tokenization (and so recovers the original string instead of a modified
> one)?
> Thanks.
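
The mismatch described above can be reproduced without OpenNLP. The following is only a minimal sketch under stated assumptions: `DetokenizeSketch`, `tokenize`, `joinWithSpaces`, and `tokenRange` are hypothetical names, the tokenizer is a simplified stand-in for SimpleTokenizer (runs of letters or digits, with every other non-whitespace character as its own token), and the join uses single spaces between tokens rather than the exact rebuilding logic of RegexNameFinder. It shows that a pattern matching the original text can miss the rebuilt text, and that keeping character offsets per token is one possible way to run the regex on the original string and still report token indices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DetokenizeSketch {

    // Hypothetical simplified tokenizer: returns {start, end} character
    // offsets into the original text for each token.
    static List<int[]> tokenize(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (Character.isLetterOrDigit(c)) {
                // Consume a whole run of letters or digits.
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
            } else {
                // Any other character (e.g. a period) is a token by itself.
                i++;
            }
            spans.add(new int[] {start, i});
        }
        return spans;
    }

    // Rebuilds a string by putting one space between tokens; this is the
    // lossy step, because the original spacing is not consulted.
    static String joinWithSpaces(String text, List<int[]> spans) {
        StringBuilder sb = new StringBuilder();
        for (int[] s : spans) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(text, s[0], s[1]);
        }
        return sb.toString();
    }

    // Maps a character range in the original text to the token range
    // {firstToken, lastTokenExclusive} that overlaps it.
    static int[] tokenRange(List<int[]> spans, int charStart, int charEnd) {
        int first = -1;
        int last = -1;
        for (int t = 0; t < spans.size(); t++) {
            int[] s = spans.get(t);
            if (s[1] > charStart && s[0] < charEnd) {
                if (first < 0) {
                    first = t;
                }
                last = t + 1;
            }
        }
        return new int[] {first, last};
    }

    public static void main(String[] args) {
        String original = "I am visiting Rome.";
        List<int[]> spans = tokenize(original);
        String rebuilt = joinWithSpaces(original, spans);
        Pattern p = Pattern.compile("Rome\\.");

        System.out.println(rebuilt);                   // I am visiting Rome .
        System.out.println(p.matcher(original).find()); // true
        System.out.println(p.matcher(rebuilt).find());  // false

        // Matching on the original text and mapping back to tokens.
        Matcher m = p.matcher(original);
        if (m.find()) {
            int[] range = tokenRange(spans, m.start(), m.end());
            System.out.println(range[0] + ".." + range[1]); // 3..5
        }
    }
}
```

Whether offsets are kept alongside the tokens or a dedicated detokenizer is used instead is a design choice; the sketch only illustrates why the two `find` variants can disagree.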