[jira] [Updated] (OPENNLP-809) Detokenize instead of splitting string with whitespaces

Joern Kottmann (JIRA) Wed, 27 Apr 2016 01:43:52 -0700

     [ https://issues.apache.org/jira/browse/OPENNLP-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joern Kottmann updated OPENNLP-809:
-----------------------------------
    Priority: Major  (was: Critical)

> Detokenize instead of splitting string with whitespaces
> -------------------------------------------------------
>
>                 Key: OPENNLP-809
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-809
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>    Affects Versions: 1.6.0
>            Reporter: Damiano
>            Assignee: Joern Kottmann
>
> Hello,
> I do not understand why RegexNameFinder joins the tokens with a whitespace. 
> It seems pointless to me. 
> When we call `find(String[] token)`, the string is rebuilt by appending a 
> whitespace after each token. Why?
> I ask because the original string may have been tokenized by the 
> *SimpleTokenizer*, which, as you know, separates (for example) a *word* from 
> the *period* that follows it, so the rebuilt string gets a whitespace 
> between them. Example:
> Original:
> I am visiting Rome.
> Tokenized:
> I am visiting Rome*[SPLIT]*.
> Regex is applied to: 
> I am visiting Rome . 
> (instead of the original)
> This version introduces a find() method that accepts a String 
> instead of a String[], but in that case the caller passes the original 
> string, not the rebuilt one, so the results are different.
> Why not apply a *detokenize* method that performs the *EXACT* inverse of 
> the tokenization (and so recovers the original string instead of a modified 
> one)?
> Thanks.
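
The mismatch described above can be reproduced without OpenNLP. The following is a minimal sketch in plain Java, not the actual RegexNameFinder code: the class name `WhitespaceJoinDemo`, the helper names, and the hand-written token start offsets are all assumptions made for illustration. It joins the tokens with a trailing whitespace, as the report describes, and compares that with a hypothetical offset-based detokenize step.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; none of these names come from OpenNLP.
class WhitespaceJoinDemo {

    // Mimics the behaviour described in the report:
    // every token is followed by a single whitespace.
    static String joinWithTrailingSpace(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token).append(' ');
        }
        return sb.toString();
    }

    // Hypothetical inverse of tokenization: places each token at its
    // original start offset. Assumption: gaps in the original text were
    // plain spaces, so padding with ' ' restores them.
    static String detokenize(String[] tokens, int[] starts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            while (sb.length() < starts[i]) {
                sb.append(' ');
            }
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "I am visiting Rome.";
        String[] tokens = {"I", "am", "visiting", "Rome", "."};
        int[] starts = {0, 2, 5, 14, 18}; // start offset of each token in 'original'

        // A pattern that expects the period directly after the word.
        Pattern pattern = Pattern.compile("Rome\\.");

        String joined = joinWithTrailingSpace(tokens);   // "I am visiting Rome . "
        String restored = detokenize(tokens, starts);    // "I am visiting Rome."

        System.out.println(pattern.matcher(original).find()); // true
        System.out.println(pattern.matcher(joined).find());   // false
        System.out.println(pattern.matcher(restored).find()); // true
    }
}
```

The same pattern matches the original and the offset-based reconstruction but not the whitespace-joined string, which is the difference between the String and String[] code paths that the report points out. Keeping token offsets is only one possible way to invert the tokenization; a rule-based detokenizer would be another.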



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
