Punctuation handling in StandardTokenizer (and WikipediaTokenizer)
------------------------------------------------------------------

                 Key: LUCENE-1161
                 URL: https://issues.apache.org/jira/browse/LUCENE-1161
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Grant Ingersoll
            Priority: Minor


It would be useful, in the StandardTokenizer, to have more control over how 
in-word punctuation is handled.  For instance, it is not always desirable to 
split on dashes or other punctuation.  In other cases, one may want to output 
the split tokens plus a collapsed version of the token with the punctuation 
removed.

For example, Solr's WordDelimiterFilter provides some nice capabilities here, 
but it can't do its job when used with the StandardTokenizer, because the 
StandardTokenizer has already decided how to handle the punctuation without 
giving the user any choice.

I think, in JFlex, we can have a back-compatible way of letting users make 
decisions about punctuation that occurs inside of a token.  For tokens such as 
e-bay or i-pod, this would allow for matches on eBay and iPod.
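
To make the "split tokens plus a collapsed version" idea concrete, here is a 
rough sketch in plain Java.  It is only an illustration of the desired output, 
not a proposed patch: the class and method names (InWordPunctuationDemo, 
expand) are made up for this example, it does not use any Lucene or Solr API, 
and it treats every non-alphanumeric character as in-word punctuation, which 
is an assumption rather than what the JFlex grammar does.

```java
import java.util.ArrayList;
import java.util.List;

public class InWordPunctuationDemo {

    /**
     * Hypothetical helper (not part of Lucene or Solr): returns the parts of
     * a token split on non-alphanumeric characters, followed by a collapsed
     * form with those characters removed. A token that has only one part is
     * returned as that single part.
     */
    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        StringBuilder collapsed = new StringBuilder();
        StringBuilder part = new StringBuilder();

        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                part.append(c);
                collapsed.append(c);
            } else if (part.length() > 0) {
                // Punctuation ends the current part.
                out.add(part.toString());
                part.setLength(0);
            }
        }
        if (part.length() > 0) {
            out.add(part.toString());
        }
        // Emit the collapsed form only when the token was actually split.
        if (out.size() > 1) {
            out.add(collapsed.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("e-bay"));   // [e, bay, ebay]
        System.out.println(expand("i-pod"));   // [i, pod, ipod]
        System.out.println(expand("lucene"));  // [lucene]
    }
}
```

With output like this, a query for eBay could match the collapsed "ebay" token 
once both sides are lowercased; lowercasing is assumed to happen in a later 
filter and is not shown here.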


