DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG-RELATED
COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=35971>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=35971

           Summary: StandardTokenizer has problems with comma-separated
                    values
           Product: Lucene
           Version: 1.4
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Analysis
        AssignedTo: [email protected]
        ReportedBy: [EMAIL PROTECTED]


The StandardTokenizer assumes that if a phrase contains a comma and at least one
digit, the phrase has to be a number. We are trying to index comma-separated
values of SAP R/3 transaction codes along with standard text. Many of these codes
contain digits, e.g. "VA01" or "SE80". While tokenizing text containing these
codes, Lucene recognizes a comma-separated list of them, e.g.
"VA01,VA02,VA03", as a single number. The grammar should be modified to
recognize numbers correctly (e.g. only phrases containing nothing but digits).
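
To illustrate the difference, here is a minimal sketch that uses plain
java.util.regex rather than the actual StandardTokenizer grammar. The class
name NumberTokenSketch, both method names and both patterns are hypothetical.
They only approximate the behaviour described above (a comma plus at least one
digit) and the stricter digits-only rule being requested.

```java
import java.util.regex.Pattern;

public class NumberTokenSketch {
    // Assumption: loose rule as described in this report, i.e. alphanumeric
    // segments joined by commas, with at least one digit somewhere.
    private static final Pattern COMMA_JOINED =
        Pattern.compile("[A-Za-z0-9]+(,[A-Za-z0-9]+)+");
    private static final Pattern HAS_DIGIT = Pattern.compile(".*[0-9].*");

    // Assumption: stricter rule, i.e. digits only, optionally grouped by commas.
    private static final Pattern DIGITS_ONLY =
        Pattern.compile("[0-9]+(,[0-9]+)*");

    // True if the phrase would be treated as one number under the loose rule.
    public static boolean isNumberLoose(String phrase) {
        return COMMA_JOINED.matcher(phrase).matches()
            && HAS_DIGIT.matcher(phrase).matches();
    }

    // True if the phrase would be treated as a number under the strict rule.
    public static boolean isNumberStrict(String phrase) {
        return DIGITS_ONLY.matcher(phrase).matches();
    }
}
```

With these patterns, "VA01,VA02,VA03" satisfies the loose rule but not the
strict one, while "1,000" satisfies both.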

-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

