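Yes, by design. StandardAnalyzer implements "simple word boundaries" (the technical term is "Unicode text segmentation") and nothing more. As the javadoc says, "As of Lucene version 3.1, this class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29." That is a "standard". Those rules do not treat '@' as part of a word, which is why the address is split there. The pre-3.1 behavior was not dropped: it is what ClassicTokenizer (and with it ClassicAnalyzer) now provides.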

See:
http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html
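
To make the effect of those word break rules concrete, here is a minimal sketch in plain Java. It is not Lucene code and not a full UAX #29 implementation: it assumes ASCII input only and models just the handling of '.', which does not break a word when it sits between two letters or between two digits. The class name WordBreakSketch, its methods, and the address someone@example.org are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; StandardTokenizer itself is generated
// from a full grammar and covers far more than this sketch.
public class WordBreakSketch {
    // ASCII-only stand-ins for the UAX #29 classes ALetter and Numeric.
    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // '.' stays inside a word only between two letters (rules WB6/WB7)
    // or between two digits (rules WB11/WB12).
    static boolean joins(String s, int i) {
        if (s.charAt(i) != '.' || i == 0 || i + 1 >= s.length()) {
            return false;
        }
        char prev = s.charAt(i - 1);
        char next = s.charAt(i + 1);
        return (isLetter(prev) && isLetter(next)) || (isDigit(prev) && isDigit(next));
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isLetter(c) || isDigit(c) || joins(s, i)) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character, '@' included, ends the current word
                // and is itself discarded.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("someone@example.org")); // prints [someone, example.org]
    }
}
```

The '@' falls outside every class that can sit inside a word, so the address comes apart at that point, while the '.' in the domain is kept because it has letters on both sides.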

-- Jack Krupansky

-----Original Message-----
From: kiwi clive
Sent: Wednesday, October 24, 2012 6:42 AM
To: java-user@lucene.apache.org
Subject: StandardAnalyzer functionality change

Hi all,

Sorry if I'm asking an age-old question, but we have migrated to Lucene 3.6.0 and I see that StandardAnalyzer has changed its behaviour, particularly when tokenizing email addresses. From reading the forums, I understand StandardAnalyzer was renamed to ClassicAnalyzer. Is this the case?


If I pass the string 'u...@domain.com' through these analyzers, I get the following tokens:

Using StandardAnalyzer(Version.LUCENE_23):  -->  u...@domain.com (one token)

Using StandardAnalyzer(Version.LUCENE_36): --> user domain.com (two tokens)

Using ClassicAnalyzer(Version.LUCENE_36): --> u...@domain.com (one token)

StandardAnalyzer is normally a good compromise as a default analyzer, but the failure to keep an email address intact makes it less fit for purpose than it used to be. Is this a bug or is it by design? If by design, what is the reason for the change, and is ClassicAnalyzer now the de facto analyzer to use?

Thanks,
Clive

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
