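Yes, by design. As of 3.1, StandardAnalyzer implements "simple word
boundaries" (the technical term is "Unicode text segmentation") and nothing
more. As the javadoc says, "As of Lucene version 3.1, this class implements
the Word Break rules from the Unicode Text Segmentation algorithm, as
specified in Unicode Standard Annex #29." That is the "standard" in the
name. Those rules do not treat "@" as part of a word, so an email address is
split there. The pre-3.1 behavior is still available as ClassicAnalyzer
(ClassicTokenizer).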
See:
http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html
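To make the splitting concrete, here is a toy sketch of the word-break idea.
It is not Lucene code and not a full UAX #29 implementation; the class name
WordBreakSketch, its methods, and the sample input "user@domain.com" are all
assumptions made up for illustration. It only models two facts: "." joins
letters or digits when it sits between them, and "@" always ends a word.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: NOT Lucene's StandardTokenizer and NOT full UAX #29.
// Simplification: letters and digits are treated as one class, whereas the
// real rules handle letters and numbers separately.
class WordBreakSketch {

    // In UAX #29, '.' has Word_Break=MidNumLet: it stays inside a word only
    // when it has a letter or digit on both sides. '@' has no such property.
    static boolean isJoiner(char c) {
        return c == '.';
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (isJoiner(c)
                    && current.length() > 0
                    && i + 1 < text.length()
                    && Character.isLetterOrDigit(text.charAt(i + 1))) {
                // Joiner between two word characters: keep the word going.
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character (such as '@') ends the current word.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical input, chosen to match the two tokens reported below.
        System.out.println(tokenize("user@domain.com"));
    }
}
```

Run as written, main prints [user, domain.com], the same two tokens reported
for StandardAnalyzer(Version.LUCENE_36) in the original message.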
-- Jack Krupansky
-----Original Message-----
From: kiwi clive
Sent: Wednesday, October 24, 2012 6:42 AM
To: [email protected]
Subject: StandardAnalyzer functionality change
Hi all,
Sorry if I'm asking an age-old question, but we have migrated to Lucene
3.6.0 and I see that StandardAnalyzer has changed its behaviour,
particularly when tokenizing email addresses. From reading the forums, I
understand StandardAnalyzer was renamed to ClassicAnalyzer. Is this the
case?
If I pass the string '[email protected]' through these analyzers, I get the
following tokens:
Using StandardAnalyzer(Version.LUCENE_23): --> [email protected] (one token)
Using StandardAnalyzer(Version.LUCENE_36): --> user domain.com (two tokens)
Using ClassicAnalyzer(Version.LUCENE_36): --> [email protected] (one token)
StandardAnalyzer is normally a good compromise as a default analyzer, but
the failure to keep an email address intact makes it less fit for purpose
than it used to be. Is this a bug or is it by design? If by design, what is
the reason for the change, and is ClassicAnalyzer now the de facto analyzer
to use?
Thanks,
Clive