Small correction: UAX29URLEmailAnalyzer = StandardAnalyzer + URL + Email. (Full support for URLs with the file:, ftp:, and http/s: protocols; full email support.)
ClassicAnalyzer is a different beast altogether. First of all, it doesn't implement Unicode segmentation - it has a non-standard tokenizer that works okay for some English text. It does recognize some (maybe most?) email addresses, but not all of them (e.g. the '+' character, a valid username char in email addresses, is not supported). It does not recognize URLs, but rather domain names, aka hostnames.

Steve

On Oct 24, 2012, at 3:52 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> I didn't explicitly say it, but ClassicAnalyzer does do exactly what you want
> it to do - word break plus email and URL, or StandardAnalyzer plus email and
> URL.
>
> -- Jack Krupansky
>
> -----Original Message----- From: kiwi clive
> Sent: Wednesday, October 24, 2012 1:27 PM
> To: java-user@lucene.apache.org
> Subject: Re: StandardAnalyzer functionality change
>
> Thanks for the responses chaps, very informative, and most appreciated :-)
>
> ________________________________
> From: Ian Lea <ian....@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Wednesday, October 24, 2012 4:19 PM
> Subject: Re: StandardAnalyzer functionality change
>
> If you want email addresses, UAX29URLEmailAnalyzer is another alternative.
>
> --
> Ian.
>
> On Wed, Oct 24, 2012 at 3:56 PM, Jack Krupansky <j...@basetechnology.com>
> wrote:
>> Yes, by design. StandardAnalyzer implements "simple word boundaries" (the
>> technical term is "Unicode text segmentation"), period. As the javadoc says,
>> "As of Lucene version 3.1, this class implements the Word Break rules from
>> the Unicode Text Segmentation algorithm, as specified in Unicode Standard
>> Annex #29." That is a "standard".
>>
>> See:
>> http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
>> http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: kiwi clive
>> Sent: Wednesday, October 24, 2012 6:42 AM
>> To: java-user@lucene.apache.org
>> Subject: StandardAnalyzer functionality change
>>
>> Hi all,
>>
>> Sorry if I'm asking an age-old question but we have migrated to Lucene 3.6.0
>> and I see StandardAnalyzer has changed its behaviour, particularly when
>> tokenizing email addresses. From reading the forums, I understand
>> StandardAnalyzer was renamed to ClassicAnalyzer - is this the case?
>>
>> If I pass the string 'u...@domain.com' through these analyzers, I get the
>> following tokens:
>>
>> Using StandardAnalyzer(Version.LUCENE_23): --> u...@domain.com (one token)
>> Using StandardAnalyzer(Version.LUCENE_36): --> user domain.com (two tokens)
>> Using ClassicAnalyzer(Version.LUCENE_36): --> u...@domain.com (one token)
>>
>> StandardAnalyzer is normally a good compromise as a default analyzer but the
>> failure to keep an email address intact makes it less fit for purpose than
>> it used to be. Is this a bug or is it by design? If by design, what is the
>> reason for the change and is ClassicAnalyzer now the de facto analyzer to use?
>>
>> Thanks,
>> Clive
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
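
As a rough illustration of the '+' point made at the top of this thread, here is a small plain-Java sketch using java.util.regex. It does not use Lucene, and the two patterns are hypothetical simplifications written for this example, not the actual grammar of ClassicTokenizer or of the tokenizer behind UAX29URLEmailAnalyzer. It only shows how a username character class that omits '+' ends up matching part of an address instead of the whole thing. The class name, field names, and sample addresses are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: both patterns are made-up simplifications, not Lucene's grammar.
final class EmailTokenSketch {

    // Hypothetical username class WITHOUT '+', then '@', then a dotted host name.
    static final Pattern WITHOUT_PLUS =
            Pattern.compile("[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    // Same pattern, but '+' is accepted in the username part.
    static final Pattern WITH_PLUS =
            Pattern.compile("[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    // Returns the first substring of text that the pattern matches, or null if none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String plain = "user@domain.com";
        String tagged = "user+tag@domain.com";
        System.out.println(firstMatch(WITHOUT_PLUS, plain));   // user@domain.com
        System.out.println(firstMatch(WITHOUT_PLUS, tagged));  // tag@domain.com
        System.out.println(firstMatch(WITH_PLUS, tagged));     // user+tag@domain.com
    }
}
```

To see what the real analyzers do with such input on your Lucene version, feed the same strings through each analyzer's token stream and print the terms it emits.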