Hi Steve,

Thanks for the input.

How can I apply WordDelimiterGraphFilter / WordDelimiterFilter to email tokens only, selected with an email regex? For other tokens, which contain other kinds of special characters, I want only the analyzed tokens (no preserved original).
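To make the token output asked about above concrete, here is a minimal plain-Java sketch of the intended behaviour. It is not Lucene code and does not use WordDelimiterGraphFilter. The class name `EmailAwareSplitSketch`, the `analyze` helper, and the simplified email pattern are all assumptions made for illustration. Tokens matching the pattern keep their original form (position increment 1) with the first part stacked on the same position (increment 0); all other tokens yield only their parts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class EmailAwareSplitSketch {

    // Hypothetical, deliberately simple email pattern; real addresses are more varied.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    // Returns "term/positionIncrement" strings for a whitespace-separated input.
    public static List<String> analyze(String input) {
        List<String> out = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            boolean isEmail = EMAIL.matcher(token).matches();
            if (isEmail) {
                // Preserve the whole address as a single term.
                out.add(token + "/1");
            }
            boolean first = true;
            // Split on any run of non-alphanumeric characters.
            for (String part : token.split("[^A-Za-z0-9]+")) {
                if (part.isEmpty()) {
                    continue; // leading delimiter or empty input
                }
                // The first part of an email shares the position of the preserved original.
                int posInc = (isEmail && first) ? 0 : 1;
                out.add(part + "/" + posInc);
                first = false;
            }
        }
        return out;
    }
}
```

Calling `EmailAwareSplitSketch.analyze("will.smith@example.com foo-bar")` returns `[will.smith@example.com/1, will/0, smith/1, example/1, com/1, foo/1, bar/1]`. The email token keeps its original, while `foo-bar` is reduced to its parts only. In an actual analyzer the same decision would have to be made inside a token filter, which this sketch does not attempt.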
--
Kumaran R

On Thu, Jun 15, 2017 at 7:43 PM, Steve Rowe <sar...@gmail.com> wrote:

> Hi Kumaran,
>
> WordDelimiterGraphFilter with PRESERVE_ORIGINAL should do what you want:
> <http://lucene.apache.org/core/6_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html>.
>
> Here’s a test I added to TestWordDelimiterGraphFilter.java that passed for me:
>
> -----
> public void testEmail() throws Exception {
>   final int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS
>       | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | PRESERVE_ORIGINAL;
>   Analyzer a = new Analyzer() {
>     @Override public TokenStreamComponents createComponents(String field) {
>       Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
>       return new TokenStreamComponents(tokenizer,
>           new WordDelimiterGraphFilter(tokenizer, flags, null));
>     }
>   };
>   assertAnalyzesTo(a, "will.sm...@yahoo.com",
>       new String[] { "will.sm...@yahoo.com", "will", "smith", "yahoo", "com" },
>       null, null, null,
>       new int[] { 1, 0, 1, 1, 1 },
>       null, false);
>   a.close();
> }
> -----
>
> --
> Steve
> www.lucidworks.com
>
> On Jun 15, 2017, at 8:53 AM, Kumaran Ramasubramanian <kums....@gmail.com> wrote:
> >
> > Hi All,
> >
> > i want to index email fields as both analyzed and not analyzed using custom
> > analyzer.
> >
> > for example,
> > sm...@yahoo.com
> > will.sm...@yahoo.com
> >
> > that is, indexing sm...@yahoo.com as single token as well as analyzed
> > tokens in same email field...
> >
> > My existing custom analyzer,
> >
> > public class CustomSearchAnalyzer extends StopwordAnalyzerBase
> > {
> >
> >     public CustomSearchAnalyzer(Version matchVersion, Reader stopwords) throws Exception
> >     {
> >         super(matchVersion, loadStopwordSet(stopwords, matchVersion));
> >     }
> >
> >     @Override
> >     protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader)
> >     {
> >         final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
> >         src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >         TokenStream tok = new ClassicFilter(src);
> >         tok = new LowerCaseFilter(getVersion(), tok);
> >         tok = new StopFilter(getVersion(), tok, stopwords);
> >         tok = new ASCIIFoldingFilter(tok); // to enable AccentInsensitive search
> >
> >         return new Analyzer.TokenStreamComponents(src, tok)
> >         {
> >             @Override
> >             protected void setReader(final Reader reader) throws IOException
> >             {
> >                 src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >                 super.setReader(reader);
> >             }
> >         };
> >     }
> > }
> >
> > And so i want to achieve like,
> >
> > 1.if i search using query "sm...@yahoo.com", records with
> > will.sm...@yahoo.com should not come...
> > 2.Also i should be able to search using query "smith" in that field
> > 3.if possible, should be able to detect email values in all other fields
> > and apply the same type of tokenization
> >
> > How to achieve point 1 and 2 using UAX29URLEmailTokenizer? how to add
> > UAX29URLEmailTokenizer in my existing custom analyzer without using email
> > analyzer ( perfieldanalyzer ) for email field.. And so i can apply this
> > tokenizer for email terms of all fields..
> >
> > -
> > Kumaran R