Re: email field - analyzed and not analyzed in single field using custom analyzer

Kumaran Ramasubramanian Mon, 19 Jun 2017 20:53:43 -0700

Hi Steve

Thanks for the input. How can I apply WordDelimiterGraphFilter /
WordDelimiterFilter to email tokens only, for example by matching them
against an email regex? For other tokens that contain other kinds of
special characters, I want only the analyzed tokens, without the original.



--
Kumaran R
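
The selective behaviour asked about here can be sketched in plain Java, without any Lucene classes: keep the original token only when it looks like an email address, and emit just the split parts for every other token. This is only an illustration of the logic, not Lucene API. The class name, the sample address and the deliberately loose email regex are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class EmailAwareSplitter {
    // Loose email pattern, for illustration only (an assumption, not a validator).
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");
    // Any run of non-alphanumeric characters separates word parts.
    private static final Pattern DELIMS = Pattern.compile("[^A-Za-z0-9]+");

    // Whitespace-tokenize, then split each token into its alphanumeric parts.
    // The whole token is kept in addition only when it matches the email pattern.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (EMAIL.matcher(token).matches()) {
                out.add(token); // preserve the original for emails only
            }
            for (String part : DELIMS.split(token)) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [jane.doe@example.com, jane, doe, example, com, wi, fi]
        System.out.println(analyze("jane.doe@example.com wi-fi"));
    }
}
```

In a real analysis chain the same decision would have to be made per token inside a filter, which this sketch does not attempt to show.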






On Thu, Jun 15, 2017 at 7:43 PM, Steve Rowe <sar...@gmail.com> wrote:

> Hi Kumaran,
>
> WordDelimiterGraphFilter with PRESERVE_ORIGINAL should do what you want:
> <http://lucene.apache.org/core/6_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html>.
>
> Here’s a test I added to TestWordDelimiterGraphFilter.java that passed
> for me:
>
> -----
> public void testEmail() throws Exception {
>   final int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS
>       | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | PRESERVE_ORIGINAL;
>   Analyzer a = new Analyzer() {
>     @Override public TokenStreamComponents createComponents(String field) {
>       Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
>       return new TokenStreamComponents(tokenizer,
>           new WordDelimiterGraphFilter(tokenizer, flags, null));
>     }
>   };
>   assertAnalyzesTo(a, "will.sm...@yahoo.com",
>       new String[] { "will.sm...@yahoo.com", "will", "smith", "yahoo", "com" },
>       null, null, null,
>       new int[] { 1, 0, 1, 1, 1 },
>       null, false);
>   a.close();
> }
> -----
>
> --
> Steve
> www.lucidworks.com
>
> > On Jun 15, 2017, at 8:53 AM, Kumaran Ramasubramanian <kums....@gmail.com> wrote:
> >
> > Hi All,
> >
> > I want to index email fields as both analyzed and not analyzed, using a
> > custom analyzer.
> >
> > For example:
> > sm...@yahoo.com
> > will.sm...@yahoo.com
> >
> > That is, sm...@yahoo.com should be indexed as a single token as well as
> > analyzed tokens, in the same email field.
> >
> >
> > My existing custom analyzer,
> >
> > public class CustomSearchAnalyzer extends StopwordAnalyzerBase
> > {
> >
> >    public CustomSearchAnalyzer(Version matchVersion, Reader stopwords)
> >            throws Exception
> >    {
> >        super(matchVersion, loadStopwordSet(stopwords, matchVersion));
> >    }
> >
> >    @Override
> >    protected Analyzer.TokenStreamComponents createComponents(
> >            final String fieldName, final Reader reader)
> >    {
> >        final ClassicTokenizer src =
> >                new ClassicTokenizer(getVersion(), reader);
> >        src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >        TokenStream tok = new ClassicFilter(src);
> >        tok = new LowerCaseFilter(getVersion(), tok);
> >        tok = new StopFilter(getVersion(), tok, stopwords);
> >        // ASCII folding enables accent-insensitive search
> >        tok = new ASCIIFoldingFilter(tok);
> >
> >        return new Analyzer.TokenStreamComponents(src, tok)
> >        {
> >            @Override
> >            protected void setReader(final Reader reader) throws IOException
> >            {
> >                src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >                super.setReader(reader);
> >            }
> >        };
> >    }
> > }
> >
> >
> > What I want to achieve:
> >
> > 1. If I search with the query "sm...@yahoo.com", records containing
> > will.sm...@yahoo.com should not be returned.
> > 2. I should also be able to search that field with the query "smith".
> > 3. If possible, email values in all other fields should be detected and
> > tokenized the same way.
> >
> > How can I achieve points 1 and 2 using UAX29URLEmailTokenizer? How can I
> > add UAX29URLEmailTokenizer to my existing custom analyzer, without using a
> > separate email analyzer (via a per-field analyzer) for the email field, so
> > that I can apply this tokenizer to email terms in all fields?
> >
> >
> >
> > -
> > Kumaran R
>
>
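
Points 1 and 2 of the original question can be illustrated with a toy inverted index in plain Java. It is a sketch only: no Lucene classes are involved, the class name and sample addresses are hypothetical, and it assumes the query string is looked up as a single term rather than being split the way the indexed text is. Each indexed token is stored whole plus as its alphanumeric parts, roughly what PRESERVE_ORIGINAL yields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PreserveOriginalDemo {
    // Index side: emit the whole (lowercased) token and each alphanumeric part.
    static List<String> indexTokens(String value) {
        List<String> out = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            out.add(token);
            for (String part : token.split("[^a-z0-9]+")) {
                if (!part.isEmpty() && !part.equals(token)) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    // Map each term to the sorted set of document ids that contain it.
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : indexTokens(docs.get(id))) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    // Query side (assumption): the query is looked up as ONE term, unsplit.
    static Set<Integer> search(Map<String, Set<Integer>> postings, String query) {
        return postings.getOrDefault(query.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> idx =
            buildIndex(Arrays.asList("doe@example.com", "jane.doe@example.com"));
        System.out.println(search(idx, "doe@example.com")); // [0]
        System.out.println(search(idx, "doe"));             // [0, 1]
        System.out.println(search(idx, "jane"));            // [1]
    }
}
```

The exact-address query matches only document 0 because the longer address was indexed under a different whole-token term. If the query were split into parts before lookup, it would match both documents, so the query-time analysis needs to keep the address intact for point 1 to hold.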
