Hi Robert,

Yes, StandardTokenizer solves my case. Could you please explain the difference between ClassicTokenizer and StandardTokenizer, and how StandardTokenizer solves my case? I searched the web but was unable to understand.
Any help is greatly appreciated.

On Fri, Oct 20, 2017 at 12:10 AM, Robert Muir <rcm...@gmail.com> wrote:
> easy, don't use classictokenizer: use standardtokenizer instead.
>
> On Thu, Oct 19, 2017 at 9:37 AM, Chitra <chithu.r...@gmail.com> wrote:
> > Hi,
> > I indexed the term 'ⒶeŘꝋꝒɫⱯŋɇ' (aeroplane) and the term was
> > indexed as "er l n"; some characters were dropped while indexing.
> >
> > Here is my code:
> >
> > protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader)
> > {
> >     final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
> >     src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >
> >     TokenStream tok = new ClassicFilter(src);
> >     tok = new LowerCaseFilter(getVersion(), tok);
> >     tok = new StopFilter(getVersion(), tok, stopwords);
> >     tok = new ASCIIFoldingFilter(tok); // to enable accent-insensitive search
> >
> >     return new Analyzer.TokenStreamComponents(src, tok)
> >     {
> >         @Override
> >         protected void setReader(final Reader reader) throws IOException
> >         {
> >             src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >             super.setReader(reader);
> >         }
> >     };
> > }
> >
> > Am I missing anything? Is this expected behavior for my input, or is there any reason behind such abnormal behavior?
> >
> > --
> > Regards,
> > Chitra
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

--
Regards,
Chitra
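[Editor's note] To answer the question above: ClassicTokenizer preserves the pre-3.1 StandardTokenizer grammar, which was generated against an older Unicode version and only keeps characters its letter/digit classes recognize, while the current StandardTokenizer implements the word-break rules of Unicode Standard Annex #29 and therefore treats unusual alphabetic characters like those in 'ⒶeŘꝋꝒɫⱯŋɇ' as word characters. A small self-contained sketch (plain JDK, no Lucene needed; character classifications are per the Unicode data shipped with the JDK) illustrating why such characters fall outside a letter-only grammar:

```java
public class LetterCheck {
    public static void main(String[] args) {
        // 'Ⓐ' (U+24B6, CIRCLED LATIN CAPITAL LETTER A) has general category
        // So (Symbol, other) -- it is not a "letter" at all, so a grammar
        // built from letter/digit classes, like ClassicTokenizer's, drops it.
        System.out.println("isLetter('Ⓐ')            = " + Character.isLetter('Ⓐ'));
        System.out.println("'Ⓐ' is OTHER_SYMBOL      = "
                + (Character.getType('Ⓐ') == Character.OTHER_SYMBOL));

        // 'ꝋ' (U+A74B) IS a lowercase letter in current Unicode, but it was
        // only added in Unicode 5.1 -- after the grammar ClassicTokenizer
        // inherits was generated -- which is presumably why it is dropped too.
        System.out.println("isLetter('ꝋ')            = " + Character.isLetter('ꝋ'));
    }
}
```

UAX#29 derives its word-break classes from the broader Alphabetic property, which does include characters such as the circled letters, so StandardTokenizer keeps them; the downstream ASCIIFoldingFilter can then fold them to their plain ASCII equivalents.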