Easy: don't use ClassicTokenizer, use StandardTokenizer instead.

On Thu, Oct 19, 2017 at 9:37 AM, Chitra <chithu.r...@gmail.com> wrote:
> Hi,
> I indexed the term 'ⒶeŘꝋꝒɫⱯŋɇ' (aeroplane), and it was indexed as
> "er l n"; some characters were trimmed during indexing.
>
> Here is my code:
>
> protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader)
> {
>     final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
>     src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>
>     TokenStream tok = new ClassicFilter(src);
>     tok = new LowerCaseFilter(getVersion(), tok);
>     tok = new StopFilter(getVersion(), tok, stopwords);
>     tok = new ASCIIFoldingFilter(tok); // to enable accent-insensitive search
>
>     return new Analyzer.TokenStreamComponents(src, tok)
>     {
>         @Override
>         protected void setReader(final Reader reader) throws IOException
>         {
>             src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>             super.setReader(reader);
>         }
>     };
> }
>
> Am I missing anything? Is this expected behavior for my input, or is
> there a reason for such abnormal behavior?
>
> --
> Regards,
> Chitra
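A note on why the characters are trimmed, with a rough illustration in plain JDK code (no Lucene required; this is a sketch of the character classification involved, not the tokenizer itself): ClassicTokenizer uses a fixed, pre-Unicode-standard-annex grammar, so it drops or splits on characters it does not recognize as letters. 'Ⓐ' (U+24B6) is not even a Unicode letter but a symbol, and several of the other characters are letters added in later Unicode versions that Classic's frozen grammar predates. Only the runs it keeps (eŘ, ɫ, ŋ) survive to be lowercased and ASCII-folded into "er", "l", "n". StandardTokenizer, which follows Unicode word-break rules, keeps far more of the input intact.

```java
// Probe each code point of the problem term and report whether the JDK
// considers it a letter, and which Unicode general category it has.
// Hypothetical class name for illustration only.
public class CharProbe {
    public static void main(String[] args) {
        String term = "ⒶeŘꝋꝒɫⱯŋɇ";
        for (int i = 0; i < term.length(); i += Character.charCount(term.codePointAt(i))) {
            int cp = term.codePointAt(i);
            System.out.printf("U+%04X  letter=%-5b  category=%d%n",
                    cp, Character.isLetter(cp), Character.getType(cp));
        }
        // 'Ⓐ' (U+24B6) reports letter=false: it is in category OTHER_SYMBOL,
        // so even letter-based tokenization breaks there. The rest report
        // letter=true, yet ClassicTokenizer still drops some of them because
        // its grammar hard-codes older letter ranges.
    }
}
```

With StandardTokenizer in place of ClassicTokenizer in createComponents above, the whole term survives tokenization and the ASCIIFoldingFilter can then fold the accented forms.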
--------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org