Hi,

If you have specific requirements, you can use PatternTokenizer or
CharTokenizer.fromSeparatorCharPredicate() as your tokenizer. To build an
Analyzer around it, use CustomAnalyzer. That gives you full flexibility!
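
As a rough sketch of the separator-predicate idea, the plain-Java snippet below
splits your example input on every character that is not a letter or digit. The
class name FileNameTokens, the SEPARATOR constant and the choice of separator
set are assumptions made for illustration only; adjust the predicate to
whatever delimiters your testing turns up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class FileNameTokens {
    // Assumed separator set: any code point that is not a letter or digit
    // ends the current token ('-', '_', '.', whitespace, ...).
    static final IntPredicate SEPARATOR = cp -> !Character.isLetterOrDigit(cp);

    // Splits the input into runs of non-separator code points.
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (SEPARATOR.test(cp)) {
                // A separator closes the token collected so far, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("mkt-4-elltvs-101_electrical_load_list.pdf"));
    }

    // With Lucene on the classpath, the same IntPredicate can be handed to the
    // tokenizer instead of the hand-written loop above:
    //   Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(FileNameTokens.SEPARATOR);
}
```

The loop only illustrates what the predicate decides; inside Lucene the
tokenizer does the splitting and you supply just the predicate.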

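StandardTokenizer tokenizes according to the Unicode standard; see the
Javadocs: "This class implements the Word Break rules from the Unicode Text
Segmentation algorithm, as specified in Unicode Standard Annex #29." Under
those rules, an underscore or a period between letters or digits does not end
a word, which is why "101_electrical_load_list.pdf" stays a single token.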

Since your input looks like file names, you should build your own Analyzer:
StandardAnalyzer with StandardTokenizer is made for text, not file names.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: sandesh.yapuram [mailto:sandesh.yapu...@seclore.com]
> Sent: Friday, July 28, 2017 11:26 AM
> To: dev@lucene.apache.org
> Subject: Add more stop characters to StandardAnalyzer
> 
> Hello,
> I am using Lucene 6.3.0 and I am trying to index file names and allow
> searching on them.
> I'm facing a problem because StandardAnalyzer isn't giving me the tokens I
> was expecting.
> input:
>            mkt-4-elltvs-101_electrical_load_list.pdf
> 
> Expected output:
>            mkt
>            4
>            elltvs
>            101
>            electrical
>            load
>            list
>            pdf
> 
> Actual output:
>            mkt
>            4
>            elltvs
>            101_electrical_load_list.pdf
> 
> 
> So basically I want StandardAnalyzer to treat underscores (_) and periods (.)
> as delimiters too. I may also have to add more delimiters in the future,
> depending on what my testing shows.
> Which class do I need to edit/extend/rewrite to achieve this? Or is there
> any option to provide a list of delimiters?
> 
> I have also tried the other analyzers (Classic, Shingle, Whitespace, Simple),
> but none came close.
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Add-
> more-stop-characters-to-StandardAnalyzer-tp4348048.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org

