Hi,

I have a question about the tokenization performed by WordDelimiterGraphFilter. I am not sure whether this is a bug or whether I am missing some flags when setting up the filter. Please have a look. I am using Lucene 6.6.1.

Here is a gist with the code: https://gist.github.com/parit/cecfd8f51c6d57a996d615ee82cb69a4#file-testanalyzer-java-L52

Input: cg7582pa

Expected tokens: cg7582pa <pos: 1> cg <pos: 0> 7582 <pos: 1> 7582pa <pos: 1> pa <pos: 2>

Observed: cg7582pa <pos: 1> cg <pos: 0> 7582 <pos: 1> pa <pos: 1>
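
In case it helps when reading the output: if the <pos: N> values above are position increments rather than absolute positions (this is an assumption, as is the class name PositionSketch, which is purely illustrative), they can be accumulated into absolute positions roughly as below. The sketch uses no Lucene classes. It only mirrors the convention that the position counter starts at -1, so a first increment of 1 corresponds to position 0.

```java
import java.util.Arrays;

class PositionSketch {
    // Accumulate position increments into absolute positions.
    // Assumption: the counter starts at -1, so a first increment of 1 yields position 0.
    static int[] toAbsolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int pos = -1;
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i];
            positions[i] = pos;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Terms and values copied from the "Observed" line, read as increments.
        String[] terms = {"cg7582pa", "cg", "7582", "pa"};
        int[] increments = {1, 0, 1, 1};
        int[] positions = toAbsolutePositions(increments);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " inc=" + increments[i] + " pos=" + positions[i]);
        }
    }
}
```

Read this way, an increment of 1 on the first token would not by itself mean that the token sits at position 1.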

Questions:

1. Why is the token 7582pa missing when I have set all the concatenation flags?

2. Shouldn't the position of the first token, i.e. cg7582pa, be 0 instead of 1?

3. Why is the last token, i.e. pa, given a position of 2 and not 1?

Looking forward to your suggestions.

- Best

Parit Bansal


