Hi,
I have a question about the tokenization performed by
WordDelimiterGraphFilter. I am not sure if this is a bug or maybe I am
missing some flags in setting up the GraphFilter. Please have a look.
The Lucene version used is 6.6.1.
Here is a gist with the code:
https://gist.github.com/parit/cecfd8f51c6d57a996d615ee82cb69a4#file-testanalyzer-java-L52
Input: cg7582pa
Expected tokens: cg7582pa <pos: 1> cg <pos: 0> 7582 <pos: 1>
7582pa <pos: 1> pa <pos: 2>
Observed: cg7582pa <pos: 1> cg <pos: 0> 7582 <pos: 1> pa <pos: 1>
Questions:
1. Why is the token 7582pa missing when I have set all the concatenation
flags?
2. Shouldn't the position of the first token, i.e. cg7582pa, be 0 instead
of 1?
3. Why is the last token, i.e. pa, given a position of 2 and not 1?
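
For reference, here is a minimal sketch of how the numbers above could be read, assuming the <pos: N> values are position increments (as reported by PositionIncrementAttribute) rather than absolute positions. Lucene accumulates increments starting from -1, so a first increment of 1 corresponds to absolute position 0. The class and method names are hypothetical and not part of Lucene; the token list is hard-coded from the observed output.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: converts position increments
// into absolute positions with a running sum.
class PositionSketch {

    // The running position starts at -1, so a first increment of 1
    // lands on absolute position 0.
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Tokens and <pos: N> values copied from the observed output,
        // read here as position increments (an assumption).
        String[] tokens = {"cg7582pa", "cg", "7582", "pa"};
        int[] increments = {1, 0, 1, 1};
        int[] positions = absolutePositions(increments);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> position " + positions[i]);
        }
    }
}
```

Under that reading, cg7582pa and cg share absolute position 0, 7582 is at position 1, and pa is at position 2.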
Looking forward to your suggestions.
- Best
Parit Bansal