Dubious tokenizing with WordDelimiterGraphFilter

Parit Bansal Mon, 22 Jan 2018 02:29:09 -0800

Hi,

I have a question about the tokenization performed by WordDelimiterGraphFilter. I am not sure whether this is a bug or whether I am missing some flags when setting up the GraphFilter. Please have a look. The Lucene version used is 6.6.1.

Here is a gist with the code: https://gist.github.com/parit/cecfd8f51c6d57a996d615ee82cb69a4#file-testanalyzer-java-L52


Input: cg7582pa

Expected tokens: cg7582pa <pos: 1> cg <pos: 0> 7582 <pos: 1> 7582pa <pos: 1> pa <pos: 2>


Observed: cg7582pa <pos: 1> cg <pos: 0> 7582 <pos: 1> pa <pos: 1>

Questions:

1. Why is the token 7582pa missing when I have set all the concatenation flags?

2. Shouldn't the position of the first token, i.e. cg7582pa, be 0 instead of 1?


3. Why is the last token, i.e. pa, given a position of 2 and not 1?
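
A note on questions 2 and 3: if the pos values printed above are position increments (what Lucene exposes through PositionIncrementAttribute) rather than absolute positions, then the absolute position of each token is the running sum of the increments, starting from -1. That reading is an assumption, since the gist code is not reproduced in this message. The sketch below is plain Java with no Lucene dependency; PositionDemo and absolutePositions are hypothetical names used only for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: turns per-token position
// increments into absolute positions by keeping a running sum.
class PositionDemo {

    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1; // start before the first token, so an increment of 1 lands on 0
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Terms and "pos" values copied from the observed output above,
        // ASSUMING each "pos" is a position increment.
        String[] terms = {"cg7582pa", "cg", "7582", "pa"};
        int[] increments = {1, 0, 1, 1};
        int[] positions = absolutePositions(increments);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " posInc=" + increments[i]
                    + " position=" + positions[i]);
        }
    }
}
```

Under that assumption, the observed output would place cg7582pa and cg at position 0, 7582 at position 1, and pa at position 2.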

Looking forward to your suggestions.

- Best

Parit Bansal
