WordDelimiterGraphFilter does not respect KeywordAttribute

Michael Sokolov Fri, 20 Apr 2018 06:42:17 -0700

I have a use case that generates some tokens containing punctuation
(fractions and other numerical constructs), but I handle most
punctuation with WordDelimiterGraphFilter, which decomposes those
tokens into parts and recombines them, so e.g. 1/2 becomes {1, 2, 12}. I
thought at first that I could mark those tokens as keywords to prevent any
further analysis, but I discovered that WDGF ignores KeywordAttribute.


I have a workaround that uses Arabic-Indic digits as separators instead of
punctuation (1/2 -> 1١2). These are classified as digits, so WDGF does
not split on them, but someday I would like to support Arabic (or Hindi)
numerals as well, and then this hack will bite me.
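
For illustration only, here is a small check of why the separator trick
works. It uses the JDK's own character classification rather than any
Lucene code; the class name ArabicIndicDigitCheck is made up for this
sketch, and the assumption is that WDGF's notion of "digit" follows the
same Unicode category.

```java
// Illustrative only: shows how the JDK classifies U+0661 (ARABIC-INDIC DIGIT ONE).
final class ArabicIndicDigitCheck {

    // True when the code point is in the Unicode "decimal digit number" category.
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        System.out.println(isDecimalDigit(0x0661)); // the separator used in 1١2
        System.out.println(isDecimalDigit('/'));    // the original separator in 1/2
    }
}
```

The same property is what makes the hack fragile: once real Arabic-Indic
numbers appear in the input, the separator can no longer be told apart
from content.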

Does it seem reasonable to update WDGF (and its cousin WDF) to respect
KeywordAttribute? I think it can be done with a very small change.
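
A rough sketch of the requested behavior, as a toy model and not Lucene
code: the class KeywordAwareSplitter and its split method are hypothetical,
the boolean flag stands in for KeywordAttribute, and the splitting rule is
a simplification of what WDGF does.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT Lucene's WordDelimiterGraphFilter.
final class KeywordAwareSplitter {

    // Splits a token on non-alphanumeric characters and appends the
    // concatenation of the parts, unless the token is flagged as a keyword,
    // in which case it is passed through untouched.
    static List<String> split(String token, boolean isKeyword) {
        List<String> out = new ArrayList<>();
        if (isKeyword) {
            out.add(token);
            return out;
        }
        StringBuilder part = new StringBuilder();
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                part.appendCodePoint(cp);
                joined.appendCodePoint(cp);
            } else if (part.length() > 0) {
                // A delimiter ends the current part.
                out.add(part.toString());
                part.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (part.length() > 0) {
            out.add(part.toString());
        }
        // Only emit the concatenated form when the token was actually split.
        if (out.size() > 1) {
            out.add(joined.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("1/2", false)); // prints [1, 2, 12]
        System.out.println(split("1/2", true));  // prints [1/2]
    }
}
```

In the real filter the equivalent check would presumably read the token's
KeywordAttribute and return the token unchanged when it is set.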
