I've run into an issue where I think I'd like to use PositionFilter plus RemoveDuplicatesFilter to deduplicate tokens, effectively removing the impact of term frequency for a specific field, without having to convince my client to accept a Java plugin (phrase queries don't matter in this case).
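For concreteness, here's a minimal sketch of the kind of analyzer chain I mean, as a Solr fieldType (assuming Solr 4.x, where PositionFilterFactory still ships; tokenizer and lowercase filter choices are just illustrative). PositionFilter collapses every token onto the same position, and RemoveDuplicatesTokenFilter then drops tokens that share both term text and position -- so repeated terms index only once:

```xml
<!-- Sketch only: a copyField target that is indexed but never displayed -->
<fieldType name="text_dedupe" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Deprecated per LUCENE-4981: sets every position increment to 0 -->
    <filter class="solr.PositionFilterFactory"/>
    <!-- Removes tokens with the same term text at the same position -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, "foo bar foo foo" indexes just one "foo" and one "bar", which is exactly the term-frequency-flattening effect I'm after.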
I realized that PositionFilter is deprecated, per this Jira issue: https://issues.apache.org/jira/browse/LUCENE-4981

The best justification I can find for this deprecation is the invariant stated in the Jira issue:

> There are invariants that need to be maintained by token filters: all tokens that start at the same position must have the same start offset and all tokens that end at the same position (start position + position length) must have the same end offset (see ValidatingFilter). By arbitrarily changing position increments, PositionFilter breaks these invariants.

I question this invariant. I can see why it's important for several features, such as highlighting. On the other hand, it's extremely common to copy fields so that alternate analysis can be run on them (i.e. Solr copyFields). These fields will only ever be indexed and never displayed to the user. Does the invariant still matter in that case?

I could see adjusting offsets in an analyzer. However, I feel like offsets are a bit sacrosanct -- they refer to a character offset in the original document, not the result of analysis. Am I wrong in feeling this way?

So I question why PositionFilter was deprecated. The invariant makes sense for any field displayed to users, but we often create fields with different analyzer chains that don't need to concern themselves with features that care about the sanity of the token graph. It seems this should be a decision left up to developers.

Thoughts?

Cheers,
--
Doug Turnbull
Search & Big Data Architect
OpenSource Connections <http://o19s.com>