[I] WeightedSpanTermExtractor is fragile to modern query rewrites due to DelegatingLeafReader limitations [lucene]

via GitHub Thu, 05 Feb 2026 11:08:02 -0800


sjs004 opened a new issue, #15668:
URL: https://github.com/apache/lucene/issues/15668


   ### Description
   
   The `PlainHighlighter` via 
[WeightedSpanTermExtractor.java#L447](https://github.com/apache/lucene/blob/releases/lucene/10.3.2/lucene/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java#L447)
 is prone to crashing when rewriting modern Lucene query types. This happens 
because the internal `DelegatingLeafReader` used for highlighting throws 
`UnsupportedOperationException` when `getFieldInfos()` is called. 
   
   **The Recurrent Issue**
   
   As Lucene adds more optimizations to query rewrite() methods (e.g., checking 
index statistics or DocValues existence), these queries begin to fail when 
processed by the highlighter.
   
   This pattern was previously observed with `FieldExistsQuery` (fixed in 
https://github.com/apache/lucene/pull/12088 by explicitly ignoring it). I am 
now observing the same crash in 
[OpenSearch](https://github.com/opensearch-project/OpenSearch/issues/20496) 
with `SortedNumericDocValuesRangeQuery` (via `IndexOrDocValuesQuery`), which 
attempts to access `DocValuesSkipper` and subsequently `getFieldInfos()` during 
rewrite.
   
   This creates a "whack-a-mole" situation where every new query optimization 
potentially breaks the highlighter.
   
   **Example Failure**
   
   When highlighting a boolean query containing a range filter:
   
   ```
   Caused by: java.lang.UnsupportedOperationException
       at org.apache.lucene.search.highlight.WeightedSpanTermExtractor$DelegatingLeafReader.getFieldInfos(WeightedSpanTermExtractor.java:447)
       at org.apache.lucene.index.DocValuesSkipper.globalMinValue(DocValuesSkipper.java:137)
       at org.apache.lucene.document.SortedNumericDocValuesRangeQuery.rewrite(SortedNumericDocValuesRangeQuery.java:101)
   ```
   
   **Proposal & Discussion**
   
   I believe we should address this structurally rather than adding exceptions 
for every new query type.
   
   **Option 1**: The Structural Fix - Modify 
`DelegatingLeafReader.getFieldInfos()` to return `FieldInfos` instead of throwing 
an exception.
   
   PS: I am not very familiar with the Lucene codebase, so I am not sure whether 
this option is feasible or how long it would take to fix.
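   
   To make Option 1 concrete, here is a minimal, self-contained sketch of the idea. It does not use Lucene: `Reader`, `ThrowingReader`, `ForwardingReader`, and `rewrite` are hypothetical stand-ins, and the field metadata is reduced to a map from field name to "has doc values". The real `DelegatingLeafReader` may also remap field names, which this sketch ignores, so treat it as an illustration of the delegation pattern rather than a proposed patch.

```java
import java.util.Map;

/**
 * Simplified model of Option 1. Nothing here is a Lucene class: Reader,
 * ThrowingReader and ForwardingReader are hypothetical stand-ins.
 */
public class DelegatingReaderSketch {

    /** Stand-in for a leaf reader; maps field name to "has doc values". */
    interface Reader {
        Map<String, Boolean> getFieldInfos();
    }

    /** Models the behavior described in the issue: metadata lookups are rejected. */
    static final class ThrowingReader implements Reader {
        private final Reader in;

        ThrowingReader(Reader in) {
            this.in = in;
        }

        @Override
        public Map<String, Boolean> getFieldInfos() {
            throw new UnsupportedOperationException();
        }
    }

    /** Models Option 1: forward the metadata lookup to the wrapped reader. */
    static final class ForwardingReader implements Reader {
        private final Reader in;

        ForwardingReader(Reader in) {
            this.in = in;
        }

        @Override
        public Map<String, Boolean> getFieldInfos() {
            return in.getFieldInfos();
        }
    }

    /** Stand-in for a query rewrite that consults field metadata before choosing a path. */
    static String rewrite(Reader reader, String field) {
        Boolean hasDocValues = reader.getFieldInfos().get(field);
        return Boolean.TRUE.equals(hasDocValues) ? "doc-values path" : "fallback path";
    }

    public static void main(String[] args) {
        Reader base = () -> Map.of("price", Boolean.TRUE);
        // With forwarding, the rewrite can inspect the metadata and proceed.
        System.out.println(rewrite(new ForwardingReader(base), "price"));
        // With the throwing wrapper, the same rewrite fails.
        try {
            rewrite(new ThrowingReader(base), "price");
        } catch (UnsupportedOperationException e) {
            System.out.println("rewrite failed: " + e);
        }
    }
}
```

   The point of the sketch is that any rewrite which reads field metadata works against the forwarding wrapper without the highlighter having to know about that query type.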
   
   **Option 2**: The Targeted Fix - Continue the pattern established in [PR 
#12088](https://github.com/apache/lucene/pull/12088) by adding 
`IndexOrDocValuesQuery` (and others) to the "ignore" list in the 
`WeightedSpanTermExtractor.extract()` method.
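   
   The shape of Option 2 can be sketched the same way. The types below are hypothetical stand-ins, not Lucene classes, and the real `extract()` method handles many more query types. The assumption modeled here is that unrecognized queries fall through to a rewrite that may need field metadata, while queries on the ignore list carry no highlightable terms and can be skipped.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified model of Option 2. All query types here are hypothetical
 * stand-ins and do not match Lucene's real classes.
 */
public class ExtractIgnoreListSketch {

    interface Query {}

    static final class TermQuery implements Query {
        final String term;

        TermQuery(String term) {
            this.term = term;
        }
    }

    static final class BooleanQuery implements Query {
        final List<Query> clauses;

        BooleanQuery(List<Query> clauses) {
            this.clauses = clauses;
        }
    }

    /** Stand-in for a doc-values range filter: nothing to highlight. */
    static final class DocValuesRangeQuery implements Query {}

    /** Collects highlightable terms, skipping query types on the ignore list. */
    static void extract(Query query, List<String> terms) {
        if (query instanceof BooleanQuery) {
            for (Query clause : ((BooleanQuery) query).clauses) {
                extract(clause, terms);
            }
        } else if (query instanceof TermQuery) {
            terms.add(((TermQuery) query).term);
        } else if (query instanceof DocValuesRangeQuery) {
            // Ignore list: this query has no terms, so never attempt a rewrite.
        } else {
            // Models the fallback rewrite that fails when it needs field metadata.
            throw new UnsupportedOperationException(
                "rewrite not supported for " + query.getClass().getName());
        }
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        Query query = new BooleanQuery(
            List.of(new TermQuery("lucene"), new DocValuesRangeQuery()));
        extract(query, terms);
        System.out.println(terms);
    }
}
```

   The trade-off is visible in the final `else` branch: every query type that is not explicitly listed still reaches the fallback, which is the "whack-a-mole" concern raised above.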
   
   I am happy to submit a PR for Option 2, and possibly for Option 1 as well if 
the team suggests an approach.
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

