anuragrai16 opened a new issue, #17877:
URL: https://github.com/apache/pinot/issues/17877
Lucene-based indexes (TEXT_MATCH and HNSW vector) might have gaps in query
resource tracking and cancellation that can lead to uncontrolled memory
consumption and potential index corruption during OOM-triggered query killing.
When the OOM killer in _QueryResourceAggregator_ decides to terminate a query,
it calls _QueryExecutionContext.terminate()_ which sets a TerminationException
and calls _Future.cancel(true)_ on registered tasks. This mechanism works well
for standard operators but breaks down for Lucene-based indexes in the three
ways described below.
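To illustrate why _Future.cancel(true)_ alone is not enough, here is a minimal JDK-only sketch. It does not use Pinot or Lucene classes; `CancelDemo` and `runAndCancel` are hypothetical names. It assumes the collection loop is CPU-bound and never blocks, which is roughly what a collector's `collect` loop looks like.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Stand-alone illustration; none of these names come from Pinot or Lucene.
class CancelDemo {

  /**
   * Starts a CPU-bound loop on a worker thread, calls cancel(true) before the
   * loop begins iterating, and returns how many iterations still ran.
   */
  static long runAndCancel(long totalDocs, boolean checkInterrupt) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch finished = new CountDownLatch(1);
    AtomicBoolean go = new AtomicBoolean(false);
    AtomicLong processed = new AtomicLong();
    try {
      Future<?> task = pool.submit(() -> {
        started.countDown();
        while (!go.get()) {
          Thread.yield(); // spin without blocking so the interrupt is not consumed
        }
        for (long i = 0; i < totalDocs; i++) {
          // Only a loop that looks at the interrupt flag can react to cancel(true).
          if (checkInterrupt && Thread.currentThread().isInterrupted()) {
            break;
          }
          processed.incrementAndGet();
        }
        finished.countDown();
      });
      started.await();   // the task is now running on the worker thread
      task.cancel(true); // sets the worker's interrupt flag, nothing more
      go.set(true);      // let the loop start
      finished.await();
      return processed.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("no check:   " + runAndCancel(100_000L, false));
    System.out.println("with check: " + runAndCancel(100_000L, true));
  }
}
```

Without the check, the loop processes all 100,000 items even though the task was cancelled first; with the check, it processes none. A collector that never polls for termination behaves like the first case.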
**Issue 1: Immutable Lucene Collectors Lack Termination Checks**
Current logic in LuceneDocIdCollector and HnswDocIdCollector:
`public void collect(int doc) throws IOException {
  _docIds.add(_docIdTranslator.getPinotDocId(context.docBase + doc));
}`
The collectors have no mechanism to check for query termination during
document collection. A query scanning millions of documents will run to
completion, ignoring OOM kill signals. The collectors should replicate the
periodic check that QueryThreadContext performs, for example:
`private static final int CHECK_INTERVAL_MASK = 0x1FFF; // Check every 8192 docs
private int _docsCollected = 0;
public void collect(int doc) throws IOException {
  if ((_docsCollected++ & CHECK_INTERVAL_MASK) == 0) {
    TerminationException ex = QueryThreadContext.getTerminateException();
    if (ex != null) throw new CollectionTerminatedException();
    QueryThreadContext.sampleUsage();
  }
  _docIds.add(_docIdTranslator.getPinotDocId(context.docBase + doc));
}`
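The masked counter above can be exercised in isolation to see how quickly a collector would react. The following is a minimal sketch with hypothetical stand-ins: `MaskedCheckDemo`, `CountingCollector` and `TerminatedException` are not Pinot or Lucene types, and an `AtomicBoolean` stands in for `QueryThreadContext.getTerminateException() != null`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-ins only: no Lucene or Pinot types are used here.
class MaskedCheckDemo {
  static final int CHECK_INTERVAL_MASK = 0x1FFF; // 8191, i.e. the low 13 bits

  /** Unchecked stand-in for the exception a real collector would throw. */
  static class TerminatedException extends RuntimeException { }

  static class CountingCollector {
    private final AtomicBoolean terminated; // stand-in for the termination signal
    private int docsSeen = 0;
    private int docsCollected = 0;

    CountingCollector(AtomicBoolean terminated) {
      this.terminated = terminated;
    }

    void collect(int doc) {
      // The flag is read only on calls 0, 8192, 16384, ... to keep the hot path cheap.
      if ((docsSeen++ & CHECK_INTERVAL_MASK) == 0 && terminated.get()) {
        throw new TerminatedException();
      }
      docsCollected++;
    }

    int collected() {
      return docsCollected;
    }
  }

  /** Raises the flag just before doc terminateAtDoc and returns how many docs were collected. */
  static int collectUntilTerminated(int totalDocs, int terminateAtDoc) {
    AtomicBoolean terminated = new AtomicBoolean(false);
    CountingCollector collector = new CountingCollector(terminated);
    try {
      for (int doc = 0; doc < totalDocs; doc++) {
        if (doc == terminateAtDoc) {
          terminated.set(true); // simulates the OOM killer acting from outside
        }
        collector.collect(doc);
      }
    } catch (TerminatedException e) {
      // Collection stopped early, which is the expected outcome here.
    }
    return collector.collected();
  }

  public static void main(String[] args) {
    System.out.println(collectUntilTerminated(50_000, 10_000)); // prints 16384
  }
}
```

With the flag raised at doc 10,000, collection stops at the next check boundary, so 16,384 docs are collected. The interval is a trade-off: a kill signal can go unnoticed for up to 8,191 further docs in exchange for not reading shared state on every call.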
**Issue 2: Realtime Lucene Search Threads Are Invisible to OOM Killer**
Realtime text searches run in _RealtimeLuceneTextIndexSearcherPool_, a
separate thread pool without _QueryThreadContext_ propagation. The OOM killer
cannot see memory allocated by these threads; it only sees the worker thread
blocked on _future.get()_ with minimal resource usage.
Proposed fix:
`Callable<MutableRoaringBitmap> searchCallable = () -> {
  if (parentExecutionContext != null && parentAccountant != null) {
    try (QueryThreadContext ignored = QueryThreadContext.open(
        parentExecutionContext, parentAccountant)) {
      return executeSearch(searchQuery, docIDCollector);
    }
  }
  return executeSearch(searchQuery, docIDCollector);
};`
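The propagation idea in the snippet above can be shown with plain JDK classes. This is a minimal sketch, assuming the query context is thread-local state that is opened and closed around the task; `QueryScope`, `searchOnPool` and the query id string are hypothetical and do not mirror the real `QueryThreadContext` API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical names throughout; this does not mirror Pinot's QueryThreadContext.
class ContextPropagationDemo {

  /** Thread-local scope that is installed on open() and cleared on close(). */
  static final class QueryScope implements AutoCloseable {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    private QueryScope() { }

    static QueryScope open(String queryId) {
      CURRENT.set(queryId);
      return new QueryScope();
    }

    static String currentQueryId() {
      return CURRENT.get();
    }

    @Override
    public void close() {
      CURRENT.remove(); // avoid leaking the scope to the next task on this pooled thread
    }
  }

  /** Returns the query id visible to a task running on a separate search pool. */
  static String searchOnPool(String parentQueryId, boolean propagate) {
    ExecutorService searchPool = Executors.newSingleThreadExecutor();
    try {
      Callable<String> search = () -> {
        if (propagate && parentQueryId != null) {
          try (QueryScope ignored = QueryScope.open(parentQueryId)) {
            return QueryScope.currentQueryId(); // work here is attributable to the query
          }
        }
        return QueryScope.currentQueryId(); // no scope: the pool thread is anonymous
      };
      return searchPool.submit(search).get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      searchPool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(searchOnPool("q1", true));  // prints q1
    System.out.println(searchOnPool("q1", false)); // prints null
  }
}
```

Without propagation the pool thread has no link back to the query, which is the situation the OOM killer is in today for realtime text searches.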
We cannot interrupt the Realtime Lucene IndexSearcher the same way we
interrupt other worker threads, so it also needs the cooperative termination
check described in Issue 1. _RealtimeLuceneDocIdCollector_ should become:
`public void collect(int doc) throws IOException {
  if (_shouldCancel) throw new CollectionTerminatedException();
  if ((_docsCollected++ & CHECK_INTERVAL_MASK) == 0) {
    if (QueryThreadContext.getTerminateException() != null) {
      throw new CollectionTerminatedException();
    }
    QueryThreadContext.sampleUsage();
  }
  _docIds.add(context.docBase + doc);
}`
**Issue 3: MutableVectorIndex Vulnerable to Index Corruption**
MutableVectorIndex runs its Lucene search synchronously on the calling thread
while an IndexWriter is active. If that thread is interrupted mid-search
because the query is being OOM-killed, the index can be corrupted.
It should apply the same async pattern as `RealtimeLuceneTextIndex`:
execute the search in a separate thread pool, propagate the query context for
tracking, and use cooperative cancellation via flags rather than thread
interrupts.
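A rough sketch of that pattern, using only JDK classes: the search runs on its own pool and is stopped through a volatile flag, so no thread is ever interrupted. `CooperativeCancelDemo`, `CancellableSearch` and `searchThenCancel` are hypothetical names, and the two latches are a test hook added only to make the example deterministic.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; it does not use MutableVectorIndex or any Lucene class.
class CooperativeCancelDemo {

  static final class CancellableSearch implements Callable<Integer> {
    private final int totalDocs;
    private final int pauseAtDoc;
    final CountDownLatch paused = new CountDownLatch(1); // test hook
    final CountDownLatch resume = new CountDownLatch(1); // test hook
    private volatile boolean shouldCancel = false;

    CancellableSearch(int totalDocs, int pauseAtDoc) {
      this.totalDocs = totalDocs;
      this.pauseAtDoc = pauseAtDoc;
    }

    /** Requests cancellation by setting a flag; never calls Thread.interrupt(). */
    void cancel() {
      shouldCancel = true;
    }

    @Override
    public Integer call() throws InterruptedException {
      int collected = 0;
      for (int doc = 0; doc < totalDocs; doc++) {
        if (doc == pauseAtDoc) { // test hook: let the caller cancel at a known point
          paused.countDown();
          resume.await();
        }
        if (shouldCancel) {
          break; // the search thread exits on its own terms
        }
        collected++;
      }
      return collected;
    }
  }

  /** Runs the search on a separate pool, cancels it at cancelAtDoc, returns docs collected. */
  static int searchThenCancel(int totalDocs, int cancelAtDoc) {
    if (cancelAtDoc < 0 || cancelAtDoc >= totalDocs) {
      throw new IllegalArgumentException("cancelAtDoc must be within [0, totalDocs)");
    }
    ExecutorService searchPool = Executors.newSingleThreadExecutor();
    CancellableSearch search = new CancellableSearch(totalDocs, cancelAtDoc);
    try {
      Future<Integer> future = searchPool.submit(search);
      search.paused.await();
      search.cancel();           // flag only, so in-flight I/O is never interrupted
      search.resume.countDown();
      return future.get();       // the task finishes normally and reports partial progress
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      searchPool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(searchThenCancel(1_000, 250)); // prints 250
  }
}
```

Because the caller never uses `Future.cancel(true)`, the search thread always unwinds through its own code path, which is the property the fix for MutableVectorIndex needs.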