mikemccand commented on PR #16086:
URL: https://github.com/apache/lucene/pull/16086#issuecomment-4554027055
> For vectors, I wonder if more `checkAborted()` calls could be added too
when building large graphs during merges.
Actually does vectors merging today ever check for aborted, e.g. during the
concurrent HNSW graph building, except at the very start/end?
> > I'm also not a fan of only doing stored fields -- other Lucene index
files can be ginormous (vectors) -- can we impl for those as well?
>
> Yes I mentioned in the PR description that other files could benefit from
this but wanted to keep the radius of changes low at first.
Got it, +1 for PnP ("progress not perfection") for sure.
I think a lower level approach (wrapping all `IndexOutput` created during
merge, since we already wrap `Directory` to track which files exist, and
checking every N seconds) might apply broadly (vectors, stored fields, massive
postings files or doc values) with smaller blast radius on API surface.
> > For merges, `IndexWriter` already wraps the provided `Directory` to
track which files are created for Codec file-tracking purposes, I think? If so,
we could in theory wrap that to also instrument every `IndexInput` with this
"check every N seconds", maybe.
>
> I explored the `MergeScheduler#wrapForMerge` path but noticed that many
readers were not reopening the files/inputs for integrity checks but instead
use clones of already opened inputs, making it hard to wrap the index inputs.
The typical pattern seems to be reader -> reader merge instance > clone inputs
-> seek to zero (then skip by reading to compute the checksum).
Hmm I see, that is a wrinkle. I was thinking we could wrap the output path,
but you're right, `checkIntegrity` is doing tons of reading (from existing open
thingy) and no writing.
> I also explored wrapping the readers (`StoredFieldsReader`,
`DocValuesProducer etc`) using `MergePolicy.OneMerge#wrapForMerge(CodecReader)`
to override the checkIntegry methods but noticed that it would disable some
bulk merges optimizations down the road.
That is so tricky -- how did you catch that bulk optos broke? I hope we
have tests that get angry?
> I think we could start with option 2 as a simple first improvement in this
PR (doesn't interrupt in-progress checksums but avoids starting new ones on an
already-aborted merge) and follow up with option 1 or 3 to interrupt
long-running checksums.
+1 to at least start with #2 -- that's an easy win!
I also ... dare say ... is there any chance `Thread.interrupt` is usable
again? The wildly unexpected "I close your file handle because you interrupted
me while reading some bytes from it" side effect of `Thread.interrupt` is maybe
less of a big deal now (memory-mapped segments don't have this problem I
think)? `Thread.interrupt` should work for threads stuck deep in IO for
`checkIntegrity`, except, now you need to know which thread to interrupt.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]