mikemccand commented on code in PR #16086:
URL: https://github.com/apache/lucene/pull/16086#discussion_r3268156197
##########
lucene/core/src/java/org/apache/lucene/index/IndexWriterConfig.java:
##########
@@ -561,4 +561,21 @@ public IndexWriterConfig setParentField(String
parentField) {
this.parentField = parentField;
return this;
}
+
+ /**
+ * Expert: sets the interval in bytes between abort checks during merge
integrity verification.
+ * During merges, codec readers verify file integrity by reading entire
files to compute
+ * checksums. This setting controls how often the merge abort flag is
checked during these reads.
+ * Smaller values allow merges to be aborted more promptly but add slight
overhead.
+ *
+ * @param intervalBytes the interval in bytes, must be positive
+ */
+ public IndexWriterConfig setMergeAbortCheckIntervalBytes(int intervalBytes) {
Review Comment:
Let's make this `long`? In general if something is measuring bytes let's
try to use `long` -- we do math on such things that may lead to overflow?
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/FieldsIndex.java:
##########
@@ -38,6 +39,9 @@ final long getStartPointer(int docID) {
/** Check the integrity of the index. */
abstract void checkIntegrity() throws IOException;
+ /** Check the integrity of the index, with periodic merge abort checking. */
+ public abstract void checkIntegrity(MergePolicy.AbortChecker abortChecker)
throws IOException;
Review Comment:
Since we are only adding "check every N bytes", can we rename `AbortChecker`
to something more specific? Is this a public API, or will Lucene users only
interact with the setter/getter on IWC (taking just int or long)?
##########
lucene/core/src/java/org/apache/lucene/codecs/CodecUtil.java:
##########
@@ -675,4 +698,38 @@ public static int readBEInt(DataInput in) throws
IOException {
public static long readBELong(DataInput in) throws IOException {
return (((long) readBEInt(in)) << 32) | (readBEInt(in) & 0xFFFFFFFFL);
}
+
+ /**
+ * A {@link BufferedChecksumIndexInput} that periodically checks if the
current merge should be
+ * aborted.
+ *
+ * <p>The check is performed in {@link #readBytes} since that is what {@link
+ * ChecksumIndexInput#seek} calls in a loop to compute the checksum.
+ */
+ private static final class AbortableBufferedChecksumIndexInput
Review Comment:
Hmm, you're adding another level of buffering / copying, when someone turns
on this new setting? I think? Can we subclass `FilterIndexInput` instead, so
it's simply/only the added accounting (tracking total bytes written) and not
more buffering / copying?
##########
lucene/core/src/java/org/apache/lucene/index/LiveIndexWriterConfig.java:
##########
@@ -117,6 +117,9 @@ public class LiveIndexWriterConfig {
/** The IndexWriter event listener to record key events * */
protected IndexWriterEventListener eventListener;
+ /** Interval in bytes between abort checks during merge integrity
verification. */
+ protected volatile int mergeAbortCheckIntervalBytes;
Review Comment:
Is it only `checkIntegrity` that we are newly instrumenting here?
##########
lucene/core/src/java/org/apache/lucene/index/LiveIndexWriterConfig.java:
##########
@@ -461,6 +464,14 @@ public IndexWriterEventListener
getIndexWriterEventListener() {
return eventListener;
}
+ /**
+ * Returns the interval in bytes between abort checks during merge integrity
verification. See
+ * {@link IndexWriterConfig#setMergeAbortCheckIntervalBytes(int)}.
+ */
+ public int getMergeAbortCheckIntervalBytes() {
Review Comment:
`long` instead?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]