Copilot commented on code in PR #9690: URL: https://github.com/apache/ozone/pull/9690#discussion_r2755166346
########## hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RangeQueryIndex.java: ########## @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hdds.utils.db; + +import java.io.IOException; +import java.util.Map; +import java.util.Objects; +import java.util.Set; +import java.util.TreeMap; + +/** + * An index for answering "does this point fall within any of these ranges?" efficiently. + * + * <p>The indexed ranges are <em>half-open</em> intervals of the form + * {@code [startInclusive, endExclusive)}. + * + * <p><strong>Core idea (sweep-line / prefix-sum over range boundaries):</strong> + * Instead of scanning every range on each query, this index stores a sorted map from + * boundary points to a running count of "active" ranges at that point. + * + * <ul> + * <li>For each range {@code [s, e)}, we add a delta {@code +1} at {@code s} and a delta + * {@code -1} at {@code e}.</li> + * <li>We then convert the deltas into a prefix sum in key order, so every boundary key + * stores the number of ranges active at that coordinate.</li> + * <li>For any query point {@code k}, the active count is {@code floorEntry(k).value}. 
+ * If it is {@code > 0}, then {@code k} intersects at least one range.</li> + * </ul> + * + * <p><strong>Update model:</strong> this index supports only removing ranges that were part of the + * initial set. Removal updates the prefix sums for keys in {@code [startInclusive, endExclusive)} + * (net effect of removing {@code +1} at start and {@code -1} at end). + * + * <p><strong>Complexities:</strong> + * <ul> + * <li>Build: {@code O(R log B)} where {@code R} is #ranges and {@code B} is #distinct boundaries.</li> + * <li>{@link #containsIntersectingRange(Comparable)} (Object)}: {@code O(log B)}.</li> Review Comment: The complexity documentation contains a syntax error. The line states "{@link #containsIntersectingRange(Comparable)} (Object)}" but the "(Object)" appears to be a typo and should be removed. The correct format should be just "{@link #containsIntersectingRange(Comparable)}". ```suggestion * <li>{@link #containsIntersectingRange(Comparable)}: {@code O(log B)}.</li> ``` ########## hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/utils/db/TestRDBBatchOperation.java: ########## @@ -51,10 +58,19 @@ import org.rocksdb.RocksDBException; /** - * The TestRDBBatchOperation class provides test cases to validate the functionality of RDB batch operations - * in a RocksDB-based backend. It verifies the correct behavior of write operations using batch processing - * and ensures the integrity of operations like put and delete when performed in batch mode. - */ + * Test class for verifying batch operations with delete ranges using the + * RDBBatchOperation and MockedConstruction of ManagedWriteBatch. + * + * This test class includes: + * - Mocking and tracking of operations including put, delete, and delete range + * within a batch operation. + * - Validation of committed operations using assertions on collected data. + * - Ensures that the batch operation interacts correctly with the + * RocksDatabase and ColumnFamilyHandle components. 
+ * + * The test method includes: + * 1. Setup of mocked ColumnFamilyHandle and RocksDatabase.ColumnFamily. + * 2. Mocking of methods to track operations performed on*/ Review Comment: The JavaDoc comment for the TestRDBBatchOperation class is incomplete. The comment abruptly ends mid-sentence at "2. Mocking of methods to track operations performed on". This should be completed to provide a full description of what the test class does. ```suggestion * 2. Mocking of methods to track operations performed on the * ManagedWriteBatch and verification of the tracked operations after * the batch is committed. */ ``` ########## hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RangeQueryIndex.java: ########## @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hdds.utils.db; + +import java.io.IOException; +import java.util.Map; +import java.util.Objects; +import java.util.Set; +import java.util.TreeMap; + +/** + * An index for answering "does this point fall within any of these ranges?" efficiently. + * + * <p>The indexed ranges are <em>half-open</em> intervals of the form + * {@code [startInclusive, endExclusive)}. 
+ * + * <p><strong>Core idea (sweep-line / prefix-sum over range boundaries):</strong> + * Instead of scanning every range on each query, this index stores a sorted map from + * boundary points to a running count of "active" ranges at that point. + * + * <ul> + * <li>For each range {@code [s, e)}, we add a delta {@code +1} at {@code s} and a delta + * {@code -1} at {@code e}.</li> + * <li>We then convert the deltas into a prefix sum in key order, so every boundary key + * stores the number of ranges active at that coordinate.</li> + * <li>For any query point {@code k}, the active count is {@code floorEntry(k).value}. + * If it is {@code > 0}, then {@code k} intersects at least one range.</li> + * </ul> + * + * <p><strong>Update model:</strong> this index supports only removing ranges that were part of the + * initial set. Removal updates the prefix sums for keys in {@code [startInclusive, endExclusive)} + * (net effect of removing {@code +1} at start and {@code -1} at end). + * + * <p><strong>Complexities:</strong> + * <ul> + * <li>Build: {@code O(R log B)} where {@code R} is #ranges and {@code B} is #distinct boundaries.</li> + * <li>{@link #containsIntersectingRange(Comparable)} (Object)}: {@code O(log B)}.</li> + * <li>{@link #removeRange(Range)}: {@code O(log B + K)} where {@code K} is #boundaries in the range.</li> + * </ul> + * + * @param <T> boundary type (must be {@link Comparable} to be stored in a {@link TreeMap}) + */ +class RangeQueryIndex<T extends Comparable<T>> { + + private final TreeMap<T, Integer> rangeCountIndexMap; + private final Set<Range<T>> ranges; + + RangeQueryIndex(Set<Range<T>> ranges) { + this.rangeCountIndexMap = new TreeMap<>(); + this.ranges = ranges; + init(); + } + + private void init() { + // Phase 1: store boundary deltas (+1 at start, -1 at end). + for (Range<T> range : ranges) { + rangeCountIndexMap.compute(range.startInclusive, (k, v) -> v == null ? 
1 : v + 1); + rangeCountIndexMap.compute(range.endExclusive, (k, v) -> v == null ? -1 : v - 1); + } + + // Phase 2: convert deltas to prefix sums so each key holds the active range count at that coordinate. + int totalCount = 0; + for (Map.Entry<T, Integer> entry : rangeCountIndexMap.entrySet()) { + totalCount += entry.getValue(); + entry.setValue(totalCount); + } + } + + /** + * Remove a range from the index. + * + * <p>This method assumes the range set is "popped" over time (ranges are removed but not added). + * Internally, removing {@code [s, e)} decreases the active count by 1 for all boundary keys in + * {@code [s, e)} and leaves counts outside the range unchanged. + * + * @throws IOException if the given {@code range} is not part of the indexed set + */ + void removeRange(Range<T> range) throws IOException { + if (!ranges.contains(range)) { + throw new IOException(String.format("Range %s not found in index structure : %s", range, ranges)); + } + ranges.remove(range); + for (Map.Entry<T, Integer> entry : rangeCountIndexMap.subMap(range.startInclusive, true, + range.endExclusive, false).entrySet()) { + entry.setValue(entry.getValue() - 1); + } + } + + /** + * @return true iff {@code key} is contained in at least one indexed range. + * + * <p>Implementation detail: uses {@link TreeMap#floorEntry(Object)} to find the last boundary + * at or before {@code key}, and checks the prefix-summed active count at that point.</p> + */ + boolean containsIntersectingRange(T key) { + Map.Entry<T, Integer> countEntry = rangeCountIndexMap.floorEntry(key); + if (countEntry == null) { + return false; + } + return countEntry.getValue() > 0; + } + + /** + * Returns an intersecting range containing {@code key}, if any. 
+ * + * <p>This method first checks {@link #containsIntersectingRange(Comparable)} using the index; + * if the count indicates an intersection exists, it then scans the backing {@link #ranges} + * set to find a concrete {@link Range} that contains {@code key}.</p> + * + * <p>Note that because {@link #ranges} is a {@link Set}, "first" refers to whatever iteration + * order that set provides (it is not guaranteed to be deterministic unless the provided set is).</p> + * + * @return a containing range, or null if none intersect + */ + Range<T> getFirstIntersectingRange(T key) { + Map.Entry<T, Integer> countEntry = rangeCountIndexMap.floorEntry(key); + if (countEntry == null) { + return null; + } + for (Range<T> range : ranges) { + if (range.contains(key)) { + return range; + } + } + return null; + } + + /** + * A half-open interval {@code [startInclusive, endExclusive)}. + * + * <p>For a value {@code k} to be contained, it must satisfy: + * {@code startInclusive <= k < endExclusive} (according to {@link Comparable#compareTo(Object)}).</p> + */ + static final class Range<T extends Comparable<T>> { + private final T startInclusive; + private final T endExclusive; + + Range(T startInclusive, T endExclusive) { + this.startInclusive = Objects.requireNonNull(startInclusive, "start == null"); + this.endExclusive = Objects.requireNonNull(endExclusive, "end == null"); + } + + @Override + public boolean equals(Object o) { + return this == o; + } + + @Override + public int hashCode() { + return Objects.hash(startInclusive, endExclusive); Review Comment: There is a violation of the equals-hashCode contract in the Range class. The equals method uses identity comparison (this == o), while hashCode computes the hash based on the values of startInclusive and endExclusive. 
This means two Range instances with identical bounds will have the same hashCode but will not be equal according to equals, which violates the Java contract that states: "If two objects are equal according to the equals method, then calling hashCode on each of them must produce the same integer result." While this appears to be intentional based on the test cases (e.g., testDuplicateSameBoundsDifferentInstances), it could lead to unexpected behavior when using Range objects in hash-based collections and makes the class fragile. Consider either making equals value-based to match hashCode, or making hashCode identity-based (e.g., using System.identityHashCode) to match equals. ```suggestion return System.identityHashCode(this); ``` ########## hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RangeQueryIndex.java: ########## @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hdds.utils.db; + +import java.io.IOException; +import java.util.Map; +import java.util.Objects; +import java.util.Set; +import java.util.TreeMap; + +/** + * An index for answering "does this point fall within any of these ranges?" efficiently. 
+ * + * <p>The indexed ranges are <em>half-open</em> intervals of the form + * {@code [startInclusive, endExclusive)}. + * + * <p><strong>Core idea (sweep-line / prefix-sum over range boundaries):</strong> + * Instead of scanning every range on each query, this index stores a sorted map from + * boundary points to a running count of "active" ranges at that point. + * + * <ul> + * <li>For each range {@code [s, e)}, we add a delta {@code +1} at {@code s} and a delta + * {@code -1} at {@code e}.</li> + * <li>We then convert the deltas into a prefix sum in key order, so every boundary key + * stores the number of ranges active at that coordinate.</li> + * <li>For any query point {@code k}, the active count is {@code floorEntry(k).value}. + * If it is {@code > 0}, then {@code k} intersects at least one range.</li> + * </ul> + * + * <p><strong>Update model:</strong> this index supports only removing ranges that were part of the + * initial set. Removal updates the prefix sums for keys in {@code [startInclusive, endExclusive)} + * (net effect of removing {@code +1} at start and {@code -1} at end). + * + * <p><strong>Complexities:</strong> + * <ul> + * <li>Build: {@code O(R log B)} where {@code R} is #ranges and {@code B} is #distinct boundaries.</li> + * <li>{@link #containsIntersectingRange(Comparable)} (Object)}: {@code O(log B)}.</li> + * <li>{@link #removeRange(Range)}: {@code O(log B + K)} where {@code K} is #boundaries in the range.</li> + * </ul> + * + * @param <T> boundary type (must be {@link Comparable} to be stored in a {@link TreeMap}) + */ +class RangeQueryIndex<T extends Comparable<T>> { + + private final TreeMap<T, Integer> rangeCountIndexMap; + private final Set<Range<T>> ranges; + + RangeQueryIndex(Set<Range<T>> ranges) { + this.rangeCountIndexMap = new TreeMap<>(); + this.ranges = ranges; + init(); + } + + private void init() { + // Phase 1: store boundary deltas (+1 at start, -1 at end). 
+ for (Range<T> range : ranges) { + rangeCountIndexMap.compute(range.startInclusive, (k, v) -> v == null ? 1 : v + 1); + rangeCountIndexMap.compute(range.endExclusive, (k, v) -> v == null ? -1 : v - 1); + } + + // Phase 2: convert deltas to prefix sums so each key holds the active range count at that coordinate. + int totalCount = 0; + for (Map.Entry<T, Integer> entry : rangeCountIndexMap.entrySet()) { + totalCount += entry.getValue(); + entry.setValue(totalCount); + } + } + + /** + * Remove a range from the index. + * + * <p>This method assumes the range set is "popped" over time (ranges are removed but not added). + * Internally, removing {@code [s, e)} decreases the active count by 1 for all boundary keys in + * {@code [s, e)} and leaves counts outside the range unchanged. + * + * @throws IOException if the given {@code range} is not part of the indexed set + */ + void removeRange(Range<T> range) throws IOException { + if (!ranges.contains(range)) { + throw new IOException(String.format("Range %s not found in index structure : %s", range, ranges)); + } + ranges.remove(range); + for (Map.Entry<T, Integer> entry : rangeCountIndexMap.subMap(range.startInclusive, true, + range.endExclusive, false).entrySet()) { + entry.setValue(entry.getValue() - 1); + } + } + + /** + * @return true iff {@code key} is contained in at least one indexed range. + * + * <p>Implementation detail: uses {@link TreeMap#floorEntry(Object)} to find the last boundary + * at or before {@code key}, and checks the prefix-summed active count at that point.</p> + */ + boolean containsIntersectingRange(T key) { + Map.Entry<T, Integer> countEntry = rangeCountIndexMap.floorEntry(key); + if (countEntry == null) { + return false; + } + return countEntry.getValue() > 0; + } + + /** + * Returns an intersecting range containing {@code key}, if any. 
+ * + * <p>This method first checks {@link #containsIntersectingRange(Comparable)} using the index; + * if the count indicates an intersection exists, it then scans the backing {@link #ranges} + * set to find a concrete {@link Range} that contains {@code key}.</p> + * + * <p>Note that because {@link #ranges} is a {@link Set}, "first" refers to whatever iteration + * order that set provides (it is not guaranteed to be deterministic unless the provided set is).</p> + * + * @return a containing range, or null if none intersect + */ + Range<T> getFirstIntersectingRange(T key) { + Map.Entry<T, Integer> countEntry = rangeCountIndexMap.floorEntry(key); + if (countEntry == null) { + return null; + } + for (Range<T> range : ranges) { + if (range.contains(key)) { + return range; + } + } + return null; Review Comment: The documentation for getFirstIntersectingRange states that "This method first checks containsIntersectingRange using the index; if the count indicates an intersection exists, it then scans the backing ranges set", but the implementation doesn't actually verify that countEntry.getValue() > 0 before scanning the ranges set. This could lead to unnecessary O(n) scanning of the ranges set even when the index indicates no intersecting ranges exist. Consider adding a check like "if (countEntry.getValue() <= 0) { return null; }" after line 133 to match the documented behavior and improve performance. ########## hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBBatchOperation.java: ########## @@ -332,6 +468,11 @@ void delete(CodecBuffer key) { overwriteIfExists(new DeleteOp(key)); } + void deleteRange(byte[] startKey, byte[] endKey) { + delRangeCount++; + ops.put(opIndex.getAndIncrement(), new DeleteRangeOperation(startKey, endKey)); Review Comment: The deleteRange method doesn't update the batchSize field when adding a DeleteRangeOperation, unlike the put and delete methods which call overwriteIfExists and update batchSize accordingly. 
This inconsistency means that batch size tracking will be incomplete, potentially affecting batch operation monitoring and logging. Consider adding "batchSize += new DeleteRangeOperation(startKey, endKey).totalLength();" or restructuring to ensure batch size is properly tracked for delete range operations. ```suggestion DeleteRangeOperation op = new DeleteRangeOperation(startKey, endKey); batchSize += op.totalLength(); ops.put(opIndex.getAndIncrement(), op); ``` ########## hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBBatchOperation.java: ########## @@ -239,6 +251,40 @@ boolean closeImpl() { } } + /** + * Delete range operation to be applied to a {@link ColumnFamily} batch. + */ + private final class DeleteRangeOperation extends Op { Review Comment: DeleteRangeOperation should be made static, since the enclosing instance is not used. ```suggestion private static final class DeleteRangeOperation extends Op { ``` ########## pom.xml: ########## @@ -198,7 +198,7 @@ <protobuf.version>3.25.8</protobuf.version> <ranger.version>2.7.0</ranger.version> <!-- versions included in ratis-thirdparty, update in sync --> - <ratis-thirdparty.grpc.version>1.75.0</ratis-thirdparty.grpc.version> + <ratis-thirdparty.grpc.version>1.77.0</ratis-thirdparty.grpc.version> Review Comment: The gRPC version is being updated from 1.75.0 to 1.77.0, but this change is not mentioned in the PR description which focuses on optimizing search for deleted ranges. If this version bump is intentional and related to the changes in this PR, please update the PR description to explain why this dependency update is needed. If it's unrelated, consider moving it to a separate PR for clarity and to keep changes focused on a single purpose. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
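For readers following the RangeQueryIndex threads above, the sweep-line / prefix-sum technique the class javadoc describes can be illustrated with a minimal standalone sketch. This is not the PR's actual class: `PrefixSumRangeIndex`, `covers`, and `remove` are illustrative names, `int` bounds stand in for the generic `Comparable` type, and the backing range set (and its not-found check) is omitted for brevity.

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of a prefix-sum index over half-open ranges [start, end).
// Build: +1 at each startInclusive, -1 at each endExclusive, then convert the
// deltas to running sums in key order. A point k is covered iff floorEntry(k)
// exists and its prefix-summed count is > 0.
class PrefixSumRangeIndex {
  private final TreeMap<Integer, Integer> counts = new TreeMap<>();

  PrefixSumRangeIndex(int[][] ranges) {
    // Phase 1: boundary deltas.
    for (int[] r : ranges) {
      counts.merge(r[0], 1, Integer::sum);   // +1 at startInclusive
      counts.merge(r[1], -1, Integer::sum);  // -1 at endExclusive
    }
    // Phase 2: prefix sums, so each key holds the active range count there.
    int running = 0;
    for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
      running += e.getValue();
      e.setValue(running);
    }
  }

  // O(log B) point query: last boundary at or before key carries the count.
  boolean covers(int key) {
    Map.Entry<Integer, Integer> floor = counts.floorEntry(key);
    return floor != null && floor.getValue() > 0;
  }

  // Removal analog: decrement the active count for boundaries in [start, end).
  void remove(int start, int end) {
    for (Map.Entry<Integer, Integer> e
        : counts.subMap(start, true, end, false).entrySet()) {
      e.setValue(e.getValue() - 1);
    }
  }

  public static void main(String[] args) {
    PrefixSumRangeIndex idx =
        new PrefixSumRangeIndex(new int[][] {{5, 10}, {8, 12}});
    System.out.println(idx.covers(9));   // true: 9 lies in both [5,10) and [8,12)
    System.out.println(idx.covers(12));  // false: end bounds are exclusive
  }
}
```

Compared with scanning every range per query, this answers point queries in O(log B) over B distinct boundaries, which mirrors the build/query structure (`init()` and `containsIntersectingRange`) discussed in the review comments above.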
