LakshSingla commented on code in PR #15243:
URL: https://github.com/apache/druid/pull/15243#discussion_r1391976371
##########
indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/TombstoneHelper.java:
##########
@@ -173,66 +188,66 @@ public Set<DataSegment>
computeTombstoneSegmentsForReplace(
* They should be aligned with the
replaceGranularity
* @param dataSource Datasource on which the replace is to be
performed
* @param replaceGranularity Granularity of the replace query
+ * @param maxBuckets Maximum number of partition buckets. If the
number of computed tombstone buckets
+ * exceeds this threshold, the method will throw
an error.
* @return Intervals computed for the tombstones
* @throws IOException
*/
public Set<Interval> computeTombstoneIntervalsForReplace(
List<Interval> intervalsToDrop,
List<Interval> intervalsToReplace,
String dataSource,
- Granularity replaceGranularity
+ Granularity replaceGranularity,
+ int maxBuckets
) throws IOException
{
Set<Interval> retVal = new HashSet<>();
List<Interval> usedIntervals =
getExistingNonEmptyIntervalsOfDatasource(intervalsToReplace, dataSource);
+ int buckets = 0;
for (Interval intervalToDrop : intervalsToDrop) {
for (Interval usedInterval : usedIntervals) {
Interval overlap = intervalToDrop.overlap(usedInterval);
- // No overlap of the dropped segment with the used interval due to
which we donot need to generate any tombstone
+ // No overlap of the dropped segment with the used interval due to
which we do not need to generate any tombstone
if (overlap == null) {
continue;
}
- // "overlap" might not be aligned with the if the used interval is not
aligned with the granularity of
- // the REPLACE i.e. datasource's original granularity and replace's
granularity are different
-
- // However, we align the boundaries of the overlap with the
replaceGranularity manually, in the following code.
-
- DateTime alignedIntervalStart =
replaceGranularity.bucketStart(overlap.getStart());
- long alignedIntervalStartMillis =
Math.max(alignedIntervalStart.getMillis(), JodaUtils.MIN_INSTANT);
- // If the start is aligned, then 'bucketStart()' is unchanged.
- // Else 'bucketStart()' will return the latest timestamp less than
overlap.getStart() which aligns with the REPLACE granularity.
-
- // That extra interval that we are adding before the overlap should be
contained in 'intervalToDrop' because
- // intervalToDrop is aligned by the replaceGranularity.
- // If the drop's interval is n, then the extra interval would start
from n + 1 (where 1 denotes the replaceGranularity)
- // The overlap's beginning would always be later than intervalToDrop
(trivially,
- // because it is the overlap) and if bucketStart floors the overlap
beginning, it cannot floor it before
- // the intervalToDrop's start
-
- // For example, if the replace granularity is DAY, intervalsToReplace
are 20/02/2023 - 24/02/2023 (always
- // aligned with the replaceGranularity), intervalsToDrop are
22/02/2023 - 24/02/2023 (they must also be aligned with the replaceGranularity)
- // If the relevant usedIntervals for the datasource are from
22/02/2023 01:00:00 - 23/02/2023 02:00:00, then
- // the overlap would be 22/02/2023 01:00:00 - 23/02/2023 02:00:00.
When iterating over the overlap we will get
- // the intervals from 22/02/2023 01:00:00 - 23/02/2023 02:00:00. After
aligning it would become
- // 22/02/2023T00:00:00Z - 23/02/2023T23:59:59Z
-
- // If the end is aligned, then we do not alter it, else we align the
end by geting the earliest time later
- // than the overlap's end which aligns with the replace granularity.
Using the above-mentioned logic for the
- // start time, we can also argue that the rounded up end would be
contained in the intervalToDrop
- DateTime alignedIntervalEnd;
- if
(replaceGranularity.bucketStart(overlap.getEnd()).equals(overlap.getEnd())) {
// Check if the end is aligned
- alignedIntervalEnd = overlap.getEnd();
+ if (Intervals.ETERNITY.getStart().equals(overlap.getStart())) {
+ // Generate a tombstone interval covering the negative eternity
interval.
+ buckets = validateAndGetBuckets(buckets + 1, maxBuckets);
+ retVal.add(new Interval(overlap.getStart(),
replaceGranularity.bucketStart(overlap.getEnd())));
+ } else if ((Intervals.ETERNITY.getEnd()).equals(overlap.getEnd())) {
+ // Generate a tombstone interval covering the positive eternity
interval.
+ buckets = validateAndGetBuckets(buckets + 1, maxBuckets);
+ retVal.add(new
Interval(replaceGranularity.bucketStart(overlap.getStart()), overlap.getEnd()));
} else {
- alignedIntervalEnd = replaceGranularity.bucketEnd(overlap.getEnd());
- }
- long alignedIntervalEndMillis =
Math.min(alignedIntervalEnd.getMillis(), JodaUtils.MAX_INSTANT);
- Interval alignedTombstoneInterval =
Intervals.utc(alignedIntervalStartMillis, alignedIntervalEndMillis);
+ // Overlap might not be aligned with the granularity if the used
interval is not aligned with the granularity
+ // However when fetching from the iterator, the first interval is
found using the bucketStart, which
+ // ensures that the interval is "rounded down" to align with the
granularity.
+ // Also, the interval would always be contained inside the
"intervalToDrop" because the original REPLACE
+ // is aligned by the granularity, and by extension all the elements
inside the intervals to drop would
+ // also be aligned by the same granularity (since intervalsToDrop =
replaceIntervals - publishIntervals, and
+ // the right-hand side is always aligned)
+ //
+ // For example, if replaceGranularity is DAY, intervalsToReplace is
[2023-02-20, 2023-02-24) (always
+ // aligned with the replaceGranularity), intervalsToDrop is
[2023-02-22, 2023-02-24) (they must also be aligned with the replaceGranularity)
+ // If the usedIntervals for the datasource is from
[2023-02-22T01:00:00Z, 2023-02-23T02:00:00Z), then
+ // the overlap would be [2023-02-22T01:00:00Z,
2023-02-23T02:00:00Z). When iterating over the overlap we will get
+ // the intervals from [2023-02-22, 2023-02-23) and [2023-02-23,
2023-02-24).
+ IntervalsByGranularity intervalsToDropByGranularity = new
IntervalsByGranularity(
+ ImmutableList.of(overlap),
+ replaceGranularity
+ );
- retVal.add(alignedTombstoneInterval);
+ // Helps in deduplication if required. Since all the intervals are
uniformly granular, there should be no
+ // no overlap post deduplication
+ HashSet<Interval> intervals =
Sets.newHashSet(intervalsToDropByGranularity.granularityIntervalsIterator());
Review Comment:
This is problematic I assume. If the overlap is `ETERNITY`,
`Sets.newHashSet(intervalsToDropByGranularity.granularityIntervalsIterator)`
would materialize all the buckets, which would OOM, unless there's a limit on
the iterator, which I missed.
The iterator should be limited, or we should walk through the iterator and
then pick up the items as we traverse.
I have added a comment about another test case, which I feel would OOM due
to this line
##########
extensions-core/multi-stage-query/src/test/java/org/apache/druid/msq/exec/MSQFaultsTest.java:
##########
@@ -403,6 +413,52 @@ public void testReplaceWithAppendAndSharedLocks()
}
}
+ @Test
+ public void testReplaceTombstonesWithTooManyBucketsThrowsFault()
+ {
+ RowSignature rowSignature = RowSignature.builder()
+ .add("__time", ColumnType.LONG)
+ .add("dim1", ColumnType.STRING)
+ .add("cnt",
ColumnType.LONG).build();
+
+ // Create a datasegment which lies partially outside the generated segment
+ DataSegment existingDataSegment = DataSegment.builder()
+
.interval(Intervals.of("2001-01-01T/2003-01-04T"))
Review Comment:
Can you also add a test case where this is eternity, the overwrite clause is
eternity and there is no WHERE clause.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]