maytasm commented on a change in pull request #10371:
URL: https://github.com/apache/druid/pull/10371#discussion_r488390448
##########
File path:
server/src/main/java/org/apache/druid/server/coordinator/duty/CompactSegments.java
##########
@@ -238,25 +272,102 @@ private CoordinatorStats makeStats(int numCompactionTasks, CompactionSegmentIter
   {
     final CoordinatorStats stats = new CoordinatorStats();
     stats.addToGlobalStat(COMPACTION_TASK_COUNT, numCompactionTasks);
-    totalSizesOfSegmentsAwaitingCompactionPerDataSource = iterator.totalRemainingSegmentsSizeBytes();
-    totalSizesOfSegmentsAwaitingCompactionPerDataSource.object2LongEntrySet().fastForEach(
-        entry -> {
-          final String dataSource = entry.getKey();
-          final long totalSizeOfSegmentsAwaitingCompaction = entry.getLongValue();
-          stats.addToDataSourceStat(
-              TOTAL_SIZE_OF_SEGMENTS_AWAITING_COMPACTION,
-              dataSource,
-              totalSizeOfSegmentsAwaitingCompaction
-          );
-        }
-    );
+
+    // Make sure that the iterator iterates through all the remaining segments so that we can get accurate and correct
+    // statistics (remaining, skipped, processed, etc.). The reason we have to do this explicitly here is that
+    // earlier (while iterating to submit compaction tasks) we may have run out of task slots and been unable
+    // to iterate to the first segment that needs compaction for some datasource.
+    iterator.flushAllSegments();
Review comment:
I also thought of having two implementations of the statistics calculation. We
could keep the old method, which simply sums all segments from the earliest
up to the segment we iterated to. We can call this the approximation; its
downside is that it can overestimate when there are already-compacted intervals
between the earliest segment and the one we iterated to. We would then also
have this new implementation, which is an exact calculation (i.e. we check each
and every interval). Each datasource could choose which implementation to use
for its stats calculation.
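The difference between the two strategies can be sketched as follows. This is a minimal, hypothetical illustration (the class, record, and method names are made up for this sketch and are not Druid's actual `CompactionSegmentIterator` API): the approximation sums every interval up to the iteration point, while the exact version skips intervals that are already compacted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RemainingSizeSketch
{
  // Hypothetical per-interval stat: total segment bytes and whether the
  // interval has already been compacted.
  record IntervalStat(long sizeBytes, boolean alreadyCompacted) {}

  /**
   * Approximate: sum everything from the earliest interval up to the
   * iteration point, including intervals that are already compacted.
   * Cheap, but overestimates when compacted intervals sit in between.
   */
  static long approximateRemaining(Map<String, IntervalStat> intervals)
  {
    return intervals.values().stream()
                    .mapToLong(IntervalStat::sizeBytes)
                    .sum();
  }

  /**
   * Exact: walk each and every interval and count only those still
   * awaiting compaction. More work, but accurate.
   */
  static long exactRemaining(Map<String, IntervalStat> intervals)
  {
    return intervals.values().stream()
                    .filter(s -> !s.alreadyCompacted())
                    .mapToLong(IntervalStat::sizeBytes)
                    .sum();
  }

  public static void main(String[] args)
  {
    Map<String, IntervalStat> intervals = new LinkedHashMap<>();
    intervals.put("2020-01-01/2020-01-02", new IntervalStat(100L, false));
    intervals.put("2020-01-02/2020-01-03", new IntervalStat(50L, true)); // already compacted
    intervals.put("2020-01-03/2020-01-04", new IntervalStat(25L, false));
    System.out.println(approximateRemaining(intervals)); // 175
    System.out.println(exactRemaining(intervals));       // 125
  }
}
```

With a compacted interval in the middle, the approximation reports 175 bytes remaining while the exact walk reports 125, which is the overestimation described above.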
However, I discussed this with @jihoonson and we think that this new
implementation is actually fine to use as the default for all datasources. This
code runs as part of a Coordinator duty, so it is not on a critical path and
no client is waiting on it. While it does go through all segments, it works at
the interval level (we process a batch of segments for each interval at a
time), so the loop is over every interval of every datasource, not every
segment. Also, auto compaction runs every 30 minutes by default, which should
be plenty of time for this code to finish before the next cycle begins.
Furthermore, we can move the auto compaction duty to the end of the set of
Coordinator duties and change the Coordinator duty scheduling from
scheduleWithFixedDelay to scheduleAtFixedRate. This should ensure that the
other duties still run every 30 minutes (assuming auto compaction doesn't go
past 30 minutes, which I doubt it will).
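For reference, the scheduling semantics being contrasted are the standard Java `ScheduledExecutorService` ones: with a fixed delay, the next run starts a full period after the previous run *finishes*, so a slow duty pushes every later cycle back; at a fixed rate, runs are anchored to the period, so the cadence is preserved as long as each run fits within it. A minimal sketch (the millisecond timings here are illustrative, not the Coordinator's actual 30-minute period):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SchedulingSketch
{
  /**
   * Schedules a "duty" that takes ~50 ms on a 20 ms period, lets it run
   * for ~400 ms, and returns how many times it fired.
   */
  static int countRuns(boolean fixedRate) throws InterruptedException
  {
    ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger runs = new AtomicInteger();
    Runnable duty = () -> {
      runs.incrementAndGet();
      // Simulate a duty whose work takes longer than the period.
      try {
        Thread.sleep(50);
      } catch (InterruptedException ignored) {
        // shutdownNow() interrupts the sleeping duty; just exit.
      }
    };
    if (fixedRate) {
      // Fixed rate: runs are anchored to the 20 ms period; when the duty
      // overruns, the next run fires as soon as the previous one returns.
      exec.scheduleAtFixedRate(duty, 0, 20, TimeUnit.MILLISECONDS);
    } else {
      // Fixed delay: the next run starts 20 ms AFTER the previous run
      // finishes, so the 50 ms duty stretches each cycle to ~70 ms.
      exec.scheduleWithFixedDelay(duty, 0, 20, TimeUnit.MILLISECONDS);
    }
    Thread.sleep(400);
    exec.shutdownNow();
    exec.awaitTermination(1, TimeUnit.SECONDS);
    return runs.get();
  }

  public static void main(String[] args) throws InterruptedException
  {
    int withDelay = countRuns(false);
    int atRate = countRuns(true);
    System.out.println("fixedDelay=" + withDelay + " fixedRate=" + atRate);
  }
}
```

Under a fixed rate the overrunning duty fires more often in the same window than under a fixed delay, which is why switching the Coordinator scheduling would keep the other duties closer to their 30-minute cadence.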