[GitHub] [druid] jihoonson commented on a change in pull request #10676: Allow client to configure batch ingestion task to wait to complete until segments are confirmed to be available by other

GitBox Thu, 25 Mar 2021 17:32:15 -0700


jihoonson commented on a change in pull request #10676:
URL: https://github.com/apache/druid/pull/10676#discussion_r601926994




##########
File path: 
indexing-service/src/test/java/org/apache/druid/indexing/common/task/CompactionTaskTest.java
##########
@@ -1447,6 +1452,7 @@ private void assertIngestionSchema(
             null,
             null,
             null,
+            null,
             null

Review comment:
       Interesting point. I think there are some things we should think about 
first.
   
   - It's true that compaction currently doesn't change the underlying data 
much, but it can make some changes, such as filtering out unnecessary 
dimensions or adding new metrics. You can now also change the query 
granularity. In the future, I can imagine that you could even transform your 
data using compaction, with new support for transformSpec.
   - The compaction task is a bit special and differs from other batch tasks 
in how it publishes segments. All other batch tasks can push segments in the 
middle of indexing, but should publish all of those segments at the end of 
indexing. The compaction task, however, can process one time chunk at a time 
when there is no change in segment granularity. In this case, it can publish 
segments whenever it finishes processing an individual time chunk. It can also 
go through all time chunks even when it fails to compact some of them. The 
final task status will be `FAILED` when it succeeds in compacting only some 
time chunks but fails for others.
   - Compacting datasources is usually not a single-shot job. Rather, 
you would run multiple small compaction tasks over time, as in auto compaction. 
In that case, you would want to know which time chunks are compacted and which 
are not, so that you can determine what results you will get when you query 
certain time chunks. For compaction that is set up manually outside Druid, 
tracking individual compaction tasks could be useful for this purpose. 
However, for auto compaction, it won't provide much value since compaction 
tasks are submitted by the coordinator, not by users. So, we need another way, 
such as adding a new coordinator API that returns such compaction status.
   
   Given these points, we would probably want something for compaction that is 
similar to, but different from, the one proposed here. I would suggest doing 
it in a different PR.
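
   To make the second point above concrete, here is a minimal sketch of the 
per-time-chunk behaviour: each chunk is published as soon as it is compacted, 
failures do not stop the loop, and the final status is `FAILED` if any chunk 
failed. This is not Druid code. `CompactionSketch`, `compactAll`, `Result`, 
and the `Predicate` standing in for "compact and publish one time chunk" are 
all hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these types exist in Druid.
class CompactionSketch {
    enum Status { SUCCESS, FAILED }

    static final class Result {
        final Status status;
        final List<String> published; // time chunks whose segments were published
        final List<String> failed;    // time chunks that could not be compacted

        Result(Status status, List<String> published, List<String> failed) {
            this.status = status;
            this.published = published;
            this.failed = failed;
        }
    }

    // compactChunk is an assumed callback: it compacts and publishes one time
    // chunk and returns true on success.
    static Result compactAll(List<String> timeChunks, Predicate<String> compactChunk) {
        List<String> published = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (String chunk : timeChunks) {
            boolean ok;
            try {
                ok = compactChunk.test(chunk);
            } catch (RuntimeException e) {
                ok = false; // a failing chunk does not abort the remaining chunks
            }
            if (ok) {
                published.add(chunk); // published immediately, not at the end
            } else {
                failed.add(chunk);
            }
        }
        // Partial success still yields FAILED as the final task status.
        Status status = failed.isEmpty() ? Status.SUCCESS : Status.FAILED;
        return new Result(status, published, failed);
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("2021-03-01", "2021-03-02", "2021-03-03");
        Result r = compactAll(chunks, c -> !c.equals("2021-03-02"));
        System.out.println(r.status + " published=" + r.published + " failed=" + r.failed);
    }
}
```

   In this sketch the caller sees `FAILED` even though two of the three chunks 
were published, which is why a task-level status alone does not tell you 
which time chunks are compacted.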



