jihoonson commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r472545452



##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -44,7 +45,7 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");

Review comment:
       Hmm, good question. I think it would be more useful to have a way to
apply different default configurations per datasource, since `maxSplitSize`
should be adjusted based on the shape of the input data and the partitioning
scheme of the output data. For that, though, I think it would be better to add
a supervisor that periodically performs batch ingestion based on user-provided
configurations.
   
   As for keeping the previous default in particular, I'm not sure when that
would be a good idea. `maxSplitSize` mostly controls the parallelism of the
phase that reads data from the inputSource in parallel indexing, but it also
affects the number of segments created after the input-read phase, so there is
a trade-off between the two. That said, I would say increasing `maxSplitSize`
from 512 MiB to 1 GiB wouldn't change things dramatically. If your cluster can
run all subtasks produced by the previous default at the same time, the new
default will create only about half as many subtasks, so you might want to keep
the previous default because, in theory, it will give you 2 times better read
performance. In practice, however, you would likely have more than one task
running at the same time, and the cluster resources would be shared across
them.
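   To make the trade-off concrete, here is a rough sketch of the arithmetic. It
is a simplified model, not Druid code: `SplitEstimate`, `splitCount`, and
`waves` are hypothetical names, the 10 GiB input and the slot counts are
made-up numbers, and it assumes the input packs evenly into splits, which real
files generally won't.

```java
// Hypothetical helper, not part of Druid: a simplified model of how
// maxSplitSize relates to the number of read subtasks and to how many
// "waves" of subtasks are needed when task slots are limited.
final class SplitEstimate
{
  static final long MIB = 1024L * 1024L;
  static final long GIB = 1024L * MIB;

  // Number of splits (read subtasks), assuming the input packs evenly
  // into splits of at most maxSplitSizeBytes (ceiling division).
  static long splitCount(long totalInputBytes, long maxSplitSizeBytes)
  {
    return (totalInputBytes + maxSplitSizeBytes - 1) / maxSplitSizeBytes;
  }

  // Number of sequential waves needed to run all subtasks when only
  // availableTaskSlots of them can run at the same time.
  static long waves(long subtasks, long availableTaskSlots)
  {
    return (subtasks + availableTaskSlots - 1) / availableTaskSlots;
  }

  public static void main(String[] args)
  {
    long totalInput = 10 * GIB; // made-up input size
    long[] slotCounts = {20, 10}; // made-up free task slots

    for (long slots : slotCounts) {
      for (long maxSplitSize : new long[]{512 * MIB, GIB}) {
        long subtasks = splitCount(totalInput, maxSplitSize);
        System.out.printf(
            "slots=%d maxSplitSize=%dMiB subtasks=%d waves=%d%n",
            slots,
            maxSplitSize / MIB,
            subtasks,
            waves(subtasks, slots)
        );
      }
    }
  }
}
```

   In this model, the smaller split size only doubles the read parallelism
when there are enough free slots to run all of its subtasks at once. When slots
are shared with other tasks, it needs extra waves instead and the advantage
shrinks.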
   
   What do you think?






