jihoonson commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r472545452
##########
File path:
core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -44,7 +45,7 @@
public static final String TYPE = "maxSize";
@VisibleForTesting
- static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("512MiB");
+ static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");
Review comment:
Hmm, good question. I think it would be more useful to have a way to apply
different default configurations per datasource, since `maxSplitSize` should be
adjusted based on the shape of the input data and the partitioning scheme of
the output data. For that, though, I think it would be better to add a
supervisor that periodically performs batch ingestion based on user-provided
configurations.

As for keeping the previous default, I'm not sure when that would be the better
choice. `maxSplitSize` mostly controls the parallelism of the phase that reads
data from the inputSource in parallel indexing, but it also affects the number
of segments created after the input-read phase, so there is a trade-off between
the two. That said, I would say increasing `maxSplitSize` from 512 MiB to 1 GiB
wouldn't change things dramatically. If your cluster has enough capacity to run
all the subtasks produced by the previous default at the same time, you might
want to keep the previous default because, in theory, it gives you 2x better
read performance than the new default, which creates half as many subtasks. In
practice, however, you would likely have more than one task running at the same
time, and the cluster resources have to be shared across them.
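
To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It
is not the actual `MaxSizeSplitHintSpec` logic: the class and both helpers are
hypothetical, it assumes the input packs perfectly into splits of exactly
`maxSplitSize`, and it models read time simply as (waves of subtasks) x (bytes
read per subtask). The 10 GiB input and the task slot counts are made-up
numbers for illustration.

```java
public class SplitSizeSketch
{
  static final long MIB = 1024L * 1024L;
  static final long GIB = 1024L * MIB;

  // Hypothetical helper: idealized number of splits (and thus subtasks),
  // assuming the input packs perfectly into splits of maxSplitSizeBytes.
  static long estimateSplits(long totalInputBytes, long maxSplitSizeBytes)
  {
    if (totalInputBytes < 0 || maxSplitSizeBytes <= 0) {
      throw new IllegalArgumentException("invalid input or split size");
    }
    // Ceiling division without floating point.
    return (totalInputBytes + maxSplitSizeBytes - 1) / maxSplitSizeBytes;
  }

  // Hypothetical helper: relative cost of the read phase, modeled as
  // (waves of subtasks) * (bytes read per subtask). Only useful for
  // comparing two settings against each other.
  static long relativeReadCost(long totalInputBytes, long maxSplitSizeBytes, int taskSlots)
  {
    if (taskSlots <= 0) {
      throw new IllegalArgumentException("taskSlots must be positive");
    }
    long splits = estimateSplits(totalInputBytes, maxSplitSizeBytes);
    long waves = (splits + taskSlots - 1) / taskSlots;
    return waves * maxSplitSizeBytes;
  }

  public static void main(String[] args)
  {
    long input = 10L * GIB; // assumed total input size
    for (int slots : new int[]{20, 10}) {
      long oldCost = relativeReadCost(input, 512L * MIB, slots);
      long newCost = relativeReadCost(input, GIB, slots);
      System.out.println(
          "slots=" + slots
          + " splits(512MiB)=" + estimateSplits(input, 512L * MIB)
          + " splits(1GiB)=" + estimateSplits(input, GIB)
          + " cost(1GiB)/cost(512MiB)=" + ((double) newCost / oldCost)
      );
    }
  }
}
```

Under these assumptions, the previous default only reads faster when there are
enough free task slots to run all of its subtasks in a single wave; once the
slots are the bottleneck, the two settings come out the same in this model.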
What do you think?