clintropolis commented on a change in pull request #10243:
URL: https://github.com/apache/druid/pull/10243#discussion_r473313243



##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). Noe that one subtask will not process more files than `maxNumFiles` even if their total size is smaller than `maxSplitSize`. [Human-readable format](../configuration/human-readable-byte.md) is supported.|1GiB|no|

Review comment:
       typo: 'Noe' -> 'Note'

##########
File path: core/src/main/java/org/apache/druid/data/input/MaxSizeSplitHintSpec.java
##########
@@ -43,22 +45,55 @@
   public static final String TYPE = "maxSize";
 
   @VisibleForTesting
-  static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024;
+  static final HumanReadableBytes DEFAULT_MAX_SPLIT_SIZE = new HumanReadableBytes("1GiB");
 
-  private final long maxSplitSize;
+  /**
+   * There are two known issues when a split contains a large list of files.
+   *
+   * - 'jute.maxbuffer' in ZooKeeper. This system property controls the max size of ZNode. As its default is 500KB,
+   *   task allocation can fail if the serialized ingestion spec is larger than this limit.
+   * - 'max_allowed_packet' in MySQL. This is the max size of a communication packet sent to a MySQL server.
+   *   The default is either 64MB or 4MB depending on MySQL version. Updating metadata store can fail if the serialized
+   *   ingestion spec is larger than this limit.
+   *
+   * The default is consertively chosen as 1000.

Review comment:
       is this a typo: 'consertively' -> 'conservatively'?

##########
File path: docs/ingestion/native-batch.md
##########
@@ -232,7 +232,8 @@ The size-based split hint spec is respected by all splittable input sources exce
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `maxSize`.|none|yes|
-|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet).|500MB|no|
+|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (Files are never split across tasks yet). Noe that one subtask will not process more files than `maxNumFiles` even if their total size is smaller than `maxSplitSize`. [Human-readable format](../configuration/human-readable-byte.md) is supported.|1GiB|no|
+|maxNumFiles|Maximum number of input files to process in a single task. This limit is to avoid task failures when the ingestion spec is too long. There are two known limits on the max size of serialized ingestion spec, i.e., the max ZNode size in ZooKeeper (`jute.maxbuffer`) and the max packet size in MySQL (`max_allowed_packet`). These can make ingestion tasks fail if the serialized ingestion spec size hits one of them. Note that one subtask will not process more data than `maxSplitSize` even if the total number of files is smaller than `maxNumFiles`.|1000|no|

Review comment:
       Does this limit apply to the entire parallel task, just the subtasks, or 
both? It isn't super clear from the docs here, though from my interpretation of 
the code it looks like this applies to subtasks?
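
To make the per-subtask reading concrete, here is a minimal sketch of how I understand the two limits to interact, based only on the doc text above. `SplitSketch` and `group` are hypothetical names for illustration and are not Druid's API; the actual logic in `MaxSizeSplitHintSpec` may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Druid implementation.
class SplitSketch {
    // Groups file sizes (in bytes) into splits, one split per subtask.
    // A split is closed as soon as adding the next file would exceed
    // either maxSplitSize (total bytes) or maxNumFiles (file count).
    static List<List<Long>> group(List<Long> fileSizes, long maxSplitSize, int maxNumFiles) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            boolean wouldExceedSize = currentBytes + size > maxSplitSize;
            boolean wouldExceedCount = current.size() + 1 > maxNumFiles;
            // Never close an empty split: a file larger than maxSplitSize
            // still gets a split of its own, since files are not divided.
            if (!current.isEmpty() && (wouldExceedSize || wouldExceedCount)) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Size limit of 1000 bytes, generous file-count limit.
        System.out.println(group(Arrays.asList(400L, 400L, 400L, 2000L, 100L), 1000L, 1000));
        // Generous size limit, at most 2 files per split.
        System.out.println(group(Arrays.asList(1L, 1L, 1L, 1L, 1L), 1000L, 2));
    }
}
```

Under this reading both limits bound each split (subtask) independently, and nothing here caps the total number of files for the parallel task as a whole. If that is the intended behavior, it may be worth saying so explicitly in the docs.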



