pengxianzi opened a new issue, #12585:
URL: https://github.com/apache/hudi/issues/12585
We are using Apache Hudi to build a data lake and writing data to a Kudu
table downstream. The following two scenarios exhibit different behaviors:
Scenario 1: The upstream writes data using a regular MOR (Merge On Read)
Hudi table, and the downstream reads the Hudi table and writes to the Kudu
table without any issues.
Scenario 2: The upstream writes data using a bucketed table, and when the
downstream reads the Hudi table and attempts to write to the Kudu table, the
task fails with the following warning:
`caution: the reader has fallen behind too much from the writer, tweak
'read.tasks' option to add parallelism of read tasks`
We have tried setting the read.tasks parameter to 10, but the issue
persists. Below are our configuration and environment details:
Hudi version: 0.14.0
Spark version: 2.4.7
Hive version: 3.1.3
Hadoop version: 3.1.1
Storage format: HDFS
Downstream storage: Apache Kudu
Bucketed table configuration: 10 buckets
Configuration Information
Below is our Hudi table configuration:
```java
Map<String, String> options = new HashMap<>();
options.put(FlinkOptions.PATH.key(), basePath + tableName);
options.put(FlinkOptions.TABLE_TYPE.key(), name); // table type, here MERGE_ON_READ
options.put(FlinkOptions.READ_AS_STREAMING.key(), "true");
options.put(FlinkOptions.PRECOMBINE_FIELD.key(), precombing);
options.put(FlinkOptions.READ_START_COMMIT.key(), "20210316134557");
options.put("read.streaming.skip_clustering", "true");
options.put("read.streaming.skip_compaction", "true");
```
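For reference, the upstream bucketed write in Scenario 2 would carry options along these lines. This is a minimal sketch, assuming the standard Hudi Flink bucket-index keys (`index.type`, `hoodie.bucket.index.num.buckets`); the 10-bucket count comes from our setup, while the path and table name are placeholders:

```java
import java.util.HashMap;
import java.util.Map;

public class BucketWriteOptions {
    // Sketch of the upstream writer's options for a bucketed MOR table.
    // Key names are the standard Hudi Flink option keys; the bucket count (10)
    // matches the configuration described in this issue.
    public static Map<String, String> bucketWriteOptions(String basePath, String tableName) {
        Map<String, String> options = new HashMap<>();
        options.put("path", basePath + tableName);
        options.put("table.type", "MERGE_ON_READ");
        options.put("index.type", "BUCKET");                  // use the bucket index
        options.put("hoodie.bucket.index.num.buckets", "10"); // 10 buckets, as in our setup
        return options;
    }
}
```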
Steps to Reproduce
1. The upstream job writes data to a bucketed Hudi MOR table.
2. The downstream job streams the Hudi table and attempts to write the data to
the Kudu table.
3. The task fails with the warning: "the reader has fallen behind too much from
the writer".
Attempted Solutions
1. Set the read.tasks parameter to 10, but the issue persists.
2. Checked the data distribution of the bucketed table to ensure there is
no data skew.
3. Checked the file layout of the Hudi table to ensure there are no
excessive small files.
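The `read.tasks` change from step 1 can be sketched as plain option entries. This is a minimal sketch assuming the standard Hudi Flink reader keys; the `read.streaming.check-interval` value is an illustrative assumption, not something we have verified to help:

```java
import java.util.HashMap;
import java.util.Map;

public class ReadTuning {
    // Reader-side tuning we attempted: raise the parallelism of read tasks.
    // The check-interval entry is an assumed example value (seconds between
    // polls for new commits), included only to show the related knob.
    public static Map<String, String> readTuningOptions() {
        Map<String, String> options = new HashMap<>();
        options.put("read.tasks", "10");                    // read-task parallelism
        options.put("read.streaming.check-interval", "60"); // assumed example value
        return options;
    }
}
```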
Expected Behavior
The downstream job should be able to stream-read the bucketed Hudi MOR table
and write the data to the Kudu table normally.
Actual Behavior
The downstream read task fails with the warning: reader has fallen behind
too much from the writer.
Log Snippet
Below is a snippet of the log when the task fails:
```
Caused by: org.apache.flink.runtime.resourcemanager.exceptions.UnknownTaskExecutorException: No TaskExecutor registered under container_e38_1734494154374_0718_01_000002
Caused by: org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold
Caused by: java.util.concurrent.TimeoutException
caution: the reader has fallen behind too much from the writer, tweak 'read.tasks' option to add parallelism of read tasks
ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.util.FlinkException: The TaskExecutor's registration at the ResourceManager akka.tcp://node1:7791/user/rpc/resourcemanager_0 has been rejected: Rejected TaskExecutor registration at the ResourceManager because: The ResourceManager does not recognize this TaskExecutor
```
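The "Exceeded checkpoint tolerable failure threshold" line suggests the job is killed as soon as checkpoints start failing (Flink's default tolerance is 0 failed checkpoints). As a stopgap while the underlying read lag is investigated, not as a fix, the tolerance and checkpoint timeout can be raised. A minimal sketch with assumed example values:

```java
import java.util.HashMap;
import java.util.Map;

public class CheckpointTolerance {
    // Flink checkpointing config keys; both values below are assumed examples,
    // not settings we have confirmed to resolve this issue.
    public static Map<String, String> checkpointConfig() {
        Map<String, String> conf = new HashMap<>();
        conf.put("execution.checkpointing.tolerable-failed-checkpoints", "5"); // default is 0
        conf.put("execution.checkpointing.timeout", "10 min");                 // allow slower checkpoints
        return conf;
    }
}
```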
Summary of Questions
We would like to know:
1. Why does the read task fall behind when using a bucketed table?
2. Are there any other configurations besides read.tasks that can optimize
read performance?
3. Are there any known issues or limitations related to the combination of
bucketed tables and MOR tables?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]