pengxianzi commented on issue #12585:
URL: https://github.com/apache/hudi/issues/12585#issuecomment-2574674708
> For a bucketed table, are you referring to the bucket index of a MOR table?
> One fact to know is that the writer writes pure Avro logs at first, so the
> streaming reader would also read these logs.
>
> For streaming read we have an option value named "earliest" for the
> `read.start-commit` option, which is more straightforward.
>
> It looks like the warning log is normal because of the explicitly specified
> read start commit; this log shows up when the commit to read has already
> been archived.
Thank you for your help! We followed your suggestion and used the following
configuration:
options.put(FlinkOptions.READ_START_COMMIT.key(), "earliest");
This configuration indeed resolved the read task lag issue, and we were able
to read the Hudi table and write to the Kudu table normally. However, the task
stopped after running for a while and threw the following error:
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - split_reader ->
Sink:Unnamed(1/1) switched from INITIALIZING to FAILED on container_e30_xxx
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job switched
from state RUNNING to FAILED
org.apache.flink.runtime.JobException: Recovery is suppressed by
FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=3,
backoffTimeMs=60000)
Caused by: org.apache.hudi.exception.HoodieException: Get reader error for
path: hdfs://nameservice1:xxx.parquet
We tried to skip files by using the following configurations:
options.put("read.streaming.skip_clustering", "true");
options.put("read.streaming.skip_compaction", "true");
We also used the following clean policy:
options.put("hoodie.clean.automatic", "true");
options.put("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS");
options.put("hoodie.cleaner.commits.retained", "5");
options.put("hoodie.clean.async", "true");
But the issue persists.
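For reference, the settings quoted above can be collected into one map. This is only a minimal sketch: the class name `HudiStreamingReadOptions` and the method `build` are hypothetical, and it assumes `FlinkOptions.READ_START_COMMIT.key()` resolves to the `read.start-commit` string named in the quoted reply. It uses plain string keys so that it runs without Hudi on the classpath, and it only restates the configuration; it does not address the reader error.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers the options from this comment in one place.
public class HudiStreamingReadOptions {

    /** Returns the options quoted in this comment, in insertion order. */
    public static Map<String, String> build() {
        Map<String, String> options = new LinkedHashMap<>();

        // Streaming read start point (assumed key for FlinkOptions.READ_START_COMMIT).
        options.put("read.start-commit", "earliest");

        // Skip instants produced by clustering and compaction during streaming read.
        options.put("read.streaming.skip_clustering", "true");
        options.put("read.streaming.skip_compaction", "true");

        // Clean policy exactly as quoted in the comment.
        options.put("hoodie.clean.automatic", "true");
        options.put("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS");
        options.put("hoodie.cleaner.commits.retained", "5");
        options.put("hoodie.clean.async", "true");

        return options;
    }

    public static void main(String[] args) {
        // Print each key/value pair so the effective settings can be compared with the job.
        build().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Keeping the options in one ordered map makes it easier to paste the exact effective configuration into the issue when reporting further results.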