xiearthur commented on issue #12523: URL: https://github.com/apache/hudi/issues/12523#issuecomment-2571593978
Thank you for your input. Regarding the sub-pipeline question: after review, we confirmed that our data-processing flow does not use multiple sub-pipelines.

To address the checkpoint timeouts and failures in our Flink job, we raised the checkpoint timeout to 10 minutes and increased the write task parallelism to 10. With these changes, data writes now succeed. We identified the earlier timeouts as being caused by Parquet file merges that could not complete within a single checkpoint interval.

We are using a simple bucket index with a Copy-on-Write (COW) table, and our bucket sizes currently range from 300 MB to 800 MB. The bucket count appears to be too low, producing oversized buckets that slow down the merge operation. The optimal bucket size depends on data characteristics and system resources; a common guideline is to keep individual buckets between 100 MB and 500 MB to balance processing efficiency and resource usage. We will increase the number of buckets accordingly, continue to monitor system performance, and make further adjustments as needed. If you have any further suggestions or questions, please feel free to reach out.
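For reference, the adjustments described above roughly correspond to the following Flink SQL settings. This is a hedged sketch, not our exact DDL: the table name `hudi_cow_tbl`, the column list, and the bucket count of `400` are placeholders; the option keys (`execution.checkpointing.timeout`, `index.type`, `hoodie.bucket.index.num.buckets`, `write.tasks`) are standard Flink/Hudi configuration names.

```sql
-- Give slow Parquet merges room to finish within one checkpoint.
SET 'execution.checkpointing.timeout' = '10 min';

-- Placeholder COW table using the simple bucket index.
CREATE TABLE hudi_cow_tbl (
  id BIGINT PRIMARY KEY NOT ENFORCED,
  payload STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_cow_tbl',        -- placeholder path
  'table.type' = 'COPY_ON_WRITE',
  'index.type' = 'BUCKET',                     -- simple bucket index
  'hoodie.bucket.index.num.buckets' = '400',   -- placeholder; size for target bucket size
  'write.tasks' = '10'                         -- raised write parallelism
);
```

Note that with the simple bucket index the bucket count is fixed at table creation, so it needs to be sized for the expected data volume up front.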
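The bucket-count adjustment is simple arithmetic: divide the expected total table size by the target per-bucket size. A minimal sketch, assuming a hypothetical 100 GB table and a ~256 MB target (both placeholders, not our actual numbers):

```python
import math

def required_buckets(total_size_gb: float, target_bucket_mb: int = 256) -> int:
    """Buckets needed so each bucket lands near the target size.

    Keeping buckets in the 100-500 MB range keeps per-checkpoint
    Parquet merges small enough to finish within the interval.
    """
    total_mb = total_size_gb * 1024
    return math.ceil(total_mb / target_bucket_mb)

# Hypothetical example: 100 GB of data at ~256 MB per bucket.
print(required_buckets(100, 256))  # 400
```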
