xiearthur commented on issue #12523:
URL: https://github.com/apache/hudi/issues/12523#issuecomment-2571593978

   Thank you for your input. Regarding the sub-pipeline issue: upon review, we 
confirmed that our data processing flow does not use multiple sub-pipelines. To 
address the checkpoint timeouts and failures in our Flink job, we increased the 
checkpoint timeout to 10 minutes and raised the write task parallelism to 10. 
With these changes, data now writes successfully.
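   As a sketch, the two adjustments map to one Flink configuration key and one 
Hudi writer option. The exact table name and query below are placeholders; the 
keys are the standard `execution.checkpointing.timeout` Flink setting and the 
Hudi Flink `write.tasks` option:

   ```sql
   -- Allow each checkpoint up to 10 minutes before it is declared failed.
   SET 'execution.checkpointing.timeout' = '10 min';

   -- Raise the Hudi write task parallelism to 10 for this insert, passed
   -- as a per-query dynamic table option (sink table is a placeholder).
   INSERT INTO hudi_sink /*+ OPTIONS('write.tasks' = '10') */
   SELECT * FROM source_table;
   ```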
   
   We identified that the previous timeouts were caused by Parquet file merges 
that could not complete within a single checkpoint interval. We are currently 
using the simple bucket index strategy with a Copy-on-Write (COW) table.
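   For reference, the bucket index on a COW table is typically declared at 
table creation time. A minimal sketch, assuming the Flink SQL writer (table 
name, path, schema, and the bucket count of 64 are all placeholders):

   ```sql
   CREATE TABLE hudi_cow_bucketed (
     id   BIGINT,
     name STRING,
     ts   TIMESTAMP(3),
     PRIMARY KEY (id) NOT ENFORCED
   ) WITH (
     'connector' = 'hudi',
     'path' = 'file:///tmp/hudi_cow_bucketed',   -- placeholder path
     'table.type' = 'COPY_ON_WRITE',             -- COW table
     'index.type' = 'BUCKET',                    -- bucket index (simple engine is the default)
     'hoodie.bucket.index.num.buckets' = '64'    -- example bucket count
   );
   ```

   Note that with the simple bucket engine the bucket count is fixed once the 
table is created, so changing it means rewriting the table.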
   
   Our current bucket sizes range from 300MB to 800MB. It appears the number of 
buckets may be too low, producing oversized individual buckets and slowing the 
merge operation.
   
   As for the optimal bucket size, it typically depends on the data 
characteristics and system resources. A common guideline is to keep individual 
buckets between 100MB and 500MB to balance processing efficiency and resource 
usage. We will adjust the number of buckets accordingly.
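   The adjustment above is simple arithmetic: estimate the total table size 
from the current bucket count and observed average bucket size, then divide by 
the target size. A small sketch with made-up numbers (the function name and all 
figures are hypothetical, not from our actual table):

   ```python
   import math

   def suggested_bucket_count(current_buckets, avg_bucket_size_mb, target_size_mb):
       """Estimate how many buckets would bring the average bucket
       size down to roughly target_size_mb."""
       total_mb = current_buckets * avg_bucket_size_mb  # approximate table size
       return math.ceil(total_mb / target_size_mb)

   # Example: 64 buckets averaging 550 MB each, aiming for ~400 MB per bucket.
   print(suggested_bucket_count(64, 550, 400))  # -> 88
   ```

   Since the simple bucket engine fixes the bucket count at table creation, 
applying the new count means recreating or rewriting the table.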
   
   We will continue to monitor system performance and make adjustments as 
needed. If you have any further suggestions or questions, please feel free to 
reach out.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
