gunjdesai opened a new issue, #6610: URL: https://github.com/apache/hudi/issues/6610
**Environment Description**

* Hudi version : `0.11.0`
* Spark version : `3.2.0`
* Hive Metastore version : `3.1.0`
* Storage (HDFS/S3/GCS..) : `Minio`
* Running on Docker? (yes/no) : `yes`

Hi Folks,

We are using Hudi via Spark to push data into Trino. The pipeline started recently and data accuracy is as expected. Since we also want to backfill older data, we are pushing it into the same topic as the real-time data. This works well up to a point, but as the volume of data in the topic grows, the **_tagging task takes more than 12 hours, running on only one thread at a time, before eventually failing_**.

Can you suggest a better approach to backfilling that does not require shutting down the real-time pipeline? Any guidance is deeply appreciated.
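One commonly suggested workaround (not from this thread, and not necessarily the right fix for this pipeline) is to run the backfill as a separate Spark job using Hudi's `bulk_insert` operation, which skips the index-tagging step that `upsert` performs; deduplication against existing records must then be handled separately. A minimal sketch, where the table name, record-key field, precombine field, and output path are all hypothetical:

```python
# Sketch only: Hudi write options for a one-off backfill job.
# "bulk_insert" skips index tagging (the step dominating the runtime
# described above), at the cost of not merging with existing records.
backfill_opts = {
    "hoodie.table.name": "events",                       # hypothetical table name
    "hoodie.datasource.write.operation": "bulk_insert",  # no tagging, append-style write
    "hoodie.datasource.write.recordkey.field": "id",     # hypothetical key field
    "hoodie.datasource.write.precombine.field": "ts",    # hypothetical ordering field
}

# Usage (inside a Spark session with the Hudi bundle on the classpath;
# path is hypothetical):
# df.write.format("hudi") \
#     .options(**backfill_opts) \
#     .mode("append") \
#     .save("s3a://bucket/events")
```

Whether this is safe depends on whether the backfill keys can collide with live data; if they can, a global index or a later dedup pass would be needed.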
