gunjdesai opened a new issue, #6610: URL: https://github.com/apache/hudi/issues/6610
**Environment Description**

* Hudi version : `0.11.0`
* Spark version : `3.2.0`
* Hive Metastore version : `3.1.0`
* Storage (HDFS/S3/GCS..) : `Minio`
* Running on Docker? (yes/no) : `yes`

Hi Folks,

We are using Hudi via Spark to push data into Trino. The pipeline started recently and data accuracy is as expected. Since we also want to backfill older data, we are pushing it into the same topic as the real-time data. This works well up to a point, but as the volume of data in the topic grows, the **_tagging task takes more than 12 hours, running on only one thread at a time, before eventually failing_**.

Can you suggest a better approach to backfilling that does not require shutting down the real-time pipeline? Any guidance is deeply appreciated.
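One commonly suggested workaround (not from this thread, and not necessarily the right fix for this pipeline) is to run the backfill as a separate Spark job using Hudi's `bulk_insert` operation, which skips the index-tagging step that `upsert` performs; deduplication against existing records must then be handled separately. A minimal sketch, where the table name, record-key field, precombine field, and output path are all hypothetical:

```python
# Sketch only: Hudi write options for a one-off backfill job.
# "bulk_insert" skips index tagging (the step dominating the runtime
# described above), at the cost of not merging with existing records.
backfill_opts = {
    "hoodie.table.name": "events",                       # hypothetical table name
    "hoodie.datasource.write.operation": "bulk_insert",  # no tagging, append-style write
    "hoodie.datasource.write.recordkey.field": "id",     # hypothetical key field
    "hoodie.datasource.write.precombine.field": "ts",    # hypothetical ordering field
}

# Usage (inside a Spark session with the Hudi bundle on the classpath;
# path is hypothetical):
# df.write.format("hudi") \
#     .options(**backfill_opts) \
#     .mode("append") \
#     .save("s3a://bucket/events")
```

Whether this is safe depends on whether the backfill keys can collide with live data; if they can, a global index or a later dedup pass would be needed.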
