[GitHub] [hudi] gunjdesai commented on issue #6610: [QUESTION] Faster approach for backfilling older data without stopping realtime pipelines

GitBox Thu, 15 Sep 2022 01:40:15 -0700


gunjdesai commented on issue #6610:
URL: https://github.com/apache/hudi/issues/6610#issuecomment-1247777136


   @xushiyan Yes, this is a Spark Structured Streaming job. We run it on 
K8S Spot instances, where the driver is sometimes evicted, so we can't use 
the multi-writer approach because an eviction can interfere with the locks.
   Yes, the job does scale up as backfill traffic increases.
   
   The original idea was to keep the real-time pipeline running during the 
backfill, but I don't think our setup allows that. 
   
   After further reading, I am considering stopping the real-time pipeline, 
running a **_bulk_insert_** for the table, and then restarting the real-time 
pipeline in **_upsert_** mode.
   Would you say this is a good approach?
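
   Roughly, the two phases could look like the sketch below. This is only an 
illustration of the idea, not our actual job: the table name, key fields, 
paths and helper functions are all hypothetical, and it assumes the Hudi 
Spark bundle is on the classpath. The only difference between the two phases 
is the value of `hoodie.datasource.write.operation`.

```python
from typing import Dict

# Hypothetical table and field names, for illustration only.
TABLE_NAME = "events"
RECORD_KEY = "event_id"
PRECOMBINE = "event_ts"
PARTITION = "event_date"


def hudi_write_options(operation: str) -> Dict[str, str]:
    """Build Hudi datasource write options for the given write operation."""
    if operation not in ("bulk_insert", "upsert"):
        raise ValueError(f"unsupported operation: {operation}")
    return {
        "hoodie.table.name": TABLE_NAME,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": RECORD_KEY,
        "hoodie.datasource.write.precombine.field": PRECOMBINE,
        "hoodie.datasource.write.partitionpath.field": PARTITION,
    }


def run_backfill(backfill_df, base_path: str) -> None:
    """Phase 1 (real-time job stopped): batch bulk_insert of the older data."""
    (backfill_df.write.format("hudi")
        .options(**hudi_write_options("bulk_insert"))
        .mode("append")  # append to the existing table instead of overwriting it
        .save(base_path))


def start_realtime(stream_df, base_path: str, checkpoint_dir: str):
    """Phase 2: restart the structured streaming job in upsert mode."""
    return (stream_df.writeStream.format("hudi")
        .options(**hudi_write_options("upsert"))
        .option("checkpointLocation", checkpoint_dir)
        .outputMode("append")
        .start(base_path))
```

   One thing I would need to check: as far as I understand, bulk_insert does 
not look up existing records, so the backfill data should not overlap with 
keys already in the table.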


