[GitHub] [hudi] vinothchandar commented on issue #1757: Slow Bulk Insert Performance [SUPPORT]

GitBox Tue, 23 Jun 2020 13:22:02 -0700


vinothchandar commented on issue #1757:
URL: https://github.com/apache/hudi/issues/1757#issuecomment-648396783



   @somebol I assume this is an initial load, and that after it you would do 
insert/upsert operations incrementally? 
   
   At a high level, `bulk_insert` sorts the data and then writes it out. From 
what I can tell, you have sufficient parallelism. But a number of tasks are 
failing and being retried (stages 2 and 4), which probably adds a lot of time 
to the runs. Could you look into how skewed the task runtimes are within those 
stages? 
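
   One rough way to quantify skew within a stage is to compare the slowest 
task to the median task. The sketch below is only an illustration: the 
`skew_report` helper and the durations are hypothetical, and in practice the 
per-task durations would be copied from the stage detail page of the Spark UI. 

```python
from statistics import median


def skew_report(durations_s):
    """Summarize how uneven task runtimes are within a single stage.

    durations_s: per-task durations in seconds (hypothetical input,
    e.g. copied by hand from the Spark UI stage page).
    """
    if not durations_s:
        raise ValueError("need at least one task duration")
    ordered = sorted(durations_s)
    med = median(ordered)
    longest = ordered[-1]
    return {
        "tasks": len(ordered),
        "median_s": med,
        "max_s": longest,
        # A ratio far above 1 means a few tasks dominate the stage runtime.
        "max_over_median": longest / med if med > 0 else float("inf"),
    }


# Made-up durations for illustration only, not taken from the issue.
stage_2 = [41.0, 39.5, 44.2, 40.8, 310.0, 42.1]
report = skew_report(stage_2)
print(report)
```

   A `max_over_median` close to 1 suggests evenly sized tasks, while a large 
ratio points at a few straggler tasks (or retried attempts) holding up the 
stage. 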
   
   P.S.: We do incur the cost of the Row -> GenericRecord -> Parquet 
conversion (@nsivabalan has a branch with a fix that will make it into 0.6.0). 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
