bwu2 edited a comment on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-585512613
 
 
   @vinothchandar Thanks for taking the time to reply!
   
   Let me describe the simplest example of this problem on a tiny COW data set: 
Create a data frame with 4m rows and one column with values 1, 2, 3, ..., 4m in 
that column. Bulk insert that into Hudi (using the one column as the 
`recordkey`). This takes ~1 minute to run and the data size is about 30MB. Now 
upsert the same data frame into the existing table. This takes >2 hours to run.
   
   Alternatively, if we upsert a new data frame into the existing table with 
values 4000001...8m (still 4m rows upserted), this takes ~1 minute to run.
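   
   A minimal PySpark sketch of the steps described above, for anyone who wants 
to try the same sequence. Only the row counts, the key ranges and the use of 
the single column as the record key come from the description; the table name, 
the base path, the precombine field and the constant `part` partition column 
are assumptions added so the writes have the settings they need, and they may 
differ from the original job.

```python
import time

RECORD_KEY = "id"       # spark.range names its single column "id"
PARTITION_COL = "part"  # hypothetical constant column, not part of the original report


def hudi_options(table_name, operation):
    """Build Hudi datasource write options for one write."""
    return {
        "hoodie.table.name": table_name,
        # "bulk_insert" for the initial load, "upsert" for the follow-up writes
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": RECORD_KEY,
        # Assumption: the report does not name a precombine field, so the key is reused.
        "hoodie.datasource.write.precombine.field": RECORD_KEY,
        # Assumption: every row goes into one partition via a constant column.
        "hoodie.datasource.write.partitionpath.field": PARTITION_COL,
    }


def run_repro(spark, base_path, table_name="upsert_repro", rows=4_000_000):
    """Run the three writes in order and return wall-clock seconds for each."""
    # Imported here so hudi_options() can be used without Spark installed.
    from pyspark.sql.functions import lit

    def frame(first, last):
        # spark.range excludes its end value, hence last + 1
        return spark.range(first, last + 1).withColumn(PARTITION_COL, lit("all"))

    def timed_write(df, operation, mode):
        start = time.time()
        (df.write.format("org.apache.hudi")
           .options(**hudi_options(table_name, operation))
           .mode(mode)
           .save(base_path))
        return time.time() - start

    existing = frame(1, rows)
    return {
        # Initial load of keys 1..rows
        "bulk_insert": timed_write(existing, "bulk_insert", "overwrite"),
        # Same keys again: the slow case in the report
        "upsert_same_keys": timed_write(existing, "upsert", "append"),
        # Keys rows+1..2*rows: the fast case in the report
        "upsert_new_keys": timed_write(frame(rows + 1, 2 * rows), "upsert", "append"),
    }
```

   Calling `run_repro(spark, "/tmp/hudi_upsert_repro")` (a hypothetical path) 
on a session with the Hudi Spark bundle on the classpath should return one 
timing per write. Note that the two upserts are described as alternatives; 
running them back to back on the same table, as this sketch does, is a 
simplification.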
   
   To answer your other queries: 
   * almost all of the time is spent in the `HoodieSparkSqlWriter` job and, 
within that job, in the `count at HoodieSparkSqlWriter.scala` stage (the 
`HoodieBloomIndex` jobs run quickly).
   * it seems highly unlikely to be a resource constraint issue with such a 
small example.
   
   Shall I raise a JIRA for this? Or is this the expected behavior for such a 
workload?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
