vinothchandar commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-585503340
 
 
   Reposting my response here.
   
   There seem to be a lot of common concerns here. The tuning guide at
https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide is a useful
resource that will hopefully help.
   
   A few high-level thoughts:
   It would be good to lay out whether most of the time is spent in the
indexing stages (the ones tagged with HoodieBloomIndex) or in the actual
writing.
   Hudi keeps the input in memory to compute the stats it needs to size
files. So if you don't provide sufficient executor/RDD storage memory, it will
spill and can cause slowdowns. (This is covered in the tuning guide, and I have
seen it happen to users often.)
   On the workload pattern itself, BloomIndex range pruning can be turned off
(https://hudi.apache.org/docs/configurations.html#bloomIndexPruneByRanges) if
the key ranges are random anyway. Generally speaking, until we have RFC-8
(record level indexing), workloads with random writes, or upserts touching a
majority of the rows in a table, may incur bloom index overhead, since the
bloom filters/ranges are not at all useful for pruning out files. We have an
interim solution coming out in the next release: falling back to a plain old
join to implement the indexing.
   In terms of MOR and COW, MOR will help only if you have lots of updates and
the bottleneck is on the writing.
   If listing is an issue, please turn on the following so that the table is
listed once and we reuse the filesystem metadata:
hoodie.embed.timeline.server=true
   I would appreciate a JIRA, so that I can break each of these into a
sub-task and tackle/resolve them independently.
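
   As a rough illustration of the two config suggestions above, here is a
minimal Python sketch that collects them into a dict of write options. The
helper name and its flags are hypothetical. The key
hoodie.embed.timeline.server is taken from this comment; the key
hoodie.bloom.index.prune.by.ranges is my assumption for the
bloomIndexPruneByRanges setting linked above, so please check it against the
configuration docs for your Hudi version.

```python
def hudi_upsert_tuning_options(random_keys, listing_is_slow):
    """Collect the write options suggested in this comment (sketch only)."""
    options = {}
    if random_keys:
        # Assumed key name for the bloomIndexPruneByRanges setting:
        # range pruning does not help when record keys are random.
        options["hoodie.bloom.index.prune.by.ranges"] = "false"
    if listing_is_slow:
        # Key taken verbatim from the comment: list the table once and
        # reuse the filesystem metadata.
        options["hoodie.embed.timeline.server"] = "true"
    return options


opts = hudi_upsert_tuning_options(random_keys=True, listing_is_slow=True)
for key, value in sorted(opts.items()):
    print(f"{key}={value}")
```

   A dict like this could then be handed to the Spark DataFrame writer used
for the upsert, for example via .options(**opts), alongside the usual table
and record key options.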
   
   I am personally focusing on performance now and want to make it a lot
faster in the 0.6.0 release, so all this help would be deeply appreciated.
