yihua commented on issue #5481:
URL: https://github.com/apache/hudi/issues/5481#issuecomment-1169413106

   @MikeBuh sorry for getting back late.  If you still haven't figured out the right configs, [here](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide) is a more detailed tuning guide for upserts.  At this point, it is also worth tuning other configs, such as making `spark.memory.fraction` and `spark.memory.storageFraction` small (e.g., 0.2, as mentioned in the guide), so that the job runs reliably, if more slowly, instead of crashing intermittently.
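   The memory settings above could be collected like this (a minimal sketch; the 0.2 values are the ones suggested in the tuning guide, not universal defaults):

   ```python
   # Sketch: Spark memory settings suggested in the Hudi tuning guide.
   # Shrinking both fractions leaves more executor heap for Hudi's merge and
   # indexing work, trading Spark caching/execution memory for stability.
   memory_confs = {
       "spark.memory.fraction": "0.2",         # share of heap for execution + storage
       "spark.memory.storageFraction": "0.2",  # share of the above reserved for caching
   }

   # These would typically be applied when building the session, e.g.:
   #   builder = SparkSession.builder
   #   for k, v in memory_confs.items():
   #       builder = builder.config(k, v)
   ```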
   
   > size of target table: I noticed that reloading the same batches to a near 
empty table is more successful
   
   Since the upsert first indexes the input against the existing records in the data files, the target table size also matters when choosing the parallelism.  You may try 1000 for the parallelism.
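   As a write-option sketch, the suggested value would go on the upsert shuffle parallelism config (key name per the standard Hudi write configs; verify against the docs for your release):

   ```python
   # Sketch: setting the suggested upsert parallelism of 1000.
   # hoodie.upsert.shuffle.parallelism controls how many shuffle partitions the
   # upsert write path uses; larger target tables generally need larger values.
   hudi_write_opts = {
       "hoodie.upsert.shuffle.parallelism": "1000",
   }

   # Typically passed as options on the Hudi write, e.g.:
   #   df.write.format("hudi").options(**hudi_write_opts)...
   ```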
   
   > file sizes: maybe having less but larger files in the target table can 
help when comparing and updating
   
   Fewer, larger files in the target table definitely help the indexing phase during upsert, since fewer bloom filters need to be read from the file footers.  That said, tuning the memory settings and fractions should still make it work with small files.
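   If you want Hudi itself to bias toward fewer, larger base files, the relevant file-sizing knobs look roughly like this (key names per the Hudi write configs; the byte values here are illustrative, not recommendations):

   ```python
   # Sketch: Hudi file-sizing configs that influence how many base files the
   # table ends up with.  Values are illustrative examples only.
   file_sizing_opts = {
       # Target upper bound for a base (parquet) file.
       "hoodie.parquet.max.file.size": str(512 * 1024 * 1024),
       # Files smaller than this are treated as "small" and receive new
       # inserts until they grow, reducing the total file count.
       "hoodie.parquet.small.file.limit": str(256 * 1024 * 1024),
   }
   ```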
   
   > compaction and cleanup: if these are heavy operations that need lots of 
memory then perhaps they can be tweaked
   
   To start with, you may disable the async compaction and cleaning services so that they don't interfere with the ingestion.
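   For isolating the ingestion, the table-service toggles look roughly like this (a hedged sketch; the exact keys depend on table type and Hudi version, so check the config reference for your release before using them):

   ```python
   # Sketch: turning off table services while debugging ingestion memory issues.
   # Key names may vary across Hudi versions -- verify against the docs.
   table_service_opts = {
       # Disable automatic cleaning of old file versions.
       "hoodie.clean.automatic": "false",
       # Merge-on-read tables only: disable async compaction.
       "hoodie.datasource.compaction.async.enable": "false",
   }
   ```

   Re-enable both once the upsert itself runs reliably, since disabling them indefinitely lets old file versions and log files accumulate.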

