nsivabalan commented on issue #3077: URL: https://github.com/apache/hudi/issues/3077#issuecomment-891916056
@karan867: sorry for the delay, let me try to help you out. @abdulmuqeeth: can you please file a new GitHub issue and CC me on it? Let's not pollute this issue, but feel free to follow it for pointers.

@karan867: coming back to your use case:

1. What max parquet file size are you setting? If it's the default, files are ~120MB.
2. What's your average record size? This will help determine your bloom filter numbers.
3. During your first ingest, what was `hoodie.copyonwrite.insert.split.size` set to? This determines the size of each data file (only during the first ingest, since Hudi has no stats yet to estimate the record size).
4. Does your record key have any timestamp-ordering characteristics? If yes, a bloom-type index would help a lot: every data file stores a min key and a max key in its footer, and Hudi uses that range to prune data files. If keys are random, we can disable range pruning and see if that helps. Regarding the SIMPLE index, you are right: its cost is relative to your dataset size, so as the dataset grows, SIMPLE index time will grow with it.
5. What percentage of your writes are updates vs. new inserts?
6. Disabling small-file handling should work: only updates should get routed to old data files, and all new inserts should go into new data files. If not, it could be a regression. Can you confirm whether you see inserts going to old data files even after disabling small-file handling?
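The configs discussed above can be sketched as a PySpark write-options dict. The config keys are real Hudi options, but the table name, record-key field, and the specific values are illustrative assumptions for this thread, not recommendations:

```python
# Hedged sketch of the Hudi write options discussed above.
# Values are illustrative; tune them for your own workload.
hudi_options = {
    "hoodie.table.name": "my_table",                    # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "uuid",  # hypothetical key field
    # Q1: max parquet file size (default is ~120MB, i.e. 125829120 bytes)
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    # Q3: controls data-file sizing on the very first ingest only,
    # before Hudi has stats to estimate record size
    "hoodie.copyonwrite.insert.split.size": "500000",
    # Q4: bloom index; disable min/max range pruning if keys are random
    "hoodie.index.type": "BLOOM",
    "hoodie.bloom.index.prune.by.ranges": "false",
    # Q6: setting the small-file limit to 0 disables small-file handling,
    # so new inserts should always go to new data files
    "hoodie.parquet.small.file.limit": "0",
}

# Typical usage (requires a SparkSession with the Hudi bundle on the classpath):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```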
