nsivabalan commented on issue #4873:
URL: https://github.com/apache/hudi/issues/4873#issuecomment-1050409991


   yes, do set `hoodie.compact.inline` to true to get compaction moving. Later, you can think about moving to the async compaction flow.
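   For reference, a minimal sketch of the writer options involved. Only `hoodie.compact.inline` comes from this thread; the table name, table type, and delta-commit threshold are illustrative assumptions:

   ```python
   # Hedged sketch: Hudi writer options enabling inline compaction on a
   # MERGE_ON_READ table. Everything except hoodie.compact.inline itself
   # is an assumed example value.
   hudi_options = {
       "hoodie.table.name": "my_table",                       # assumed name
       "hoodie.datasource.write.table.type": "MERGE_ON_READ", # assumed type
       "hoodie.compact.inline": "true",
       # run compaction after this many delta commits (example value)
       "hoodie.compact.inline.max.delta.commits": "5",
   }

   # these options would typically be passed to a Spark datasource write, e.g.
   # df.write.format("hudi").options(**hudi_options).mode("append").save(path)
   ```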
   
   btw, wrt partitioning strategy, it depends on your query patterns. If your queries mostly have predicates on dates, it's a wise choice to partition by date.
   
   Even your ingestion might speed up depending on your partitioning strategy.
   
   For eg, if your dataset is date partitioned and your incoming records have data for only 5 partitions, the index lookup happens only among those 5 partitions. But if your partitioning is more complex and is based on both date and some other field X, the number of partitions touched might be higher.
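   The effect on the index lookup can be sketched like this (a toy model, not Hudi's actual index code; the partition functions and batch are hypothetical):

   ```python
   from datetime import date

   # Toy model of how partitioning narrows the index lookup: only partitions
   # that incoming records map to need to be probed for existing keys.

   def touched_partitions(records, partition_fn):
       """Set of partitions the index lookup has to probe for this batch."""
       return {partition_fn(r) for r in records}

   # hypothetical incoming batch: (event_date, field_x) pairs
   batch = [
       (date(2022, 2, 1), "a"),
       (date(2022, 2, 1), "b"),
       (date(2022, 2, 2), "a"),
   ]

   # date-only partitioning: 2 partitions probed
   by_date = touched_partitions(batch, lambda r: r[0].isoformat())

   # composite date+X partitioning: 3 partitions probed for the same batch
   by_date_and_x = touched_partitions(batch, lambda r: (r[0].isoformat(), r[1]))

   print(len(by_date), len(by_date_and_x))  # 2 3
   ```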
   
   Also, do check the cardinality of your partitioning scheme. Ensure you have neither a very high number of partitions (small files) nor too few partitions (large files). Try to find a middle ground.
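   As a rough back-of-the-envelope check (all numbers here are made up for illustration):

   ```python
   # Back-of-the-envelope partition-count check; numbers are illustrative.
   days_of_history = 3 * 365            # ~3 years of daily partitions
   cardinality_of_x = 100               # distinct values of some field X

   date_only = days_of_history                       # 1095: usually manageable
   date_and_x = days_of_history * cardinality_of_x   # 109500: likely too many

   print(date_only, date_and_x)  # 1095 109500
   ```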
   
   You also asked about the timestamp characteristics of record keys, right? Let me illustrate with an example. Let's say your record keys are literally a timestamp field.
   
   During commit1, you ingest records whose keys have min and max values t1 and t100, and these go into data file1.
   With commit2, you ingest records with min and max t101 and t201, and these go into data file2.
   ... and with commit10, say you ingest records with min and max t900 and t1000, into data file10.
   
   Now, let's say you have some updates, for records with keys t55 to t65 and t305 to t310.
   Since each file is nicely laid out, using just the min/max values, all files except file1 and file3 will be filtered out in the first step of the index. Then the bloom filter lookup happens, and finally the actual files are read to find the right location for the keys.
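   That first pruning step can be sketched as a simple range-overlap check. Only the file1, file2, and file10 ranges come from the example above; the intermediate file ranges are my assumed fill-ins:

   ```python
   # Toy sketch of min/max-based file pruning for the index lookup. Each data
   # file records the min and max record key it contains; any file whose
   # [min, max] range does not overlap the incoming keys is skipped before
   # the bloom-filter step. Ranges for file3 and file4 are assumed.
   files = {
       "file1": (1, 100),      # commit1: t1..t100 (from the example)
       "file2": (101, 201),    # commit2: t101..t201 (from the example)
       "file3": (202, 330),    # assumed
       "file4": (331, 440),    # assumed
       "file10": (900, 1000),  # commit10: t900..t1000 (from the example)
   }

   def prune(files, update_keys):
       """Keep only files whose key range could contain an incoming key."""
       return sorted(
           name for name, (lo, hi) in files.items()
           if any(lo <= k <= hi for k in update_keys)
       )

   updates = list(range(55, 66)) + list(range(305, 311))  # t55-t65, t305-t310
   print(prune(files, updates))  # ['file1', 'file3']
   ```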
   
   Alternatively, let's say your record keys are random.
   commit1 -> data file1: min and max values are t5 and t500
   commit2 -> data file2: min and max values are t50 and t3000
   .
   .
   commit10 -> data file10: min and max values are t70 and t2500.
   
   Now when we get some updates, pretty much all files will be considered for the 2nd step of the index. In other words, min/max based pruning will not be effective; it just adds to your latency.
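   To see why, here is a small self-contained sketch of the same range-based pruning applied to the overlapping min/max values from the example:

   ```python
   # Toy sketch: with random record keys, file key ranges overlap heavily,
   # so min/max pruning cannot eliminate anything before the bloom-filter
   # step. Ranges are the ones from the example above.
   files = {
       "file1": (5, 500),
       "file2": (50, 3000),
       "file10": (70, 2500),
   }

   def prune(files, update_keys):
       """Keep only files whose [min, max] key range could contain a key."""
       return sorted(
           name for name, (lo, hi) in files.items()
           if any(lo <= k <= hi for k in update_keys)
       )

   # a single update key in the middle of the keyspace hits every file
   print(prune(files, [305]))  # ['file1', 'file10', 'file2']
   ```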
   
   Hope this clarifies what I mean by the timestamp characteristics of your record keys.
   

