nsivabalan commented on issue #2620:
URL: https://github.com/apache/hudi/issues/2620#issuecomment-816306202


   @codejoyan : sorry, this somehow slipped off my radar. 
   May I know what scale of data you are dealing with? I see your 
parallelism is very low (2). Can you try with 100 or more and see how it goes? 
   
   Among the 3 methods you have quoted, 2 are index related and the 3rd is 
the actual write operation. 
   
   The best way to decide on a partitioning strategy is to look at what your 
queries usually filter on. If it's date based, then you definitely need date in 
your partitioning strategy, which you already have. And if adding region would 
cut down most of the data to be looked up, sure. I assume this would also blow 
up your number of partitions in general, since it's no. of dates * no. of regions. 
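   To get a rough feel for how fast the partition count grows, here is a small 
sketch of that multiplication. The function name and the sample numbers (one 
year of daily partitions, 10 regions) are made up for illustration and are not 
taken from your table. 

```python
def estimated_partitions(num_dates: int, num_regions: int) -> int:
    """Upper bound on partition count for a date/region layout.

    Assumes every region has data for every date; real tables
    may have fewer partitions if some combinations are empty.
    """
    return num_dates * num_regions


# Hypothetical numbers: 365 daily partitions, 10 regions.
print(estimated_partitions(365, 10))  # 3650
```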
   
   wrt record keys and bloom: 
   You can try the regular bloom index ("BLOOM"). With this, there are a few 
config knobs; with simple bloom, we don't have a lot of config knobs to play 
around with. 
   Within a single batch of writes, do the records have some ordering to them, 
or are they just random? From your response I guess it's random. So you can 
turn off range pruning, since it may not help much: set 
   https://hudi.apache.org/docs/configurations.html#bloomIndexPruneByRanges to 
false (default value is true). 
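   Put together, the suggestions above could look roughly like the options 
below. This is only a sketch: the helper name `bloom_index_options` is made up, 
and I am assuming the parallelism in question is the insert/upsert shuffle 
parallelism; adjust the keys if your job sets a different one. 

```python
def bloom_index_options(parallelism: int = 100) -> dict:
    """Build Hudi write options for the suggestions in this comment.

    Hypothetical helper, not part of Hudi. Values are strings because
    Spark datasource options are passed as strings.
    """
    return {
        # Use the regular bloom index.
        "hoodie.index.type": "BLOOM",
        # Record keys look random within a batch, so range pruning
        # is unlikely to help; the default is true.
        "hoodie.bloom.index.prune.by.ranges": "false",
        # Assumption: these are the parallelism settings that were at 2.
        "hoodie.insert.shuffle.parallelism": str(parallelism),
        "hoodie.upsert.shuffle.parallelism": str(parallelism),
    }


opts = bloom_index_options(100)
```

   The dict would then be merged with your existing options and passed to the 
writer, e.g. `df.write.format("hudi").options(**opts)`, alongside the table 
name, record key and partition path settings you already have. 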
   
   @n3nash : do you have any pointers here? 
   
   
   
   



