nsivabalan commented on issue #2620: URL: https://github.com/apache/hudi/issues/2620#issuecomment-816306202
@codejoyan: Sorry, this somehow slipped off my radar. May I know what scale of data you are dealing with? I see your parallelism is very low (2). Can you try 100 or more and see how it goes? Of the 3 methods you quoted, 2 are index related and the 3rd is the actual write operation.

The best way to decide on a partitioning strategy is to look at what your queries usually filter on. If it's date based, then you definitely need date in your partitioning strategy, which you already have. And if adding region would cut down most of the data to be looked up, sure. I assume this would also blow up your number of partitions in general, since it is no. of dates * no. of regions.

W.r.t. record keys and bloom: you can try the regular bloom index ("BLOOM"). With this, there are a few config knobs. With simple bloom, we don't have a lot of config knobs to play around with. Within a single batch of writes, do the records have some ordering, or are they just random? From your response I guess it's random. So you can turn off range pruning, since it may not help much: set https://hudi.apache.org/docs/configurations.html#bloomIndexPruneByRanges to false (the default value is true).

@n3nash: do you have any pointers here?
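For reference, the suggestions in this comment could be collected into write options roughly as follows. This is only a sketch under assumptions: the comment does not say which parallelism setting was at 2, so mapping it to `hoodie.upsert.shuffle.parallelism` and `hoodie.bloom.index.parallelism` is a guess, and the helper names `suggested_hudi_options` and `estimated_partition_count` are hypothetical. Please check the keys against the configuration page for your Hudi version.

```python
def suggested_hudi_options(parallelism: int = 100) -> dict:
    """Collect the settings discussed above into a plain options dict.

    Assumption: "parallelism" in the comment refers to the upsert shuffle
    and bloom index lookup parallelism; adjust if your job sets another key.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be a positive integer")
    return {
        # Regular bloom index, which exposes more tuning knobs.
        "hoodie.index.type": "BLOOM",
        # Record keys look random within a batch, so range pruning
        # is unlikely to help; the default is true.
        "hoodie.bloom.index.prune.by.ranges": "false",
        # Raised from 2 to 100 or more, as suggested.
        "hoodie.upsert.shuffle.parallelism": str(parallelism),
        "hoodie.bloom.index.parallelism": str(parallelism),
    }


def estimated_partition_count(num_dates: int, num_regions: int) -> int:
    """Upper bound on partitions when partitioning by date and region."""
    return num_dates * num_regions


if __name__ == "__main__":
    for key, value in sorted(suggested_hudi_options(100).items()):
        print(f"{key}={value}")
    # Illustrative numbers only: one year of daily data across 10 regions.
    print(estimated_partition_count(365, 10))
```

With PySpark, a dict like this would typically be passed along with your existing table options, e.g. `df.write.format("hudi").options(**opts)`. The partition estimate is just the product mentioned above, so it is worth checking before adding region to the partition path.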