nsivabalan commented on issue #4873: URL: https://github.com/apache/hudi/issues/4873#issuecomment-1301568436
By the way, I'm not sure whether I've called this out before: I see you are partitioning by hour. That results in very high partition cardinality, more than 25k partitions for a few years of data. It's generally advisable to keep the total number of partitions at 10k or less; otherwise, a lot of time has to go into performance tuning. Alternatively, you can use clustering to cluster your data by hour and get similar benefits from column stats pruning with the metadata table.
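
To make the numbers concrete, here is a rough sketch of the partition arithmetic and of what the clustering alternative might look like as writer options. The field names `event_date` and `event_hour` are hypothetical, and the option keys are the ones I believe Hudi uses for inline clustering, the metadata table, and column stats; please check them against the configuration reference for your Hudi version before relying on them.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # leap days ignored; this is only an estimate


def hourly_partitions(years: int) -> int:
    """Approximate partition count when partitioning by hour."""
    return years * DAYS_PER_YEAR * HOURS_PER_DAY


def daily_partitions(years: int) -> int:
    """Approximate partition count when partitioning by day."""
    return years * DAYS_PER_YEAR


def clustering_options(partition_field: str = "event_date",
                       sort_column: str = "event_hour") -> dict:
    """Writer options for day partitions plus clustering sorted by hour.

    Both default field names are hypothetical placeholders. The keys are
    assumed from the Hudi config reference, not taken from this thread.
    """
    return {
        # Coarser partitioning keeps the partition count low.
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Inline clustering, sorting data files by the hour column.
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.inline.max.commits": "4",
        "hoodie.clustering.plan.strategy.sort.columns": sort_column,
        # Metadata table with column stats, used for file pruning on reads.
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.index.column.stats.enable": "true",
        # Read-side option; set it on the reader rather than the writer.
        "hoodie.enable.data.skipping": "true",
    }


if __name__ == "__main__":
    print("hourly partitions over 3 years:", hourly_partitions(3))
    print("daily partitions over 3 years:", daily_partitions(3))
    for key, value in clustering_options().items():
        print(f"{key}={value}")
```

With hourly partitioning, three years already comes to 3 x 365 x 24 = 26,280 partitions, while daily partitioning over the same period stays around 1,095. Queries filtering on the hour column could then rely on column stats to skip files within each day partition.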
