garyli1019 commented on pull request #1602: URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-628817124
@vinothchandar I definitely agree that a statistical table would be a better approach, but I believe it will take a while. I am happy to contribute to that work as well.

Is there any other recommendation for a short-term fix for this issue? I believe this bug could happen again. When something upstream goes wrong, for example Kafka or HDFS going down in production for a short period of time, Hudi may write an abnormally small commit.

Regarding the bloom filter size, I think all of them (simple, dynamic, local, and global) use the bloom filter entries and the FP rate to calculate the size. Once we switch to the parquet native approach, we can change how the estimation is done, and I think the calculation could then be accurate. The HBase index is not covered in this PR, though.
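To make the entries/FP-rate point concrete, here is a minimal sketch of the textbook bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2. It is only an illustration: the function names `bloom_filter_bits` and `optimal_num_hashes` are hypothetical, the input values are made up, and I am not claiming this is exactly what Hudi's code does.

```python
import math


def bloom_filter_bits(num_entries: int, fp_rate: float) -> int:
    """Textbook bit-array size for a bloom filter.

    Hypothetical helper for illustration; not taken from the Hudi code base.
    """
    if num_entries <= 0:
        raise ValueError("num_entries must be positive")
    if not 0.0 < fp_rate < 1.0:
        raise ValueError("fp_rate must be strictly between 0 and 1")
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    return math.ceil(-num_entries * math.log(fp_rate) / (math.log(2) ** 2))


def optimal_num_hashes(num_bits: int, num_entries: int) -> int:
    """Textbook number of hash functions, k = (m / n) * ln 2, at least 1."""
    return max(1, round(num_bits / num_entries * math.log(2)))


if __name__ == "__main__":
    # Illustrative values only (assumed, not read from any real config).
    for entries in (1_000, 60_000):
        bits = bloom_filter_bits(entries, 0.01)
        hashes = optimal_num_hashes(bits, entries)
        print(f"entries={entries} bits={bits} hashes={hashes}")
```

With a fixed FP rate the size grows linearly with the number of entries, so the estimate is only as good as the entry count that is fed into it.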