nsivabalan commented on issue #3077: URL: https://github.com/apache/hudi/issues/3077#issuecomment-891916056
@karan867: sorry for the delay, let me try to help you out. @abdulmuqeeth: can you please file a new GitHub issue and CC me on it? Let's not pollute this issue, but feel free to follow it for pointers.

@karan867: coming back to your use case:

1. What max parquet file size are you setting? If it's the default, files are ~120MB.
2. What's your average record size? This will help determine your bloom filter numbers.
3. During your first ingest, what was `hoodie.copyonwrite.insert.split.size` set to? This determines the size of each data file (only during the first ingest, since Hudi has no stats yet to estimate the record size).
4. Does your record key have any timestamp-ordering characteristics? If yes, a bloom-type index would help a lot: every data file stores a min key and a max key in its footer, and Hudi uses that range to prune data files. If keys are random, we can disable range pruning and see if that helps. Regarding the SIMPLE index, you are right: its cost is relative to your dataset size, so as the dataset grows, SIMPLE index time will grow with it.
5. What percentage of your writes are updates vs. new inserts?
6. Disabling small-file handling should work: only updates should get routed to old data files, and all new inserts should go into new data files. If not, it could be a regression. Can you confirm whether you see inserts going to old data files even after disabling small-file handling?
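The configs discussed above can be sketched as a PySpark write-options dict. The config keys are real Hudi options, but the table name, record-key field, and the specific values are illustrative assumptions for this thread, not recommendations:

```python
# Hedged sketch of the Hudi write options discussed above.
# Values are illustrative; tune them for your own workload.
hudi_options = {
    "hoodie.table.name": "my_table",                    # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "uuid",  # hypothetical key field
    # Q1: max parquet file size (default is ~120MB, i.e. 125829120 bytes)
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    # Q3: controls data-file sizing on the very first ingest only,
    # before Hudi has stats to estimate record size
    "hoodie.copyonwrite.insert.split.size": "500000",
    # Q4: bloom index; disable min/max range pruning if keys are random
    "hoodie.index.type": "BLOOM",
    "hoodie.bloom.index.prune.by.ranges": "false",
    # Q6: setting the small-file limit to 0 disables small-file handling,
    # so new inserts should always go to new data files
    "hoodie.parquet.small.file.limit": "0",
}

# Typical usage (requires a SparkSession with the Hudi bundle on the classpath):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```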
