nsivabalan commented on issue #4873: URL: https://github.com/apache/hudi/issues/4873#issuecomment-1050409991
Yes, do set `hoodie.compact.inline` to true and get compaction moving; later you can think about switching to the async flow.

Regarding partitioning strategy, it depends on your query patterns. If your queries mostly have predicates on dates, partitioning by date is a wise choice. Your ingestion might also speed up depending on your partitioning strategy: for example, if your dataset is date-partitioned and your incoming records only touch 5 partitions, the index lookup happens only within those 5 partitions. But if your partitioning is more complex, based on both date and some other field X, the number of partitions touched might be higher. Also, do check the cardinality of your partitioning scheme: avoid a very high number of (small) partitions as well as too few (large) partitions; try to find some middle ground.

You also asked about the timestamp characteristics of record keys. Let me try to illustrate with an example. Say your record keys are literally a timestamp field. During commit1 you ingest records with min and max values t1 to t100, and these go into data file1. With commit2 you ingest records with min and max t101 to t201, which go into data file2. And so on, until commit10, where you ingest records with min and max t900 to t1000 into data file10.

Now say you have updates for records with keys t55 to t65 and t305 to t310. Since each file is nicely laid out, just the min/max values are enough to filter out all files except file1 and file3 in the first step of the index. Then the bloom filter lookup happens, and finally the actual files are looked up to find the right locations for the keys.

Alternatively, say your record keys are random:

commit1 -> data file1: min and max values are t5 and t500
commit2 -> data file2: min and max values are t50 and t3000
...
commit10 -> data file10: min and max values are t70 and t2500

When updates arrive, pretty much all files will be considered for the 2nd step of the index.
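To make the pruning step concrete, here is a small Python sketch (not Hudi's actual implementation) of how min/max range pruning narrows the candidate files before the bloom filter step. The file ranges and update keys mirror the example above; file3's range is an assumption for illustration.

```python
# Hypothetical sketch of min/max based file pruning (first step of the index).
# Not Hudi's real code; it only illustrates why key layout matters.

def prune_files(file_ranges, update_keys):
    """Keep only files whose [min, max] key range may contain an update key."""
    candidates = set()
    for name, (lo, hi) in file_ranges.items():
        if any(lo <= k <= hi for k in update_keys):
            candidates.add(name)
    return candidates

# Case 1: monotonically increasing (timestamp-like) keys -> tight, disjoint ranges.
sequential = {
    "file1": (1, 100),     # commit1: t1..t100
    "file2": (101, 201),   # commit2: t101..t201
    "file3": (202, 310),   # range assumed for illustration
    "file10": (900, 1000), # commit10: t900..t1000
}
updates = [55, 60, 65, 305, 310]  # updates for t55..t65 and t305..t310
print(prune_files(sequential, updates))   # only file1 and file3 survive

# Case 2: random keys -> wide, overlapping ranges; pruning filters almost nothing.
random_keys = {
    "file1": (5, 500),
    "file2": (50, 3000),
    "file10": (70, 2500),
}
print(prune_files(random_keys, updates))  # all files remain candidates
```

In the second case every file survives the range check, so the bloom filter (and possibly the actual file scan) has to run against all of them, which is the extra latency described below.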
In other words, min/max based pruning will not be effective and just adds to your latency. Hope this clarifies what I mean by timestamp characteristics in your record keys.
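Pulling the configuration advice together, a hedged sketch of the write options might look like the following. The option keys are standard Hudi configs; the values (and the `event_date` partition column) are illustrative assumptions, not a recommendation for your exact workload.

```python
# Illustrative Hudi write options: inline compaction plus date-based partitioning.
# Values are placeholders to be tuned for your workload.
hudi_options = {
    "hoodie.compact.inline": "true",                 # run compaction as part of the write
    "hoodie.compact.inline.max.delta.commits": "5",  # compact after N delta commits (tune)
    "hoodie.datasource.write.partitionpath.field": "event_date",  # assumed date column
}

# In a real Spark job these would be passed to the writer, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
print(hudi_options["hoodie.compact.inline"])
```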
