dmenin commented on issue #3975: URL: https://github.com/apache/hudi/issues/3975#issuecomment-1004110981
> Hi @nsivabalan

"How is your updates/deletes spread in general?"

I have the data partitioned by year/month/day and deal with time-related data. The vast majority of the updates happen on the same partition; however, a percentage happens on different partitions (yesterday's and the days before). This percentage gets smaller towards the end of the day. Since I get a batch of data from the source system containing "everything that changed in the last X hours" (X can vary between 24 and 72), at 00:10 there are a lot of updates because most of the records refer to yesterday's partition. At 23:00, however, I would mainly be accessing the data in today's partition.

Regarding indexes: first of all, none of the GLOBAL indexes perform well due to the amount of data and partitions (we have data for more than 3 years, and it takes a VERY long time to build the index), so I HAVE to use SIMPLE. By the way, this is the reason why I do a delete and an upsert: if I don't delete some records, I'd end up with duplicates. I have also tried BLOOM, but unfortunately it didn't affect the performance.

"Disabling small files" also didn't help, and I don't understand the logic behind changing that setting when we are just deleting data. Could you elaborate on why you think that would help?

Regarding the MOR exploration, I found a problem where timestamp columns are not supported on the "_rt" tables. It works fine on the "_ro" tables, but on the "_rt" tables I get a "GENERIC_INTERNAL_ERROR: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable" error. I am working with AWS support to fix this; they were able to reproduce the problem and have an ETA for a fix at the end of January.

All that being said, I'll just take this opportunity to suggest a HUDI feature: the ability to have a "GLOBAL" index on a predetermined list of partitions.
For example, considering that today is the 3rd of January 2022, if I could run my upsert and, in one of the parameters, say "create the index on these partitions":

year=2021/month=12/day=31
year=2022/month=1/day=1
year=2022/month=1/day=2
year=2022/month=1/day=3

that would solve my problem. Is this kind of thing on the roadmap?

Thanks,
Diego
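To make the partition list in the example concrete, here is a minimal Python sketch that builds the year/month/day partition paths for a given day and the few days before it. It only computes the list of paths; the comment above describes passing such a list to Hudi as a requested feature, not an existing option, so nothing here calls Hudi. The function name `recent_partitions` and the `days_back` parameter are hypothetical names chosen for illustration.

```python
from datetime import date, timedelta
from typing import List


def recent_partitions(today: date, days_back: int) -> List[str]:
    """Return partition paths for `today` and the `days_back` days before it, oldest first.

    The path layout (year=YYYY/month=M/day=D, no zero padding) follows the
    example in the comment; adjust it if the table uses a different layout.
    """
    days = [today - timedelta(days=offset) for offset in range(days_back, -1, -1)]
    return [f"year={d.year}/month={d.month}/day={d.day}" for d in days]


if __name__ == "__main__":
    # Reproduces the four partitions from the example (3rd of January 2022).
    for path in recent_partitions(date(2022, 1, 3), days_back=3):
        print(path)
```

With batches covering the last 24 to 72 hours, `days_back=3` would cover every partition an incoming batch could touch.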
