rubenssoto commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-665432999
Hi bvaradar, how are you? I hope you are doing fine!

I have a new case, which is a little more important to me, and the problem is almost the same. I adopted the strategy of first loading all the data in one batch with an insert operation and, after that, picking up the latest data with structured streaming.

To answer your question: all my tables have an integer id as PK, and normally it is auto-increment. Does Hudi already order the data by PK in an insert operation? In my first batch I am sorting the data by date; is that necessary?

I think I have the CoW problem that you described. I have an order table with my clients' orders. New orders arrive every minute, and my clients can give a grade to an order at any point in time. For example, a streaming batch could contain a grade for an order that was made last month.

Today this table is very small: the Hudi dataset has 15 files of 500 MB each. I didn't partition the table because a daily partition would be small, and I don't think partitioning by month makes sense.

My streaming job is running right now, but Hudi rewrites all 15 files on every streaming batch. My data is small, so it's fine for now, but I think it is not efficient, and it could become a problem when the data grows.

I will use AWS Athena to query all my tables, and this specific order table may be delayed by up to 15 minutes. I saw that Athena only queries MoR tables as Read Optimized. How could MoR help me in this case?

One last question: in an insert operation, how can I control the file size?

Thank you for your time!

Some images of my streaming:

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
