[GitHub] [hudi] rubenssoto commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

GitBox Wed, 29 Jul 2020 00:30:21 -0700


rubenssoto commented on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-665432999



   Hi bvaradar, how are you? I hope you are doing fine!
   
   I have a new case, which is a little more important to me, and the problem 
is almost the same. I adopted the strategy of first loading all the data in a 
single batch insert operation and then picking up the latest data with 
structured streaming. 
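
Roughly, the two phases differ only in the write operation. The sketch below is illustrative, not my exact job: the table name `orders` and the columns `id` and `updated_at` are placeholders, the helper `write_options` is hypothetical, and I am assuming the streaming phase runs as an upsert.

```python
# Sketch only: table and column names are placeholders, not the real schema.
COMMON_OPTIONS = {
    "hoodie.table.name": "orders",
    # Integer auto-increment primary key used as the Hudi record key.
    "hoodie.datasource.write.recordkey.field": "id",
    # Column used to pick the latest version of a record (assumed name).
    "hoodie.datasource.write.precombine.field": "updated_at",
}


def write_options(operation):
    """Return Hudi datasource write options for 'insert' or 'upsert'."""
    if operation not in ("insert", "upsert"):
        raise ValueError("unsupported operation: " + operation)
    options = dict(COMMON_OPTIONS)  # copy so the shared dict is not mutated
    options["hoodie.datasource.write.operation"] = operation
    return options


# Phase 1: one-off batch load of the historical data.
initial_load_options = write_options("insert")
# Phase 2: each structured streaming micro-batch (assumed to be an upsert).
streaming_options = write_options("upsert")
```

Each dictionary would then be passed to the Spark writer, for example `df.write.format("org.apache.hudi").options(**initial_load_options)`, with the save mode and base path chosen for the job.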
   
   To answer your question: all my tables have an integer primary key, 
normally auto-increment. Does Hudi already sort the data by primary key during 
an insert operation? In my first batch I am sorting the data by date; is that 
necessary?
   
   I think I have the CoW problem that you described. I have an orders table 
with my clients' orders. New orders arrive every minute, and a client can 
grade an order at any point in time. For example, a single streaming batch 
could contain a grade for an order that was placed last month.
   
   Today this table is very small: the Hudi dataset has 15 files of about 
500 MB each. I didn't partition the table because a daily partition would be 
too small and I don't think partitioning by month makes sense. 
   My streaming job is running right now, but Hudi rewrites all 15 files on 
every streaming batch. My data is small, so it's fine for now, but I don't 
think it is efficient, and it could become a problem as the data grows.
   
   I will use AWS Athena to query all my tables, and this specific orders 
table can tolerate a delay of up to 15 minutes. I saw that Athena can only 
query the Read Optimized view of MoR tables; how could MoR help me in this 
case?
   
   One last question: in an insert operation, how can I control the file size?
   
   Thank you for your time!
   
   Some images of my streaming:
   <img width="1680" alt="Captura de Tela 2020-07-29 às 02 03 54" 
src="https://user-images.githubusercontent.com/36298331/88758874-ea2a3180-d13f-11ea-914c-268135f002f9.png">
   <img width="1680" alt="Captura de Tela 2020-07-29 às 02 03 33" 
src="https://user-images.githubusercontent.com/36298331/88758879-ebf3f500-d13f-11ea-9f13-0e731940b605.png">
   <img width="1680" alt="Captura de Tela 2020-07-29 às 02 01 51" 
src="https://user-images.githubusercontent.com/36298331/88758885-ee564f00-d13f-11ea-802b-c896de02ded7.png">
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
