machadoluiz commented on issue #8824: URL: https://github.com/apache/hudi/issues/8824#issuecomment-1570666233
Thank you for your attention to our issue, @nfarah86 and @ad1happy2go. Here are our answers to each of your questions:

> following up from slack: 6 years of data in the active timeline is a lot of data.
>
> 1. what kind of queries are you running? Do you need incremental queries across 6 years of data?
> 2. Do you have a multi-writer situation where multiple writers are writing to the same table?
> 3. Can you share the Hudi timeline in the .hoodie folder?
> 4. is the data mostly insert or upsert or a mix of both?
> 5. How are you partitioning the data?

We acknowledge that 6 years is a large amount of data, but we need to keep the history of each run over time, as the data are used to make decisions that affect other companies. For this reason, we must log the state of the data at the moment it was used for decision making. For legal reasons, we need to store this history for auditing and future consultation. We also have records in the database with retroactive dates, which may affect the results of our indicators.

1. Usually, we query the latest version of the data, apply filters, and perform other operations. However, we need to store the data history because it will eventually be necessary to consult a specific period of time. That is why we chose Hudi.
2. No, each table has a single script, and the scripts are not run in parallel.
3. For legal reasons, we cannot share the files of the actual tables, but we can share the files from the example described above, in which we simulated the problem. Here is the link to download them: [Google Drive](https://drive.google.com/drive/folders/1Iyu9AlwVHSqQLN8cR5diOF2pVZtl96ib)
4. It varies, but the most common operations are "insert_overwrite_table", "insert", and "upsert". In the example above, we tested with "insert_overwrite_table".
5. The partitioning also varies depending on the table size. Usually, we use the script execution day (LOAD_DATE) or a specific column of the data itself as the partitioning key.
Moreover, there are cases of very small tables that are not partitioned.

> Also Please confirm you are using COW table as I don't see table.type in configs. Default value is COW.

We are using CoW, since we did not configure the table type explicitly, but we also tested MoR and the performance issue remains.
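For reference, the write configuration discussed in the answers above could be sketched roughly as follows. This is only an illustrative assumption, not our actual job: the table name, record key, and precombine field are placeholders, and only the operation, table type, and partition-path settings reflect what is described in this comment.

```python
# Hedged sketch of Spark datasource options for the setup described above.
# "example_table" and "id" are placeholder assumptions; only the operation,
# table type, and partition-path field come from this comment.
hudi_options = {
    "hoodie.table.name": "example_table",                     # assumed name
    "hoodie.datasource.write.recordkey.field": "id",          # assumed key
    "hoodie.datasource.write.precombine.field": "LOAD_DATE",  # assumed
    # Setting the table type explicitly instead of relying on the default:
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",    # or "MERGE_ON_READ"
    # The operation used in the simulated example:
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    # Partitioning by script execution day, as described in answer 5:
    "hoodie.datasource.write.partitionpath.field": "LOAD_DATE",
}

# Typical usage from a Spark DataFrame `df`:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```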
