machadoluiz commented on issue #8824: URL: https://github.com/apache/hudi/issues/8824#issuecomment-1570666233
Thank you for your attention to our issue, @nfarah86 and @ad1happy2go. Here are our answers to each of your questions:

> following up from slack: 6 years of data in the active timeline is a lot of data.
>
> 1. what kind of queries are you running? Do you need incremental queries across 6 years of data?
> 2. Do you have a multi-writer situation where multiple writers are writing to the same table?
> 3. Can you share the Hudi timeline in the .hoodie folder?
> 4. is the data mostly insert or upsert or a mix of both?
> 5. How are you partitioning the data?

We acknowledge that 6 years is a large amount of data, but we need to keep the history of each run over time, as the data are used to make decisions that affect other companies. For this reason, we must log the state of the data at the moment it was used for decision making. For legal reasons, we need to store this history for auditing and future consultation. We also have records in the database with retroactive dates, which may affect the results of our indicators.

1. Usually, we query the latest version of the data, apply filters, and perform other operations. However, we need to store the data history because it will eventually be necessary to consult a specific period of time. That is why we chose Hudi.
2. No, each table has a single script, and the scripts are not run in parallel.
3. For legal reasons, we cannot share the files of the actual tables, but we can share the files from the example described above, in which we simulated the problem. Here is the link to download them: [Google Drive](https://drive.google.com/drive/folders/1Iyu9AlwVHSqQLN8cR5diOF2pVZtl96ib)
4. It varies, but the most common operations are "insert_overwrite_table", "insert", and "upsert". In the example above, we tested with "insert_overwrite_table".
5. The partitioning also varies depending on the table size. Usually, we use the script execution day (LOAD_DATE) or a specific column of the data itself as the partitioning key.
Moreover, there are cases of very small tables that are not partitioned.

> Also Please confirm you are using COW table as I don't see table.type in configs. Default value is COW.

We are using CoW, since we did not configure the table type explicitly, but we also tested MoR and the performance issue remains.
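For reference, the write configuration discussed in the answers above could be sketched roughly as follows. This is only an illustrative assumption, not our actual job: the table name, record key, and precombine field are placeholders, and only the operation, table type, and partition-path settings reflect what is described in this comment.

```python
# Hedged sketch of Spark datasource options for the setup described above.
# "example_table" and "id" are placeholder assumptions; only the operation,
# table type, and partition-path field come from this comment.
hudi_options = {
    "hoodie.table.name": "example_table",                     # assumed name
    "hoodie.datasource.write.recordkey.field": "id",          # assumed key
    "hoodie.datasource.write.precombine.field": "LOAD_DATE",  # assumed
    # Setting the table type explicitly instead of relying on the default:
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",    # or "MERGE_ON_READ"
    # The operation used in the simulated example:
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    # Partitioning by script execution day, as described in answer 5:
    "hoodie.datasource.write.partitionpath.field": "LOAD_DATE",
}

# Typical usage from a Spark DataFrame `df`:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```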
