Hi, we have a requirement to keep an audit history of every change, and sometimes we need to query that history as well. In an RDBMS we keep separate audit_history tables. In HUDI, however, a new file version is written on every ingestion, and I want to leverage that, so I have a question about incremental queries: does an incremental query run only on the latest parquet file, or on all the parquet files in the partition? From what I can see, it runs only on the latest parquet file.
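For context, this is roughly how I understand an incremental query is issued through the Spark datasource (a minimal sketch in Java; basePath is a placeholder for my table location, and the begin instant "000" is just meant to say "from the very first commit"):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HudiIncrementalReadSketch {
        public static void main(String[] args) {
            // Requires the hudi-spark bundle on the classpath.
            SparkSession spark = SparkSession.builder()
                    .appName("hudi-incremental-read")
                    .getOrCreate();

            // Placeholder: in my setup this comes from hudiConfig.
            String basePath = "/path/to/hudi/table";

            // Incremental query: records changed after the given begin instant,
            // each returned with its latest value only.
            Dataset<Row> incremental = spark.read()
                    .format("hudi")
                    .option("hoodie.datasource.query.type", "incremental")
                    .option("hoodie.datasource.read.begin.instanttime", "000")
                    .load(basePath);

            incremental.show(false);
            spark.stop();
        }
    }

Even with the begin instant set before the very first commit, I only get back one row per key (the latest version), never the older values.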
Let me illustrate what we need. For example, we have data with 2 columns (id | name), where id is the primary key.

Batch 1 - inserts 2 records: 1 | Tom ; 2 | Jerry. A new parquet file, say 1.parquet, is created with these 2 entries.

Batch 2 - upserts 2 records: 1 | Mickey ; 3 | Donald, so the record with key 1 is updated from Tom to Mickey. A new parquet file, say 2.parquet, is created with the following entries:
1 | Mickey (record updated)
2 | Jerry (record not changed, carried over)
3 | Donald (new record)

Now, when I query the table I get (1 | Mickey), but I never get (1 | Tom), since that version only lives in the old parquet file. So doesn't an incremental query look at the old parquet files? I can use plain vanilla Spark to achieve this, but is there a better way to get the audit history of updated rows using HUDI?

1) Using Spark I can read all the parquet files directly (without hoodie), as in the sketch below:
spark.read().load(hudiConfig.getBasePath() + hudiConfig.getTableName() + "//*//*//*.parquet");
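To make option 1 concrete, here is a minimal sketch of what I mean by the plain-Spark approach (Java; tablePath is a placeholder for hudiConfig.getBasePath() plus the table name, and the glob depth depends on how the table is partitioned). It reads every parquet file under the table, including the older file versions, and uses the _hoodie_record_key / _hoodie_commit_time metadata columns that Hudi writes into each record to line up the versions per key:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import static org.apache.spark.sql.functions.col;

    public class HudiAuditHistorySketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hudi-audit-history")
                    .getOrCreate();

            // Placeholder: hudiConfig.getBasePath() + hudiConfig.getTableName() in my setup.
            // Adjust the glob below to the actual partition layout.
            String tablePath = "/path/to/hudi/table";

            // Read every parquet file under the table, including the older file
            // versions that the Hudi datasource would normally skip.
            Dataset<Row> allVersions = spark.read().parquet(tablePath + "/*/*/*.parquet");

            // Records carried over unchanged into newer files appear more than once,
            // so deduplicate on (record key, commit time) and order by commit time
            // to get one row per version of each key.
            Dataset<Row> history = allVersions
                    .dropDuplicates(new String[]{"_hoodie_record_key", "_hoodie_commit_time"})
                    .orderBy(col("_hoodie_record_key"), col("_hoodie_commit_time"));

            history.select("_hoodie_commit_time", "id", "name").show(false);
            spark.stop();
        }
    }

For key 1 this gives me both (1 | Tom) and (1 | Mickey), which is the audit history I am after, but it bypasses Hudi completely, so I am wondering whether there is a Hudi-native way to do the same.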
