Hi,
We have a requirement to keep an audit history of every change, and we sometimes 
query it as well. In an RDBMS we keep separate audit_history tables. In Hudi, 
however, history is created at every ingestion and I want to leverage that, so I 
have a question about incremental queries.
Does an incremental query run only on the latest parquet file, or on all the 
parquet files in the partition? From what I can see, it runs only on the latest 
parquet file.

Let me illustrate what we need. For example, we have data with 2 columns (id | 
name), where id is the primary key.

Batch 1 -
Inserted 2 records --> 1 | Tom ; 2 | Jerry
A new parquet file is created, say 1.parquet, with these 2 entries.

Batch 2 -
Inserted 2 records --> 1 | Mickey ; 3 | Donald. Here the record with primary key 1 
is updated from Tom to Mickey.
A new parquet file is created, say 2.parquet, with the following entries -
1 | Mickey (record updated)
2 | Jerry (record not changed, retained)
3 | Donald (new record)
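
For context, this is roughly how each batch is written through the Spark 
datasource (a sketch only: hudiConfig and the "ts" precombine column are 
placeholders, and the option keys may vary slightly across Hudi versions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Upsert one batch into the table; the batch has columns id, name, ts.
void upsertBatch(Dataset<Row> batch) {
    batch.write()
         .format("hudi")
         .option("hoodie.table.name", hudiConfig.getTableName())
         .option("hoodie.datasource.write.recordkey.field", "id")   // primary key
         .option("hoodie.datasource.write.precombine.field", "ts")  // breaks ties on the same key
         .option("hoodie.datasource.write.operation", "upsert")     // batch 2 overwrites id = 1
         .mode(SaveMode.Append)
         .save(hudiConfig.getBasePath() + hudiConfig.getTableName());
}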

Now, when I run the incremental query I get (1 | Mickey), but I never get (1 | Tom) 
because it is in the old parquet file. So does the incremental query not run on old 
parquet files?
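
For reference, this is roughly how the incremental query is invoked (a sketch; 
the begin instant time is a placeholder, and these option keys are from the Hudi 
Spark datasource, so they may differ slightly by version):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> incremental = spark.read()
    .format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "<commit time before batch 1>") // placeholder
    .load(hudiConfig.getBasePath() + hudiConfig.getTableName());

// Per the observation above, this only ever returns the latest version of id = 1.
incremental.select("_hoodie_commit_time", "id", "name").show();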

I can achieve this with plain vanilla Spark, but is there a better way to get the 
audit history of updated rows using Hudi?
1) Using Spark I can read all the parquet files directly (without hoodie) -
spark.read().load(hudiConfig.getBasePath() + hudiConfig.getTableName() + 
"/*/*/*.parquet");


 
