Re: Query Incremental Updates on same primary key

Satish Kotha Fri, 29 May 2020 10:18:50 -0700

Hi,


> Now, when I query I get (1 | Mickey) but I never get (1 | Tom), as it's in
> the old parquet file. So doesn't an incremental query run on old parquet
> files?
>

Could you share the command you are using for the incremental query? Hudi
requires specific configuration for incremental queries. Please see the
example here
<https://hudi.apache.org/docs/docker_demo.html#step-7-b-incremental-query-with-spark-sql>
and more documentation here
<https://hudi.apache.org/docs/querying_data.html#spark-incr-query>. Please
try this and let me know if it works as expected.
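
For reference, a minimal sketch of the reader options an incremental query
typically needs. The class name, the helper method, and the begin instant
value are hypothetical. The two option keys are the ones I recall from
Hudi's DataSourceReadOptions; older releases used
"hoodie.datasource.view.type" instead of "hoodie.datasource.query.type", so
please check the docs linked above for your version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the reader options for an incremental query.
class IncrementalReadOptions {

    // beginInstantTime is a commit timestamp; only records written by
    // commits after this instant are expected to be returned.
    static Map<String, String> incrementalOptions(String beginInstantTime) {
        Map<String, String> opts = new LinkedHashMap<>();
        // Assumed key names; verify against DataSourceReadOptions in your release.
        opts.put("hoodie.datasource.query.type", "incremental");
        opts.put("hoodie.datasource.read.begin.instanttime", beginInstantTime);
        return opts;
    }

    public static void main(String[] args) {
        // "20200529000000" is a made-up instant used only for illustration.
        Map<String, String> opts = incrementalOptions("20200529000000");
        opts.forEach((k, v) -> System.out.println(k + "=" + v));

        // With Spark and the Hudi bundle on the classpath, the options would
        // be passed to the reader roughly like this (not executed here):
        //   Dataset<Row> changes = spark.read()
        //       .format("org.apache.hudi")
        //       .options(opts)
        //       .load(basePath);
    }
}
```

Without these options the read behaves like a regular snapshot query, which
is why the command you ran matters here.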

Thanks
Satish

On Fri, May 29, 2020 at 5:18 AM tanujdua <[email protected]> wrote:

> Hi,
> We have a requirement to keep an audit_history of every change and
> sometimes query it as well. In an RDBMS we have separate tables for
> audit_history. However, in HUDI, history is created at every ingestion
> and I want to leverage that, so I have a question on incremental queries.
> Does an incremental query run on the latest parquet file or on all the
> parquet files in the partition? I can see it runs only on the latest
> parquet file.
>
> Let me illustrate what we need. For example, we have data with 2 columns -
> (id | name) where id is the primary key.
>
> Batch 1 -
> Inserted 2 records --> 1 | Tom ; 2 | Jerry
> A new parquet file is created, say 1.parquet, with these 2 entries
>
> Batch 2 -
> Inserted 2 records --> 1 | Mickey ; 3 | Donald. So here the record with
> primary key 1 is updated from Tom to Mickey
> A new parquet file is created, say 2.parquet, with the following entries -
> 1 | Mickey (Record Updated)
> 2 | Jerry (Record Not changed and retained)
> 3 | Donald (New Record)
>
> Now, when I query I get (1 | Mickey) but I never get (1 | Tom), as it's in
> the old parquet file. So doesn't an incremental query run on old parquet
> files?
>
> I can use plain vanilla Spark to achieve this, but is there a better way
> to get the audit history of updated rows using HUDI?
> 1) Using Spark I can read all parquet files (without hoodie) -
> spark.read().load(hudiConfig.getBasePath() + hudiConfig.getTableName() +
> "//*//*//*.parquet");
>
>
>
>
