Yes, I followed that and wrote the queries accordingly.
I believe the difference is primary key selection: in the examples, the
primary key is always unique (like a UUID), which means every data
ingestion is an insert, so both the old and the new records end up in the
latest parquet file.
In my case the primary key is not always unique, so an update is triggered
and the new file has the updated value, not the old one.
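To make that concrete, here is a small plain-Python simulation of the two cases (this is only an illustration of the upsert semantics as I understand them, not actual Hudi code; the keys and values are made up):

```python
def ingest(existing, batch):
    """Upsert semantics: existing keys are overwritten, new keys are
    appended. Returns the record set written to the new parquet file."""
    merged = dict(existing)
    merged.update(batch)
    return merged

# Case 1: unique primary key per row (e.g. a UUID) -> every ingestion is
# an insert, so the latest file carries the old value alongside the new one.
f1 = ingest({}, {"uuid-a": "Tom"})
f2 = ingest(f1, {"uuid-b": "Mickey"})
assert set(f2.values()) == {"Tom", "Mickey"}

# Case 2: reused primary key -> the second batch triggers an update, and
# the new file carries only the updated value, not the old one.
g1 = ingest({}, {1: "Tom", 2: "Jerry"})
g2 = ingest(g1, {1: "Mickey", 3: "Donald"})
assert g2 == {1: "Mickey", 2: "Jerry", 3: "Donald"}
```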

But I can try again if you believe that an incremental query scans all the
parquet files and not just the latest one.
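For what it's worth, my mental model of the incremental query is sketched below in plain Python (a simulation only, not Hudi code; the commit times and the file layout are made up for illustration):

```python
# Every stored record carries a commit time (like Hudi's
# _hoodie_commit_time); an incremental query returns records committed
# after a given begin instant.

# Latest file slice after batch 2: the upsert already replaced (1, Tom).
table = [
    (1, "Mickey", "c2"),  # updated in commit c2
    (2, "Jerry",  "c1"),  # unchanged since commit c1
    (3, "Donald", "c2"),  # inserted in commit c2
]

def incremental_query(records, begin_instant):
    """Return (key, value) for rows committed strictly after begin_instant."""
    return [(k, v) for k, v, t in records if t > begin_instant]

changes = incremental_query(table, "c1")
# Only the post-image of key 1 exists to be scanned, so (1, Tom) can
# never be returned, whichever files the query reads.
```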

On Fri, 29 May 2020 at 10:48 PM, Satish Kotha <[email protected]>
wrote:

> Hi,
>
>
> > Now, when I query I get (1 | Mickey) but I never get (1 | Tom), as it's
> > in the old parquet file. So doesn't the incremental query run on old
> > parquet files?
> >
>
> Could you share the command you are using for the incremental query?
> Hudi requires specific configuration for incremental queries. Please see
> the example here
> <https://hudi.apache.org/docs/docker_demo.html#step-7-b-incremental-query-with-spark-sql>
> and more documentation here
> <https://hudi.apache.org/docs/querying_data.html#spark-incr-query>.
> Please try this and let me know if it works as expected.
>
> Thanks
> Satish
>
> On Fri, May 29, 2020 at 5:18 AM tanujdua <[email protected]> wrote:
>
> > Hi,
> > We have a requirement where we keep an audit_history of every change and
> > sometimes query on it as well. In an RDBMS we have separate tables for
> > audit_history. In Hudi, however, history is created at every ingestion,
> > and I want to leverage that, so I have a question about incremental
> > queries.
> > Does an incremental query run on the latest parquet file or on all the
> > parquet files in the partition? I can see it runs only on the latest
> > parquet file.
> >
> > Let me illustrate what we need. For example, we have data with 2
> > columns - (id | name), where id is the primary key.
> >
> > Batch 1 -
> > Inserted 2 records --> 1 | Tom ; 2 | Jerry
> > A new parquet file is created, say 1.parquet, with these 2 entries.
> >
> > Batch 2 -
> > Inserted 2 records --> 1 | Mickey ; 3 | Donald. Here the record with
> > primary key 1 is updated from Tom to Mickey.
> > A new parquet file is created, say 2.parquet, with the following entries -
> > 1 | Mickey (Record Updated)
> > 2 | Jerry (Record Not changed and retained)
> > 3 | Donald (New Record)
> >
> > Now, when I query I get (1 | Mickey) but I never get (1 | Tom), as it's
> > in the old parquet file. So doesn't the incremental query run on old
> > parquet files?
> >
> > I can use plain vanilla Spark to achieve this, but is there a better way
> > to get the audit history of updated rows using Hudi?
> > 1) Using Spark I can read all the parquet files directly (without Hudi) -
> > spark.read().load(hudiConfig.getBasePath() + hudiConfig.getTableName() +
> > "/*/*/*.parquet");
