Hello

> But I can try again if you believe that incremental query scans through
> all the parquet files and not just the latest one.

Parquet files are selected based on BEGIN_INSTANTTIME_OPT_KEY and
END_INSTANTTIME_OPT_KEY for incremental queries. Also worth noting:
BEGIN_INSTANTTIME is exclusive and END_INSTANTTIME is inclusive. So, for
your example, if BEGIN is set to 0 and END is set to the batch 1 commit
timestamp, then *only* the batch 1 version of the parquet file will be
read.
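For example, with the Spark datasource, something like the minimal sketch
below inside your Spark job (the table path and the end instant are
placeholders for your own values; the option keys are the strings behind
BEGIN_INSTANTTIME_OPT_KEY and END_INSTANTTIME_OPT_KEY):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().getOrCreate();

// Incremental read: returns only records committed in (begin, end].
Dataset<Row> batch1View = spark.read()
    .format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "incremental")
    // BEGIN_INSTANTTIME_OPT_KEY -- exclusive lower bound
    .option("hoodie.datasource.read.begin.instanttime", "0")
    // END_INSTANTTIME_OPT_KEY -- inclusive upper bound; put your
    // batch 1 commit time here (placeholder value below)
    .option("hoodie.datasource.read.end.instanttime", "20200529112100")
    .load("/path/to/hudi/table");

batch1View.show();  // for your example, should show (1 | Tom) and (2 | Jerry)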
Please try this. If this doesn't work, it would be great if you could
share the exact commands you are running; I can try to reproduce and
debug.

Thanks
Satish

On Fri, May 29, 2020 at 11:21 AM tanu dua <[email protected]> wrote:

> Yes, I followed those docs and wrote the queries accordingly.
> I believe the difference is primary key selection: in those examples the
> primary key is always unique (a uuid), which means every ingestion is an
> insert, and hence both old and new records end up in the latest parquet
> file.
> In my case the primary key is not always unique, so an update is
> triggered and the new file has the updated value, not the old one.
>
> But I can try again if you believe that incremental query scans through
> all the parquet files and not just the latest one.
>
> On Fri, 29 May 2020 at 10:48 PM, Satish Kotha <[email protected]> wrote:
>
> > Hi,
> >
> > > Now, when I query I get (1 | Mickey) but I never get (1 | Tom) as
> > > it's in the old parquet file. So doesn't incremental query run on
> > > old parquet files?
> >
> > Could you share the command you are using for the incremental query?
> > Specific config is required by hoodie for doing incremental queries.
> > Please see the example here
> > <https://hudi.apache.org/docs/docker_demo.html#step-7-b-incremental-query-with-spark-sql>
> > and more documentation here
> > <https://hudi.apache.org/docs/querying_data.html#spark-incr-query>.
> > Please try this and let me know if it works as expected.
> >
> > Thanks
> > Satish
> >
> > On Fri, May 29, 2020 at 5:18 AM tanujdua <[email protected]> wrote:
> >
> > > Hi,
> > > We have a requirement to keep an audit history of every change and
> > > sometimes query on that as well. In an RDBMS we have separate
> > > tables for audit history. In Hudi, however, history is created at
> > > every ingestion, and I want to leverage that, so I have a question
> > > on incremental queries.
> > > Does an incremental query run on the latest parquet file or on all
> > > the parquet files in the partition? I can see it runs only on the
> > > latest parquet file.
> > >
> > > Let me illustrate what we need. For example, we have data with two
> > > columns (id | name), where id is the primary key.
> > >
> > > Batch 1 -
> > > Inserted 2 records --> 1 | Tom ; 2 | Jerry
> > > A new parquet file is created, say 1.parquet, with these 2 entries.
> > >
> > > Batch 2 -
> > > Inserted 2 records --> 1 | Mickey ; 3 | Donald.
> > > So here the primary key 1 is updated from Tom to Mickey.
> > > A new parquet file is created, say 2.parquet, with the following
> > > entries -
> > > 1 | Mickey (record updated)
> > > 2 | Jerry (record not changed, retained)
> > > 3 | Donald (new record)
> > >
> > > Now, when I query I get (1 | Mickey) but I never get (1 | Tom) as
> > > it's in the old parquet file. So doesn't incremental query run on
> > > old parquet files?
> > >
> > > I can use plain vanilla Spark to achieve this, but is there a
> > > better way to get the audit history of updated rows using Hudi?
> > > 1) Using Spark I can read all the parquet files directly (without
> > > hoodie):
> > > spark.read().load(hudiConfig.getBasePath() +
> > > hudiConfig.getTableName() + "/*/*/*.parquet");
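PS: on the audit-history question at the bottom of the thread: instead of
globbing the parquet files directly, you could walk the commit timeline
and pull each commit's records with an incremental read. A rough sketch
(this assumes `spark` and the table `basePath` from your setup, and
HoodieDataSourceHelpers from the hudi-spark module; please double-check
the signatures against your Hudi version):

import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hudi.HoodieDataSourceHelpers;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// List every commit on the timeline since the beginning of time.
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
List<String> commits = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, "0");

// Pull each commit's records one at a time; stitched together, these
// incremental slices are the audit history of every row version.
String begin = "0";
for (String commit : commits) {
  Dataset<Row> changes = spark.read()
      .format("org.apache.hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", begin)  // exclusive
      .option("hoodie.datasource.read.end.instanttime", commit)   // inclusive
      .load(basePath);
  changes.show();   // only the records written in this commit
  begin = commit;   // next slice starts right after this commit
}

For your example, the first slice would return (1 | Tom) and (2 | Jerry),
and the second (1 | Mickey) and (3 | Donald), so the (1 | Tom) version is
not lost as long as the older file versions haven't been cleaned yet.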
