Thanks Satish. It worked, and sorry to bother you.
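In case the context is useful, this is roughly the incremental read that worked for me. It is only a sketch: the string option keys are what I take BEGIN_INSTANTTIME_OPT_KEY / END_INSTANTTIME_OPT_KEY (and the query type option) to resolve to per the querying docs, and the end instant below is a placeholder for my real batch 1 commit time:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // spark is an existing SparkSession; basePath is the table base path.
    Dataset<Row> batch1View = spark.read()
        .format("org.apache.hudi")
        .option("hoodie.datasource.query.type", "incremental")
        // BEGIN is exclusive: "000" means "from the very first commit"
        .option("hoodie.datasource.read.begin.instanttime", "000")
        // END is inclusive: placeholder for the actual batch 1 commit time
        .option("hoodie.datasource.read.end.instanttime", "20200529103045")
        .load(basePath);
    batch1View.show();

With END set to the batch 1 commit time, I now get (1 | Tom) back, exactly as you described.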
One last query on this:

1) How much history of commits do we retain by default per partition
(https://hudi.apache.org/docs/configurations.html#withCompactionConfig)?
Does that link apply to MOR only? I have a COW table.
2) If I need to keep the history of commits forever, what do I need to do?
3) On cleanup, does Hudi clean only the log files, or does it clean parquet files as well?

I tried to find this in the wiki but couldn't get much info, so if you have a link please share it. The settings I have been staring at are sketched below. Thanks for all your help!!
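This is only my current reading of the configurations page; the option names come from there, the values are the documented defaults as I understand them, and I don't know whether any of it behaves differently for COW:

    // Hypothetical write, just to show where the retention knobs sit.
    // df is an existing Dataset<Row>; the other required Hudi write
    // options (record key, table name, etc.) are omitted here.
    df.write()
        .format("org.apache.hudi")
        // cleaner: how many commits' worth of old file versions survive
        .option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
        .option("hoodie.cleaner.commits.retained", "10")
        // archival: bounds on how much of the active timeline is kept
        .option("hoodie.keep.min.commits", "20")
        .option("hoodie.keep.max.commits", "30")
        .mode("append")
        .save(basePath);

Is bumping hoodie.cleaner.commits.retained (and the archival bounds) the right lever for question 2), or is there a proper "retain forever" setting I am missing?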
On 2020/05/29 18:55:39, Satish Kotha <[email protected]> wrote:
> Hello
>
> > But I can try again if you believe that incremental query scans
> > through all the parquet files and not just the latest one.
>
> Parquet files are selected based on 'BEGIN_INSTANTTIME_OPT_KEY' and
> 'END_INSTANTTIME_OPT_KEY' for incremental queries. Also, it is worth
> noting that BEGIN_INSTANTTIME is exclusive and END_INSTANTTIME is
> inclusive. So, for your example, if 'BEGIN..' is set to 0 and 'END' is
> set to the batch 1 timestamp, then *only* the batch 1 version of the
> parquet file will be read. Please try this. If it doesn't work, it
> would be great if you can share the exact commands you are running,
> and I can try to reproduce and debug.
>
> Thanks
> Satish
>
> On Fri, May 29, 2020 at 11:21 AM tanu dua <[email protected]> wrote:
>
> > Yes, I followed those examples and wrote the queries accordingly.
> > I believe the difference is the primary key selection: in the
> > examples below, the primary key is always unique (like a uuid),
> > which means that every data ingestion will be an insert, and hence
> > old and new records will both be in the latest parquet file.
> > In my case the primary key is not always unique, so an update will
> > be triggered and the new file will have the updated value, not the
> > old value.
> >
> > But I can try again if you believe that incremental query scans
> > through all the parquet files and not just the latest one.
> >
> > On Fri, 29 May 2020 at 10:48 PM, Satish Kotha
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > > Now, when I query I get (1 | Mickey) but I never get (1 | Tom)
> > > > as it's in the old parquet file. So doesn't incremental query
> > > > run on old parquet files?
> > >
> > > Could you share the command you are using for the incremental
> > > query? A specific config is required by Hudi for doing incremental
> > > queries. Please see an example here
> > > <https://hudi.apache.org/docs/docker_demo.html#step-7-b-incremental-query-with-spark-sql>
> > > and more documentation here
> > > <https://hudi.apache.org/docs/querying_data.html#spark-incr-query>.
> > > Please try this and let me know if it works as expected.
> > >
> > > Thanks
> > > Satish
> > >
> > > On Fri, May 29, 2020 at 5:18 AM tanujdua <[email protected]> wrote:
> > >
> > > > Hi,
> > > > We have a requirement where we keep an audit history of every
> > > > change and sometimes query on it as well. In an RDBMS we have
> > > > separate audit_history tables. However, in Hudi, history is
> > > > created at every ingestion, and I want to leverage that, so I
> > > > have a question on incremental queries.
> > > > Does an incremental query run on the latest parquet file or on
> > > > all the parquet files in the partition? I can see it runs only
> > > > on the latest parquet file.
> > > >
> > > > Let me illustrate what we need. For example, we have data with
> > > > 2 columns (id | name), where id is the primary key.
> > > >
> > > > Batch 1 -
> > > > Inserted 2 records --> 1 | Tom ; 2 | Jerry
> > > > A new parquet file is created, say 1.parquet, with these 2
> > > > entries.
> > > >
> > > > Batch 2 -
> > > > Inserted 2 records --> 1 | Mickey ; 3 | Donald. So here the
> > > > record with primary key 1 is updated from Tom to Mickey.
> > > > A new parquet file is created, say 2.parquet, with the following
> > > > entries -
> > > > 1 | Mickey (record updated)
> > > > 2 | Jerry (record not changed, retained)
> > > > 3 | Donald (new record)
> > > >
> > > > Now, when I query, I get (1 | Mickey) but I never get (1 | Tom),
> > > > as it's in the old parquet file. So doesn't the incremental
> > > > query run on old parquet files?
> > > >
> > > > I can use plain vanilla Spark to achieve this, but is there any
> > > > better way to get the audit history of updated rows using Hudi?
> > > > 1) Using Spark I can read all the parquet files (without Hudi):
> > > > spark.read().load(hudiConfig.getBasePath()
> > > >     + hudiConfig.getTableName() + "/*/*/*.parquet");
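P.S. For completeness, the plain vanilla Spark fallback mentioned at the end of the quoted mail above looks roughly like this on my side (hudiConfig is my own helper class, and the three-level glob matches my partition depth, so adjust it to yours):

    // Reads every parquet file version under the table (both 1.parquet
    // and 2.parquet), so (1 | Tom) and (1 | Mickey) are both visible.
    Dataset<Row> allVersions = spark.read()
        .load(hudiConfig.getBasePath() + hudiConfig.getTableName()
            + "/*/*/*.parquet");
    allVersions.filter("id = 1")
        .orderBy("_hoodie_commit_time")  // Hudi meta column: change order
        .show();

Ordering by the _hoodie_commit_time meta column gives me the change history of a row, but I would still prefer a Hudi-native way if one exists.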
