Hi

1) Please check the 'retainCommits' property. Cleaner properties are
applicable to COW tables too.
2) You could set retainCommits to a very large number. It is obviously
going to be a lot more expensive to retain all versions, so please consider
that when planning.
3) It cleans older versions of parquet files as well.
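
For example, something along these lines with the Spark datasource writer
(an untested sketch; the table name and retention numbers are placeholders,
and the option keys are documented on the configuration page you linked):

  // Untested sketch: raise cleaner retention so older parquet file
  // versions survive cleaning. Assumes an existing Dataset<Row> df and a
  // String basePath. Note that hoodie.keep.min.commits must stay greater
  // than hoodie.cleaner.commits.retained for archival to work correctly.
  df.write()
    .format("hudi")
    .option("hoodie.table.name", "my_table")            // placeholder name
    .option("hoodie.cleaner.commits.retained", "10000") // keep many versions (expensive)
    .option("hoodie.keep.min.commits", "10001")
    .option("hoodie.keep.max.commits", "10002")
    .mode("append")
    .save(basePath);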

> It worked and sorry to bother you.

Great that you were able to figure out incremental reads. Happy to help. Let
me know if you have any other questions.

On Fri, May 29, 2020 at 8:03 PM tanujdua <[email protected]> wrote:

> Thanks Satish. It worked and sorry to bother you.
> One last query on this -
> 1) How much history of commits do we retain by default per partition (
> https://hudi.apache.org/docs/configurations.html#withCompactionConfig
> )? Does this link apply to MOR only? I have a COW table.
> 2) If I need to keep the history of commits forever, what do I need to do?
> 3) On cleanup, does Hudi only clean the log files, or does it clean
> parquet files as well?
>
> I tried to find this in the wiki but couldn't get much info, so if you have
> a link please provide it.
> Thanks for all your help!!
>
>
>
> On 2020/05/29 18:55:39, Satish Kotha <[email protected]>
> wrote:
> > Hello
> >
> > > But I can try again if you believe that incremental query scans through
> > > all the parquet files and not just the latest one.
> > >
> >
> > Parquet files are selected based on 'BEGIN_INSTANTTIME_OPT_KEY' and
> > 'END_INSTANTTIME_OPT_KEY' for incremental queries. Also, it is worth
> > noting that BEGIN_INSTANTTIME is exclusive and END_INSTANTTIME is
> > inclusive. So, for your example, if 'BEGIN..' is set to 0 and 'END' is set
> > to the batch1 timestamp, then *only* the batch1 version of the parquet
> > file will be read. Please try this. If this doesn't work, it would be
> > great if you could share the exact commands you are running. I can try to
> > reproduce and debug.
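> >
> > For example, an untested sketch of the incremental read with the Spark
> > datasource (basePath and batch1CommitTime are placeholders here; the
> > option constants come from org.apache.hudi.DataSourceReadOptions):
> >
> >   // Reads only records committed after instant "0" (exclusive) up to
> >   // and including the batch1 commit time.
> >   Dataset<Row> inc = spark.read()
> >       .format("hudi")
> >       .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(),
> >               DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
> >       .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), "0")
> >       .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY(),
> >               batch1CommitTime)
> >       .load(basePath);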
> >
> > Thanks
> > Satish
> >
> > On Fri, May 29, 2020 at 11:21 AM tanu dua <[email protected]> wrote:
> >
> > > Yes, I followed those examples and wrote the queries accordingly.
> > > I believe the difference is the primary key selection: in the examples
> > > below, the primary key is always unique (like a uuid), which means every
> > > data ingestion will be an insert, and hence both old and new records
> > > will be in the latest parquet file.
> > > In my case the primary key is not always unique, so an update will be
> > > triggered and the new file will have the updated value, not the old one.
> > >
> > > But I can try again if you believe that incremental queries scan through
> > > all the parquet files and not just the latest one.
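> > >
> > > To be concrete, this is roughly how my ingestion is configured (a
> > > simplified, illustrative sketch; 'id' as the record key is specific to
> > > my table, and df/basePath are placeholder names):
> > >
> > >   // Because 'id' is the record key, re-ingesting id=1 becomes an
> > >   // upsert: the new parquet file version holds the updated value,
> > >   // not the old one.
> > >   df.write()
> > >     .format("hudi")
> > >     .option("hoodie.datasource.write.recordkey.field", "id")
> > >     .option("hoodie.datasource.write.operation", "upsert")
> > >     .mode("append")
> > >     .save(basePath);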
> > >
> > > On Fri, 29 May 2020 at 10:48 PM, Satish Kotha <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > >
> > > > > Now, when I query I get (1 | Mickey) but I never get (1 | Tom), as
> > > > > it's in the old parquet file. So doesn't an incremental query run on
> > > > > old parquet files?
> > > > >
> > > >
> > > > Could you share the command you are using for the incremental query?
> > > > Specific config is required by hoodie for doing incremental queries.
> > > > Please see the example here
> > > > <https://hudi.apache.org/docs/docker_demo.html#step-7-b-incremental-query-with-spark-sql>
> > > > and more documentation here
> > > > <https://hudi.apache.org/docs/querying_data.html#spark-incr-query>.
> > > > Please try this and let me know if it works as expected.
> > > >
> > > > Thanks
> > > > Satish
> > > >
> > > > On Fri, May 29, 2020 at 5:18 AM tanujdua <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > > We have a requirement where we keep an audit_history of every change
> > > > > and sometimes query on it as well. In RDBMS we have separate tables
> > > > > for audit_history. However, in HUDI, history is created at every
> > > > > ingestion, and I want to leverage that, so I have a question on
> > > > > incremental queries.
> > > > > Does an incremental query run on the latest parquet file or on all
> > > > > the parquet files in the partition? I can see it runs only on the
> > > > > latest parquet file.
> > > > >
> > > > > Let me illustrate what we need. For example, we have data with 2
> > > > > columns - (id | name), where id is the primary key.
> > > > >
> > > > > Batch 1 -
> > > > > Inserted 2 records --> 1 | Tom ; 2 | Jerry
> > > > > A new parquet file is created, say 1.parquet, with these 2 entries.
> > > > >
> > > > > Batch 2 -
> > > > > Inserted 2 records --> 1 | Mickey ; 3 | Donald. So here the primary
> > > > > key 1 is updated from Tom to Mickey.
> > > > > A new parquet file is created, say 2.parquet, with the following
> > > > > entries -
> > > > > 1 | Mickey (Record Updated)
> > > > > 2 | Jerry (Record Not changed and retained)
> > > > > 3 | Donald (New Record)
> > > > >
> > > > > Now, when I query I get (1 | Mickey) but I never get (1 | Tom), as
> > > > > it's in the old parquet file. So doesn't an incremental query run on
> > > > > old parquet files?
> > > > >
> > > > > I can use plain vanilla Spark to achieve this, but is there any
> > > > > better way to get the audit history of updated rows using HUDI?
> > > > > 1) Using Spark I can read all parquet files (without hoodie) -
> > > > > spark.read().load(hudiConfig.getBasePath() + hudiConfig.getTableName()
> > > > > + "/*/*/*.parquet");
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
