Thanks Satish. It worked, and sorry to bother you.
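In case the context is useful, this is roughly the incremental read that worked for me. It is only a sketch: the string option keys are what I take BEGIN_INSTANTTIME_OPT_KEY / END_INSTANTTIME_OPT_KEY (and the query type option) to resolve to per the querying docs, and the end instant below is a placeholder for my real batch 1 commit time:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // spark is an existing SparkSession; basePath is the table base path.
    Dataset<Row> batch1View = spark.read()
        .format("org.apache.hudi")
        .option("hoodie.datasource.query.type", "incremental")
        // BEGIN is exclusive: "000" means "from the very first commit"
        .option("hoodie.datasource.read.begin.instanttime", "000")
        // END is inclusive: placeholder for the actual batch 1 commit time
        .option("hoodie.datasource.read.end.instanttime", "20200529103045")
        .load(basePath);
    batch1View.show();

With END set to the batch 1 commit time, I now get (1 | Tom) back, exactly as you described.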
One last query on this:

1) How much history of commits do we retain by default per partition
(https://hudi.apache.org/docs/configurations.html#withCompactionConfig)?
Does that link apply to MOR only? I have a COW table.
2) If I need to keep the history of commits forever, what do I need to do?
3) On cleanup, does Hudi clean only the log files, or does it clean parquet files as well?

I tried to find this in the wiki but couldn't get much info, so if you have a link please share it. The settings I have been staring at are sketched below. Thanks for all your help!!
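This is only my current reading of the configurations page; the option names come from there, the values are the documented defaults as I understand them, and I don't know whether any of it behaves differently for COW:

    // Hypothetical write, just to show where the retention knobs sit.
    // df is an existing Dataset<Row>; the other required Hudi write
    // options (record key, table name, etc.) are omitted here.
    df.write()
        .format("org.apache.hudi")
        // cleaner: how many commits' worth of old file versions survive
        .option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
        .option("hoodie.cleaner.commits.retained", "10")
        // archival: bounds on how much of the active timeline is kept
        .option("hoodie.keep.min.commits", "20")
        .option("hoodie.keep.max.commits", "30")
        .mode("append")
        .save(basePath);

Is bumping hoodie.cleaner.commits.retained (and the archival bounds) the right lever for question 2), or is there a proper "retain forever" setting I am missing?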
On 2020/05/29 18:55:39, Satish Kotha <[email protected]> wrote:
> Hello
>
> > But I can try again if you believe that incremental query scans
> > through all the parquet files and not just the latest one.
>
> Parquet files are selected based on 'BEGIN_INSTANTTIME_OPT_KEY' and
> 'END_INSTANTTIME_OPT_KEY' for incremental queries. Also, it is worth
> noting that BEGIN_INSTANTTIME is exclusive and END_INSTANTTIME is
> inclusive. So, for your example, if 'BEGIN..' is set to 0 and 'END' is
> set to the batch 1 timestamp, then *only* the batch 1 version of the
> parquet file will be read. Please try this. If it doesn't work, it
> would be great if you can share the exact commands you are running,
> and I can try to reproduce and debug.
>
> Thanks
> Satish
>
> On Fri, May 29, 2020 at 11:21 AM tanu dua <[email protected]> wrote:
>
> > Yes, I followed those examples and wrote the queries accordingly.
> > I believe the difference is the primary key selection: in the
> > examples below, the primary key is always unique (like a uuid),
> > which means that every data ingestion will be an insert, and hence
> > old and new records will both be in the latest parquet file.
> > In my case the primary key is not always unique, so an update will
> > be triggered and the new file will have the updated value, not the
> > old value.
> >
> > But I can try again if you believe that incremental query scans
> > through all the parquet files and not just the latest one.
> >
> > On Fri, 29 May 2020 at 10:48 PM, Satish Kotha
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > > Now, when I query I get (1 | Mickey) but I never get (1 | Tom)
> > > > as it's in the old parquet file. So doesn't incremental query
> > > > run on old parquet files?
> > >
> > > Could you share the command you are using for the incremental
> > > query? A specific config is required by Hudi for doing incremental
> > > queries. Please see an example here
> > > <https://hudi.apache.org/docs/docker_demo.html#step-7-b-incremental-query-with-spark-sql>
> > > and more documentation here
> > > <https://hudi.apache.org/docs/querying_data.html#spark-incr-query>.
> > > Please try this and let me know if it works as expected.
> > >
> > > Thanks
> > > Satish
> > >
> > > On Fri, May 29, 2020 at 5:18 AM tanujdua <[email protected]> wrote:
> > >
> > > > Hi,
> > > > We have a requirement where we keep an audit history of every
> > > > change and sometimes query on it as well. In an RDBMS we have
> > > > separate audit_history tables. However, in Hudi, history is
> > > > created at every ingestion, and I want to leverage that, so I
> > > > have a question on incremental queries.
> > > > Does an incremental query run on the latest parquet file or on
> > > > all the parquet files in the partition? I can see it runs only
> > > > on the latest parquet file.
> > > >
> > > > Let me illustrate what we need. For example, we have data with
> > > > 2 columns (id | name), where id is the primary key.
> > > >
> > > > Batch 1 -
> > > > Inserted 2 records --> 1 | Tom ; 2 | Jerry
> > > > A new parquet file is created, say 1.parquet, with these 2
> > > > entries.
> > > >
> > > > Batch 2 -
> > > > Inserted 2 records --> 1 | Mickey ; 3 | Donald. So here the
> > > > record with primary key 1 is updated from Tom to Mickey.
> > > > A new parquet file is created, say 2.parquet, with the following
> > > > entries -
> > > > 1 | Mickey (record updated)
> > > > 2 | Jerry (record not changed, retained)
> > > > 3 | Donald (new record)
> > > >
> > > > Now, when I query, I get (1 | Mickey) but I never get (1 | Tom),
> > > > as it's in the old parquet file. So doesn't the incremental
> > > > query run on old parquet files?
> > > >
> > > > I can use plain vanilla Spark to achieve this, but is there any
> > > > better way to get the audit history of updated rows using Hudi?
> > > > 1) Using Spark I can read all the parquet files (without Hudi):
> > > > spark.read().load(hudiConfig.getBasePath()
> > > >     + hudiConfig.getTableName() + "/*/*/*.parquet");
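P.S. For completeness, the plain vanilla Spark fallback mentioned at the end of the quoted mail above looks roughly like this on my side (hudiConfig is my own helper class, and the three-level glob matches my partition depth, so adjust it to yours):

    // Reads every parquet file version under the table (both 1.parquet
    // and 2.parquet), so (1 | Tom) and (1 | Mickey) are both visible.
    Dataset<Row> allVersions = spark.read()
        .load(hudiConfig.getBasePath() + hudiConfig.getTableName()
            + "/*/*/*.parquet");
    allVersions.filter("id = 1")
        .orderBy("_hoodie_commit_time")  // Hudi meta column: change order
        .show();

Ordering by the _hoodie_commit_time meta column gives me the change history of a row, but I would still prefer a Hudi-native way if one exists.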
