Hello

> But I can try again if you believe that incremental query scans through
> all the parquet files and not just the latest one.

Parquet files are selected based on BEGIN_INSTANTTIME_OPT_KEY and
END_INSTANTTIME_OPT_KEY for incremental queries. Also worth noting:
BEGIN_INSTANTTIME is exclusive and END_INSTANTTIME is inclusive. So, for
your example, if BEGIN is set to 0 and END is set to the batch 1 commit
timestamp, then *only* the batch 1 version of the parquet file will be
read.
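For example, with the Spark datasource, something like the minimal sketch
below inside your Spark job (the table path and the end instant are
placeholders for your own values; the option keys are the strings behind
BEGIN_INSTANTTIME_OPT_KEY and END_INSTANTTIME_OPT_KEY):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().getOrCreate();

// Incremental read: returns only records committed in (begin, end].
Dataset<Row> batch1View = spark.read()
    .format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "incremental")
    // BEGIN_INSTANTTIME_OPT_KEY -- exclusive lower bound
    .option("hoodie.datasource.read.begin.instanttime", "0")
    // END_INSTANTTIME_OPT_KEY -- inclusive upper bound; put your
    // batch 1 commit time here (placeholder value below)
    .option("hoodie.datasource.read.end.instanttime", "20200529112100")
    .load("/path/to/hudi/table");

batch1View.show();  // for your example, should show (1 | Tom) and (2 | Jerry)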
Please try this. If this doesn't work, it would be great if you could
share the exact commands you are running; I can try to reproduce and
debug.

Thanks
Satish

On Fri, May 29, 2020 at 11:21 AM tanu dua <[email protected]> wrote:

> Yes, I followed those docs and wrote the queries accordingly.
> I believe the difference is primary key selection: in those examples the
> primary key is always unique (a uuid), which means every ingestion is an
> insert, and hence both old and new records end up in the latest parquet
> file.
> In my case the primary key is not always unique, so an update is
> triggered and the new file has the updated value, not the old one.
>
> But I can try again if you believe that incremental query scans through
> all the parquet files and not just the latest one.
>
> On Fri, 29 May 2020 at 10:48 PM, Satish Kotha <[email protected]> wrote:
>
> > Hi,
> >
> > > Now, when I query I get (1 | Mickey) but I never get (1 | Tom) as
> > > it's in the old parquet file. So doesn't incremental query run on
> > > old parquet files?
> >
> > Could you share the command you are using for the incremental query?
> > Specific config is required by hoodie for doing incremental queries.
> > Please see the example here
> > <https://hudi.apache.org/docs/docker_demo.html#step-7-b-incremental-query-with-spark-sql>
> > and more documentation here
> > <https://hudi.apache.org/docs/querying_data.html#spark-incr-query>.
> > Please try this and let me know if it works as expected.
> >
> > Thanks
> > Satish
> >
> > On Fri, May 29, 2020 at 5:18 AM tanujdua <[email protected]> wrote:
> >
> > > Hi,
> > > We have a requirement to keep an audit history of every change and
> > > sometimes query on that as well. In an RDBMS we have separate
> > > tables for audit history. In Hudi, however, history is created at
> > > every ingestion, and I want to leverage that, so I have a question
> > > on incremental queries.
> > > Does an incremental query run on the latest parquet file or on all
> > > the parquet files in the partition? I can see it runs only on the
> > > latest parquet file.
> > >
> > > Let me illustrate what we need. For example, we have data with two
> > > columns (id | name), where id is the primary key.
> > >
> > > Batch 1 -
> > > Inserted 2 records --> 1 | Tom ; 2 | Jerry
> > > A new parquet file is created, say 1.parquet, with these 2 entries.
> > >
> > > Batch 2 -
> > > Inserted 2 records --> 1 | Mickey ; 3 | Donald.
> > > So here the primary key 1 is updated from Tom to Mickey.
> > > A new parquet file is created, say 2.parquet, with the following
> > > entries -
> > > 1 | Mickey (record updated)
> > > 2 | Jerry (record not changed, retained)
> > > 3 | Donald (new record)
> > >
> > > Now, when I query I get (1 | Mickey) but I never get (1 | Tom) as
> > > it's in the old parquet file. So doesn't incremental query run on
> > > old parquet files?
> > >
> > > I can use plain vanilla Spark to achieve this, but is there a
> > > better way to get the audit history of updated rows using Hudi?
> > > 1) Using Spark I can read all the parquet files directly (without
> > > hoodie):
> > > spark.read().load(hudiConfig.getBasePath() +
> > > hudiConfig.getTableName() + "/*/*/*.parquet");
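PS: on the audit-history question at the bottom of the thread: instead of
globbing the parquet files directly, you could walk the commit timeline
and pull each commit's records with an incremental read. A rough sketch
(this assumes `spark` and the table `basePath` from your setup, and
HoodieDataSourceHelpers from the hudi-spark module; please double-check
the signatures against your Hudi version):

import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hudi.HoodieDataSourceHelpers;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// List every commit on the timeline since the beginning of time.
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
List<String> commits = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, "0");

// Pull each commit's records one at a time; stitched together, these
// incremental slices are the audit history of every row version.
String begin = "0";
for (String commit : commits) {
  Dataset<Row> changes = spark.read()
      .format("org.apache.hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", begin)  // exclusive
      .option("hoodie.datasource.read.end.instanttime", commit)   // inclusive
      .load(basePath);
  changes.show();   // only the records written in this commit
  begin = commit;   // next slice starts right after this commit
}

For your example, the first slice would return (1 | Tom) and (2 | Jerry),
and the second (1 | Mickey) and (3 | Donald), so the (1 | Tom) version is
not lost as long as the older file versions haven't been cleaned yet.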
