Great!! Got it working!!

    'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',

Thank you.

On Thu, Jul 16, 2020 at 7:10 PM Adam Feldman <[email protected]> wrote:

> Hi Sivaprakash,
>
> To be able to specify multiple keys in comma-separated notation, you must
> also set KEYGENERATOR_CLASS_OPT_KEY to classOf[ComplexKeyGenerator].getName.
> Please see the description here:
> https://hudi.apache.org/docs/writing_data.html#datasource-writer
>
> Note: RECORDKEY_FIELD_OPT_KEY is the key variable mapped to the
> hoodie.datasource.write.recordkey.field configuration.
>
> Thanks,
> Adam Feldman
>
> On Thu, Jul 16, 2020 at 1:00 PM Sivaprakash <[email protected]> wrote:
>
> > Looks like this property does the trick:
> >
> > Property: hoodie.datasource.write.recordkey.field, Default: uuid
> > Record key field. Value to be used as the recordKey component of HoodieKey.
> > Actual value will be obtained by invoking .toString() on the field value.
> > Nested fields can be specified using the dot notation, e.g.: a.b.c
> >
> > However, I couldn't provide more than one column like this... COL1.COL2
> >
> > 'hoodie.datasource.write.recordkey.field': 'COL1.COL2'
> >
> > Is anything wrong with the syntax? (I tried with a comma as well.)
> >
> > On Thu, Jul 16, 2020 at 6:41 PM Sivaprakash <[email protected]> wrote:
> >
> > > Hello Balaji,
> > >
> > > Thank you for your info!
> > >
> > > I tried those options, but here is what I find (I'm trying to understand
> > > how Hudi internally manages its files).
> > >
> > > First Write
> > >
> > > 1.
> > >
> > > ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
> > > ('NR002', 'YYYYTRE', 'YYYYTRE_445343')
> > > ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
> > >
> > > Commit time for all the records: 20200716212533
> > >
> > > 2.
> > >
> > > ('NR001', 'YXXXTRE', 'YXXXTRE_445343')
> > > ('NR002', 'ZYYYTRE', 'ZYYYTRE_445343')
> > > ('NR003', 'YZZZTRE', 'YZZZTRE_445343')
> > >
> > > (Only one record changed in my new dataset; the other two records are the
> > > same as in 1. But after a snapshot/incremental read I see that the commit
> > > time is updated for all 3 records.)
> > >
> > > Commit time for all the records: 20200716214544
> > >
> > > - Does it mean that Hudi re-creates all 3 records again? I thought it
> > >   would create only the 2nd record.
> > > - Trying to understand the storage volume efficiency here.
> > > - Does some configuration have to be enabled to fix this?
> > >
> > > Configuration that I use:
> > >
> > > - COPY_ON_WRITE, Append, Upsert
> > > - The first column (NR001) is configured as
> > >   hoodie.datasource.write.recordkey.field
> > >
> > > On Thu, Jul 16, 2020 at 6:10 PM Balaji Varadarajan <[email protected]> wrote:
> > >
> > > > Hi Sivaprakash,
> > > >
> > > > Uniqueness of records is determined by the record key you specify to Hudi.
> > > > Hudi supports filtering out existing records (by record key). By default,
> > > > it would upsert all incoming records.
> > > >
> > > > Please look at
> > > > https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput
> > > > for information on how to dedupe records based on record key.
> > > >
> > > > Balaji.V
> > > >
> > > > On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash <[email protected]> wrote:
> > > >
> > > > This might be a basic question - I'm experimenting with Hudi (PySpark). I
> > > > have used the Insert/Upsert options to write delta into my data lake.
> > > > However, one thing is not clear to me:
> > > >
> > > > Step 1: I write 50 records.
> > > > Step 2: I write 50 records, out of which only *10 records have been
> > > > changed* (I'm using upsert mode and tried with MERGE_ON_READ as well as
> > > > COPY_ON_WRITE).
> > > > Step 3: I was expecting only 10 records to be written, but it writes the
> > > > whole 50 records. Is this normal behaviour? Does that mean I need to
> > > > determine the delta myself and write only those records?
> > > >
> > > > Am I missing something?

--
- Prakash.
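[Editor's note] For readers who want to reproduce the setup the thread converged on, here is a minimal PySpark sketch of an upsert with a composite record key (COL1, COL2) and the ComplexKeyGenerator on a COPY_ON_WRITE table. The table name, base path, input source, and the precombine/partition columns are placeholders, not details from the thread.

    from pyspark.sql import SparkSession

    # Assumes the Hudi Spark bundle and spark-avro are already on the Spark
    # classpath (e.g. via --packages); how they get there depends on your setup.
    spark = SparkSession.builder.appName("hudi-composite-key-sketch").getOrCreate()

    # Placeholder input; in the thread this would be the incoming batch of records.
    df = spark.read.parquet("/tmp/incoming_batch")

    hudi_options = {
        "hoodie.table.name": "my_table",  # placeholder table name
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        # Composite record key: comma-separated columns plus the ComplexKeyGenerator,
        # exactly as worked out above.
        "hoodie.datasource.write.recordkey.field": "COL1,COL2",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
        # Column used to pick the newest record when two inputs share a key (placeholder name).
        "hoodie.datasource.write.precombine.field": "updated_at",
        # Partition column (placeholder); adjust or drop for an unpartitioned table.
        "hoodie.datasource.write.partitionpath.field": "partition_col",
    }

    (df.write
       .format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("/tmp/hudi/my_table"))  # placeholder base path

Save mode "append" together with operation "upsert" corresponds to the COPY_ON_WRITE / Append / Upsert combination described in the thread.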
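[Editor's note] On the commit-time observation: as Balaji points out, Hudi matches incoming records by record key and by default upserts every record it receives, so all three rows pick up the new commit time even though only one payload actually changed. If the goal is to pull back only the rows written since a given commit, an incremental query can be used. The sketch below uses the option names from the current datasource docs; older Hudi releases used slightly different keys, so treat them as assumptions to verify against your version. The base path is the same placeholder as above, and the begin instant is the first commit time reported in the thread.

    # Incremental read: return only records written by commits after the given instant.
    incremental_options = {
        "hoodie.datasource.query.type": "incremental",
        # Commits strictly after this instant are returned; here, the first commit
        # time mentioned in the thread.
        "hoodie.datasource.read.begin.instanttime": "20200716212533",
    }

    changed = (spark.read
               .format("hudi")
               .options(**incremental_options)
               .load("/tmp/hudi/my_table"))  # same placeholder base path as above

    changed.select("_hoodie_commit_time", "COL1", "COL2").show()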
