santas-little-helper-13 opened a new issue #2277:
URL: https://github.com/apache/hudi/issues/2277


   Hi,
   
   I am working with Hudi in AWS Glue and have a problem with Hudi updates.
   
   I have one Glue job that inserts data into Hudi parquet files: it reads 
data from a Glue table, does some processing, gets the max ID_key from the 
already existing data, and adds it to each row number so that ID_key is 
unique across the whole table.
   Now I have a second Glue job that reads from that Hudi table:
   
   `hudiDF = spark.read.format("hudi").load('s3://prct-parquet-tgt/test_task1' 
+ "/*")`
   
   limit it to just one record, and change one column plus the column 
upd_ind, which is the precombine field (all records have upd_ind = 0 by 
default):
   
   `updateDF = hudiDF.limit(1).withColumn('sequence', 
lit('new_value')).withColumn('upd_ind', lit(1))`
   
   Then I define the Hudi write options:
   
   ```
   hoodie_write_options = {
        'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
        'hoodie.parquet.compression.codec': 'snappy',
        'hoodie.table.name': 'test_task1',
        'hoodie.datasource.write.recordkey.field': 'ID_key',
        'hoodie.datasource.write.hive_style_partitioning': True,
        'hoodie.datasource.write.table.name': 'test_task1',
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.precombine.field': 'upd_ind', 
        'hoodie.datasource.write.insert.drop.duplicates': True,
        'hoodie.datasource.write.partitionpath.field': "datehour",
        'hoodie.upsert.shuffle.parallelism': 8,
        'hoodie.insert.shuffle.parallelism': 8,
        'hoodie.datasource.write.keygenerator.class': 
'org.apache.hudi.keygen.ComplexKeyGenerator',
        'hoodie.parquet.small.file.limit': 0
   }
   ```
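   For intuition, the upsert behaviour these options ask for (match incoming rows to existing ones on the record key, and let the row with the larger precombine value win) can be sketched in plain Python. This is only an illustration of the intended semantics, not Hudi's actual implementation; in particular, tie-breaking on equal precombine values depends on the payload class:
   
   ```python
   # Sketch of upsert-with-precombine semantics:
   # rows are plain dicts, keyed on "ID_key", ordered by "upd_ind".
   def upsert(existing, incoming, key="ID_key", precombine="upd_ind"):
       table = {row[key]: row for row in existing}
       for row in incoming:
           current = table.get(row[key])
           # Incoming row replaces the stored one only if its
           # precombine value is at least as large (assumed tie rule).
           if current is None or row[precombine] >= current[precombine]:
               table[row[key]] = row
       return list(table.values())
   ```
   
   Under these semantics, an incoming row with upd_ind = 1 should replace the stored row with the same ID_key and upd_ind = 0, while all other rows stay untouched.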
   
   and write the updated row:
   
   
`updateDF.write.format('hudi').options(**hoodie_write_options).mode('append').save('s3://prct-parquet-tgt/test_task1')`
   
   The problem is that the record that gets updated is random and has no 
connection to the record shown in the Glue job.
   If I filter for one specific record instead, no update is done at all:
   
   `updateDF = hudiDF.filter(col('ID_key')==64777).withColumn('sequence', 
lit('new_value')).withColumn('upd_ind', lit(1))`
   
   I need to update the exact record that I specify. Please help.
   

