AkshayChan opened a new issue #2901:
URL: https://github.com/apache/hudi/issues/2901
When using multiple primary keys and multiple partition keys, Hudi simply
inserts a new record instead of updating the existing one when we change some
fields/columns in the record (other than the primary key or precombine field).
We are writing the data to Amazon S3 and visualizing it with Amazon Athena.
Please find our Hudi configuration below:
`primary_keys` and `partition_keys` are Python lists containing the respective
key column names; `args['precombine_field']` and `args['database']` are AWS
Glue job parameters (Python strings), and the `table` variable is a Python
string as well. `insert_parallelism` is an integer.
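For illustration (the values below are hypothetical, not from our job), Hudi expects multiple record-key and partition fields to be passed as a single comma-separated string, which is what the `','.join(...)` calls in the config produce:

```python
# Hypothetical example values for the lists described above
primary_keys = ['customer_id', 'order_id']
partition_keys = ['year', 'month']

# Hudi takes multiple record-key / partition fields as one
# comma-separated string.
record_key_field = ','.join(primary_keys)        # 'customer_id,order_id'
partition_path_field = ','.join(partition_keys)  # 'year,month'
print(record_key_field)
print(partition_path_field)
```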
```
commonConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': args['precombine_field'],
    'hoodie.datasource.write.recordkey.field': ','.join(primary_keys),
    'hoodie.table.name': table,
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': args['database'],
    'hoodie.datasource.write.partitionpath.field': ','.join(partition_keys),
    'hoodie.datasource.hive_sync.partition_fields': ','.join(partition_keys),
    'hoodie.datasource.hive_sync.table': table,
    'hoodie.datasource.hive_sync.enable': 'true',
    'path': target_path
}

partitionDataConfig = {
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator'
}

initLoadConfig = {
    'hoodie.bulkinsert.shuffle.parallelism': insert_parallelism,
    'hoodie.datasource.write.operation': 'bulk_insert'
}
```
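For reference, the write mode is controlled by `hoodie.datasource.write.operation`: `bulk_insert` only appends records, while `upsert` is the operation that updates existing records by record key. A minimal sketch of what an update-run config could look like (the `updateConfig` name and parallelism value are ours, not from the job):

```python
# Hypothetical update-run config; 'upsert' is Hudi's standard
# operation for updating existing records by record key.
updateConfig = {
    'hoodie.upsert.shuffle.parallelism': 4,  # illustrative value
    'hoodie.datasource.write.operation': 'upsert'
}

# When dicts are merged with {**a, **b}, later dicts win on key
# collisions, so merging updateConfig last overrides an earlier
# 'bulk_insert' setting.
merged = {**{'hoodie.datasource.write.operation': 'bulk_insert'}, **updateConfig}
print(merged['hoodie.datasource.write.operation'])  # 'upsert'
```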
We have also tried adding the Bloom index configurations below to
`commonConfig`, without any luck.
```
'hoodie.index.type':'GLOBAL_BLOOM',
'hoodie.bloom.index.update.partition.path':'true',
```
This is how we write the Spark DataFrame with the updated records to S3:
```
combinedRefinedConf = {**commonRefinedConfig, **partitionRefinedDataConfig, **initLoadRefinedConfig}

glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df_updated, glueContext, "df_updated"),
    connection_type="marketplace.spark",
    connection_options=combinedRefinedConf
)
```
We are using the AWS Glue Connector for Apache Hudi from the AWS Glue
Studio Marketplace.
* Hudi version : 0.5.3
* Spark version : 2.4
* AWS Glue version : 2.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]