AkshayChan opened a new issue #2901:
URL: https://github.com/apache/hudi/issues/2901
When using multiple primary keys and multiple partition keys, Hudi simply
inserts a new record instead of updating the existing one when we change some
fields/columns in the record (other than the primary key or precombine field).
We are writing the data to Amazon S3 and visualizing it with Amazon Athena.
Please find our Hudi configuration below:
`primary_keys` and `partition_keys` are Python lists containing the respective
key column names; `args['precombine_field']` and `args['database']` are AWS
Glue job parameters (Python strings), and the `table` variable is a Python
string as well. `insert_parallelism` is an integer.
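For illustration (the values below are hypothetical, not from our job), Hudi expects multiple record-key and partition fields to be passed as a single comma-separated string, which is what the `','.join(...)` calls in the config produce:

```python
# Hypothetical example values for the lists described above
primary_keys = ['customer_id', 'order_id']
partition_keys = ['year', 'month']

# Hudi takes multiple record-key / partition fields as one
# comma-separated string.
record_key_field = ','.join(primary_keys)        # 'customer_id,order_id'
partition_path_field = ','.join(partition_keys)  # 'year,month'
print(record_key_field)
print(partition_path_field)
```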
```
commonConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': args['precombine_field'],
    'hoodie.datasource.write.recordkey.field': ','.join(primary_keys),
    'hoodie.table.name': table,
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': args['database'],
    'hoodie.datasource.write.partitionpath.field': ','.join(partition_keys),
    'hoodie.datasource.hive_sync.partition_fields': ','.join(partition_keys),
    'hoodie.datasource.hive_sync.table': table,
    'hoodie.datasource.hive_sync.enable': 'true',
    'path': target_path
}

partitionDataConfig = {
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator'
}

initLoadConfig = {
    'hoodie.bulkinsert.shuffle.parallelism': insert_parallelism,
    'hoodie.datasource.write.operation': 'bulk_insert'
}
```
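For reference, the write mode is controlled by `hoodie.datasource.write.operation`: `bulk_insert` only appends records, while `upsert` is the operation that updates existing records by record key. A minimal sketch of what an update-run config could look like (the `updateConfig` name and parallelism value are ours, not from the job):

```python
# Hypothetical update-run config; 'upsert' is Hudi's standard
# operation for updating existing records by record key.
updateConfig = {
    'hoodie.upsert.shuffle.parallelism': 4,  # illustrative value
    'hoodie.datasource.write.operation': 'upsert'
}

# When dicts are merged with {**a, **b}, later dicts win on key
# collisions, so merging updateConfig last overrides an earlier
# 'bulk_insert' setting.
merged = {**{'hoodie.datasource.write.operation': 'bulk_insert'}, **updateConfig}
print(merged['hoodie.datasource.write.operation'])  # 'upsert'
```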
We have also tried adding the Bloom index configurations below to
`commonConfig`, without any luck.
```
'hoodie.index.type':'GLOBAL_BLOOM',
'hoodie.bloom.index.update.partition.path':'true',
```
This is how we write the Spark DataFrame with the updated records to S3:
```
combinedRefinedConf = {**commonRefinedConfig, **partitionRefinedDataConfig, **initLoadRefinedConfig}

glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df_updated, glueContext, "df_updated"),
    connection_type="marketplace.spark",
    connection_options=combinedRefinedConf
)
```
We are using the AWS Glue Connector for Apache Hudi from the AWS Glue
Studio Marketplace.
* Hudi version : 0.5.3
* Spark version : 2.4
* AWS Glue version : 2.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]