alexone95 opened a new issue, #18272:
URL: https://github.com/apache/hudi/issues/18272
### Bug Description
**What happened:**
We are facing a problem in a data platform built on AWS cloud infrastructure. After upgrading our EMR cluster from version 6.9 to 7.10, we are experiencing an issue with CDC-updated Hudi tables stored on S3: when reading the data through Redshift procedures, some records are not returned even though their Hudi commit time is earlier than the query execution time.
**What you expected:**
Records whose Hudi commit time is earlier than the query time should be read and returned correctly by Redshift (the affected records probably fall under the same partition).
**Steps to reproduce:**
1. Upgrade the EMR cluster from version 6.9 to 7.10.
2. Run a PySpark script that implements CDC logic and writes data to S3
(Hudi tables).
3. While the PySpark CDC job is running (or after it completes), query the
same data using Redshift procedures and observe that some records are not
returned despite having a commit time earlier than the query time.
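The key premise above (commit time earlier than query time) can be sanity-checked against the table's `.hoodie` timeline, where each instant filename encodes the commit timestamp. A minimal sketch, assuming the millisecond-precision `yyyyMMddHHmmssSSS` instant format used by recent Hudi releases (older releases used second precision); `commit_before_query` is a hypothetical helper, not part of Hudi:

```python
from datetime import datetime

def commit_before_query(instant: str, query_time: datetime) -> bool:
    """Return True if a Hudi timeline instant predates the query time.

    Hudi instants are named like '20240101120000000.commit'
    (yyyyMMddHHmmssSSS); older releases used yyyyMMddHHmmss.
    """
    # Pick the parse format based on whether milliseconds are present.
    fmt = '%Y%m%d%H%M%S%f' if len(instant) > 14 else '%Y%m%d%H%M%S'
    commit_time = datetime.strptime(instant, fmt)
    return commit_time < query_time
```

Listing the instant files under `s3://<table-path>/.hoodie/` and running them through a check like this would confirm which commits should have been visible to the Redshift query.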
Type of write operation in PySpark:

```python
df_upsert_u_clean.write.format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .options(**hudiOptions_dfkkop) \
    .mode('append') \
    .save(S3_OUTPUT_PATH)
```
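To narrow down whether the missing records actually exist in the table as of a given instant, a Hudi time-travel read from Spark can help. A minimal sketch of the read options, assuming the Hudi 0.15.x datasource option names (`as.of.instant` is Hudi's documented time-travel option); `time_travel_read_options` is a hypothetical helper:

```python
def time_travel_read_options(as_of_instant: str) -> dict:
    """Build Spark read options for a Hudi snapshot as of a given instant.

    'as.of.instant' restricts the snapshot to commits at or before the
    given instant (formats 'yyyyMMddHHmmssSSS' or 'yyyy-MM-dd HH:mm:ss').
    """
    return {
        'hoodie.datasource.query.type': 'snapshot',
        'as.of.instant': as_of_instant,
    }

# Usage (assuming an active SparkSession and the S3_OUTPUT_PATH above):
# df = (spark.read.format('org.apache.hudi')
#           .options(**time_travel_read_options('20240101000000000'))
#           .load(S3_OUTPUT_PATH))
```

If the records are returned by a time-travel read at the relevant instant but not by Redshift, that would point at the Redshift/Spectrum integration rather than the Hudi write path.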
### Environment
**Hudi version:** 0.15.0-amzn-7
**Query engine:** PySpark
**Relevant configs:**

```python
hudiOptions_dfkkop = {
    'hoodie.table.name': TABLENAME,
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': ','.join(key_cols),
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.partitionpath.field': 'PATH_KEY',
    'hoodie.datasource.write.precombine.field': 'TSTMP',
    'hoodie.index.type': 'BLOOM',
    'hoodie.simple.index.update.partition.path': 'true',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': TABLENAME,
    'hoodie.datasource.hive_sync.database': 'ods',
    'hoodie.datasource.hive_sync.partition_fields': 'PATH_KEY',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.copyonwrite.record.size.estimate': 3761,
    'hoodie.parquet.small.file.limit': 104857600,
    'hoodie.parquet.max.file.size': 120000000,
    'hoodie.datasource.write.schema.evolution.enable': 'true',
}
```
### Logs and Stack Trace
_No response_