alexone95 opened a new issue, #18272:
URL: https://github.com/apache/hudi/issues/18272

   ### Bug Description
   
   What happened:
   We run a data platform on AWS cloud infrastructure. After upgrading the EMR cluster from version 6.9 to 7.10, we are experiencing an issue with CDC-updated Hudi tables stored on S3. When reading the data through Redshift procedures, some records are not returned even though their Hudi commit time is earlier than the query execution time.
   
   What you expected:
   Records with a Hudi commit time earlier than the query time should be read and returned correctly by Redshift (the missing records are probably under the same partition).
   
   Steps to reproduce:
   
   1. Upgrade the EMR cluster from version 6.9 to 7.10.
   2. Run a PySpark script that implements CDC logic and writes data to S3 
(Hudi tables).
   3. While the PySpark CDC job is running (or after it completes), query the 
same data using Redshift procedures and observe that some records are not 
returned despite having a commit time earlier than the query time.
   
   Type of write operation in PySpark:

   ```python
   df_upsert_u_clean.write.format('org.apache.hudi') \
       .option('hoodie.datasource.write.operation', 'upsert') \
       .options(**hudiOptions_dfkkop) \
       .mode('append') \
       .save(S3_OUTPUT_PATH)
   ```
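   One way to narrow the problem down is to compare each record's `_hoodie_commit_time` metadata column against the query timestamp directly, outside of Redshift. A minimal, Spark-free sketch (assuming the 17-digit `yyyyMMddHHmmssSSS` instant format used by recent Hudi releases; older tables use the 14-digit `yyyyMMddHHmmss` form — these helpers are illustrative, not part of Hudi):

   ```python
   from datetime import datetime

   def instant_to_datetime(instant: str) -> datetime:
       """Parse a Hudi instant (commit) time string into a naive datetime.

       Recent Hudi releases format instants as yyyyMMddHHmmssSSS (with
       milliseconds); older tables use yyyyMMddHHmmss. Whether the instant
       is in local time or UTC depends on the table's
       hoodie.table.timeline.timezone setting (LOCAL by default).
       """
       if len(instant) == 17:
           # %f expects microseconds, so pad the 3-digit millis to 6 digits
           return datetime.strptime(instant + "000", "%Y%m%d%H%M%S%f")
       return datetime.strptime(instant, "%Y%m%d%H%M%S")

   def committed_before(instant: str, query_time: datetime) -> bool:
       """True if a record's _hoodie_commit_time precedes the query time."""
       return instant_to_datetime(instant) < query_time
   ```

   Records for which `committed_before(...)` is true but that are absent from the Redshift result would confirm the symptom described above.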
   
   
   
   ### Environment
   
   **Hudi version:** 0.15.0-amzn-7
   **Query engine:** PySpark (writer); Redshift procedures (reader)
   **Relevant configs:**

   ```python
   {
       'hoodie.table.name': TABLENAME,
       'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
       'hoodie.datasource.write.recordkey.field': ','.join(key_cols),
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
       'hoodie.datasource.write.partitionpath.field': 'PATH_KEY',
       'hoodie.datasource.write.precombine.field': 'TSTMP',
       'hoodie.index.type': 'BLOOM',
       'hoodie.simple.index.update.partition.path': 'true',
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.table': TABLENAME,
       'hoodie.datasource.hive_sync.database': 'ods',
       'hoodie.datasource.hive_sync.partition_fields': 'PATH_KEY',
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       'hoodie.datasource.hive_sync.use_jdbc': 'false',
       'hoodie.datasource.hive_sync.mode': 'hms',
       'hoodie.copyonwrite.record.size.estimate': 3761,
       'hoodie.parquet.small.file.limit': 104857600,
       'hoodie.parquet.max.file.size': 120000000,
       'hoodie.datasource.write.schema.evolution.enable': 'true'
   }
   ```
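   To confirm which commits had actually completed before the Redshift query ran, the table's timeline on S3 can be inspected. A local-filesystem sketch (assuming the Hudi 0.x timeline layout, where each completed commit on a copy-on-write table leaves an `<instant>.commit` file under `.hoodie/`; the helper name is hypothetical, and on S3 you would list the same prefix with boto3 instead of `os.listdir`):

   ```python
   import os

   def completed_commits(table_path: str) -> list[str]:
       """Return the sorted completed commit instants of a COW table.

       Assumes the Hudi 0.x timeline layout: completed commits appear as
       <instant>.commit files directly under the table's .hoodie/ directory.
       Inflight/requested instants and other timeline files are ignored.
       """
       hoodie_dir = os.path.join(table_path, ".hoodie")
       return sorted(
           name[: -len(".commit")]
           for name in os.listdir(hoodie_dir)
           if name.endswith(".commit")
       )
   ```

   Any record whose `_hoodie_commit_time` matches an instant in this list should be visible to a query issued after that instant completed.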
   
   ### Logs and Stack Trace
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
