soma1712 opened a new issue, #7659:
URL: https://github.com/apache/hudi/issues/7659

   Hi,
   
   We need to load some tables incrementally from Oracle to S3 and 
build HUDI tables on top of them. During this process, records go missing when the 
spark-submit jobs that apply the deltas to the existing data are executed. Below are 
the steps we followed, in sequence, to load the tables.
   
   1. We extracted all the data from the Oracle table using AWS DMS and created 
parquet files in an S3 bucket as a one-time activity for the initial load.
   2. Executed spark-submit jobs to convert the parquet files into a HUDI table on S3 
and synced the table to the Glue Catalog.
   3. Queried the data in the HUDI table and validated it against the golden-source 
Oracle table to make sure it looks good.
   4. Enabled AWS DMS CDC tasks on the same table to read data incrementally 
from the redo logs and save it as parquet in an S3 bucket.
   5. Executed spark-submit jobs to apply these delta records onto the initially 
created HUDI table.
   6. Validated the table to check whether the deltas were applied.
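
   For context on what "applying the deltas" should do, here is a minimal plain-Python sketch (no Spark or Hudi required) of Hudi's upsert semantics: incoming rows are matched to existing rows on the record key, and when several delta rows share a key, the one with the largest precombine value wins. If the record-key or precombine field configured for the delta job differs from the one used in the initial load, updates can silently miss existing rows. The function name `upsert` and the field names `id`/`ts`/`val` are illustrative, not from the attached scripts.

   ```python
   def upsert(existing, deltas, record_key, precombine):
       """Merge delta rows into existing rows, Hudi-style (illustrative sketch)."""
       # Index the current table state by record key.
       table = {row[record_key]: row for row in existing}
       # Deduplicate the incoming deltas by key, keeping the row with the
       # largest precombine value (e.g. a CDC timestamp column).
       latest = {}
       for row in deltas:
           key = row[record_key]
           if key not in latest or row[precombine] > latest[key][precombine]:
               latest[key] = row
       # New keys are inserted; matched keys are overwritten.
       table.update(latest)
       return list(table.values())

   existing = [{"id": 1, "ts": 1, "val": "a"}, {"id": 2, "ts": 1, "val": "b"}]
   deltas = [{"id": 2, "ts": 2, "val": "b2"}, {"id": 3, "ts": 2, "val": "c"}]
   merged = upsert(existing, deltas, "id", "ts")  # id 2 updated, id 3 inserted
   ```

   In a healthy pipeline, step 6's validation should match this behavior: every CDC row either updates an existing key or inserts a new one.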
   
   During data validation, we found that the records are available in the S3 
CDC bucket, but the HUDI table is not updated with the incremental data. 
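
   For reference, one common way the delta-apply step is wired up on Hudi 0.7.0 with DMS parquet output is HoodieDeltaStreamer with the DMS-aware transformer and payload classes, which handle DMS's `Op` column (insert/update/delete). This is a hedged sketch for comparison against the attached spark-submits, not a copy of them; every `<...>` placeholder (bucket, paths, table name, key and ordering columns) and the jar path are assumptions:

   ```shell
   spark-submit \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     /path/to/hudi-utilities-bundle_2.11-0.7.0.jar \
     --table-type COPY_ON_WRITE \
     --op UPSERT \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --source-ordering-field <cdc_timestamp_column> \
     --target-base-path s3://<bucket>/<hudi_table_path> \
     --target-table <table_name> \
     --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
     --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
     --hoodie-conf hoodie.datasource.write.recordkey.field=<primary_key_column> \
     --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://<cdc_bucket>/<cdc_path>
   ```

   If the attached jobs use a plain DataFrame write instead, the record key, precombine field, and `Op`-column handling still need to match the initial load for the deltas to land.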
   
   Here are the configs and spark-submit commands we used:
   
   Hudi version - 0.7.0
   Spark version - 2.4.0
   
[SparkSubmits.txt](https://github.com/apache/hudi/files/10404025/SparkSubmits.txt)
   
[actvy_log_full.txt](https://github.com/apache/hudi/files/10404030/actvy_log_full.txt)
   
[actvy_log_cdc.txt](https://github.com/apache/hudi/files/10404031/actvy_log_cdc.txt)
   
   Please let me know if more details are needed. 
   
   Regards,
   Soma


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
