RuyRoaV commented on issue #11959: URL: https://github.com/apache/hudi/issues/11959#issuecomment-2531500482
Hi,

Sorry for the delay, I got stuck on some other tasks. I have not yet figured out why this behaviour was observed. What we did was roll back to Hudi 0.11.1; we will try again once we have more clarity on what happened.

1. Concerning Redshift Spectrum

The external schema in our Redshift cluster was created along these lines:

`create external schema if not exists redshift_jdbc from DATA CATALOG database 'lakeformation_dbl' iam_role 'arn:aws:iam::<account-id>:role/<role-name>' region '<region>';`

The Redshift cluster is in a different AWS account from the Glue jobs that update the tables, so we used Lake Formation to share the resources from one account to the other.

Since the partition keys of the table are `cyear`, `cmonth`, `cday`, the query used to check the row count was:

```
SELECT cyear, cmonth, cday, COUNT(*)
FROM db.table
GROUP BY 1,2,3
```

2. Concerning Athena

The query was the same as the one run in Redshift Spectrum:

```
SELECT cyear, cmonth, cday, COUNT(*)
FROM db.table
GROUP BY cyear, cmonth, cday
```

This showed that fewer records were returned with each day that passed. In addition, a Glue job using Spark was run to verify the row count, and we saw the same behaviour as in Athena.

3. There doesn't seem to be any difference in how the tables were created. The table is updated every 10-15 minutes, upserting ~1.4M rows each time.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
