RuyRoaV commented on issue #11959:
URL: https://github.com/apache/hudi/issues/11959#issuecomment-2531500482

   Hi,
   
   Sorry for the delay, I got stuck on some other tasks. I have not yet 
figured out why this behaviour was observed. For now, we rolled back to Hudi 
0.11.1 and will try again once we have more clarity on what happened.
   
   
   1. Concerning Redshift Spectrum 
   
   The external schema in our Redshift cluster was created along these 
lines:
   
   `create external schema if not exists redshift_jdbc from DATA CATALOG 
database 'lakeformation_dbl' iam_role 
'arn:aws:iam::<account-id>:role/<role-name>' region '<region>';`
   
   The Redshift cluster is located in a different AWS account than the Glue 
jobs used to update the tables. Therefore, we used Lake Formation to share the 
resources from one account to the other.
   
   Since the partition keys of the table are `cyear`, `cmonth` and `cday`, the 
query used to check the row count was:
   
   ```
   SELECT cyear, cmonth, cday, COUNT(*) FROM db.table GROUP BY 1,2,3 
   ```
   
   2. Concerning Athena
   
   The query was the same as the one run in Redshift Spectrum:
   
   ```
   SELECT cyear, cmonth, cday, COUNT(*) FROM db.table GROUP BY cyear, cmonth, 
cday
   ```
   
   This let us see that fewer records were returned with each passing 
day.
   
   Moreover, a Glue job using Spark was run to verify the row count and we saw 
the same behaviour as in Athena.
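   
   For reference, a minimal sketch of how two runs of the per-partition 
count query above could be compared automatically. It is only an 
illustration: the function name `find_shrinking_partitions` and the sample 
numbers are hypothetical, and it assumes each query result has already been 
loaded into a dict keyed by `(cyear, cmonth, cday)`.

```python
from typing import Dict, List, Tuple

# A partition is identified by its (cyear, cmonth, cday) values.
Partition = Tuple[int, int, int]


def find_shrinking_partitions(
    earlier: Dict[Partition, int],
    later: Dict[Partition, int],
) -> List[Tuple[Partition, int, int]]:
    """Return (partition, earlier_count, later_count) for every partition
    whose row count dropped, or that vanished, between two snapshots."""
    shrunk = []
    for partition in sorted(earlier):
        before = earlier[partition]
        after = later.get(partition, 0)  # a missing partition counts as 0 rows
        if after < before:
            shrunk.append((partition, before, after))
    return shrunk


# Made-up numbers, for illustration only (not taken from the real table).
snapshot_day_1 = {(2024, 9, 1): 1000, (2024, 9, 2): 1200}
snapshot_day_2 = {(2024, 9, 1): 950, (2024, 9, 2): 1200, (2024, 9, 3): 1100}

for partition, before, after in find_shrinking_partitions(
    snapshot_day_1, snapshot_day_2
):
    print(partition, before, "->", after)
```

   Partitions that only appear in the later snapshot are ignored, since new 
days are expected to show up; only counts that go down are reported.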
   
   3. There doesn't seem to be any difference in how the tables were created. 
   The table is updated every 10-15 minutes, upserting ~1.4M rows each time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
