codope commented on issue #7757:
URL: https://github.com/apache/hudi/issues/7757#issuecomment-1406062367

   @pushpavanthar Thanks for sharing the details. A few notes on the validation:
   1. The filter for `raw_table` and `hudi_table` is not the same.
   2. There could be a mismatch on an hourly basis because dynamic allocation is 
enabled and executors could get lost, resulting in the failure of a deltastreamer 
round. However, the checkpoint would not be updated, so the records will get 
picked up in the subsequent deltastreamer round. In that case, it is 
possible to see fewer records in the Hudi table than in the raw table. It depends on 
when the validation query ran and how far the deltastreamer is lagging behind the 
source table (essentially the lag between processing time and event 
time). 
   3. For the hudi_table, it would be helpful to collect `_hoodie_commit_time` 
values for a particular `created_at_dt` value. Then, we can look into the 
timeline around that commit time for further debugging.
   4. I understand you are running in continuous mode and the source data is being 
continuously updated. But if you know for sure that yesterday's data 
has been processed, can you validate the record count across both 
tables with a simple filter on created_at, like `select count(*) from table 
where created_at = '<yesterday date>'`?
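
   Points 3 and 4 above could be sketched as two queries. The sketch below runs 
them against an in-memory SQLite database only so that it is self-contained; the 
same SQL shape would normally be run through your query engine against the real 
tables. The table names and `_hoodie_commit_time` come from this thread. The 
sample rows, the helper names `count_for_day`, `commit_times_for_day` and 
`build_demo_db`, and the use of `created_at_dt` as a date string in both tables 
are assumptions made for illustration.

```python
import sqlite3


def count_for_day(conn, table, day):
    """Count rows in `table` whose created_at_dt equals `day`."""
    # Table names cannot be bound as parameters, so only pass trusted names.
    query = f"SELECT COUNT(*) FROM {table} WHERE created_at_dt = ?"
    return conn.execute(query, (day,)).fetchone()[0]


def commit_times_for_day(conn, day):
    """List (_hoodie_commit_time, row count) pairs in hudi_table for one day."""
    query = (
        "SELECT _hoodie_commit_time, COUNT(*) FROM hudi_table "
        "WHERE created_at_dt = ? "
        "GROUP BY _hoodie_commit_time ORDER BY _hoodie_commit_time"
    )
    return conn.execute(query, (day,)).fetchall()


def build_demo_db():
    """Create toy tables; the rows are made up and stand in for real data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_table (id INTEGER, created_at_dt TEXT)")
    conn.execute(
        "CREATE TABLE hudi_table "
        "(id INTEGER, created_at_dt TEXT, _hoodie_commit_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_table VALUES (?, ?)",
        [(1, "2023-01-26"), (2, "2023-01-26"), (3, "2023-01-26"), (4, "2023-01-27")],
    )
    conn.executemany(
        "INSERT INTO hudi_table VALUES (?, ?, ?)",
        [
            (1, "2023-01-26", "20230126101500000"),
            (2, "2023-01-26", "20230126111500000"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    day = "2023-01-26"
    print("raw :", count_for_day(conn, "raw_table", day))
    print("hudi:", count_for_day(conn, "hudi_table", day))
    # Commit times to look up in the Hudi timeline when the counts differ.
    for commit_time, rows in commit_times_for_day(conn, day):
        print(commit_time, rows)
```

   If the two counts differ for a day that should be fully processed, the commit 
times returned by the second query indicate where in the timeline to look.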
   
   A few notes on the configuration:
   1. Record key is `id` and precombine field is `_lsn`. I am assuming all 
records in the source table have a unique `id`; otherwise Hudi will dedup the 
records based on `_lsn` and there could be fewer records overall in the 
Hudi table.
   2. I see that schema registry is being used as the schema provider but 
`--schemaprovider-class` is set to `NullTargetSchemaRegistryProvider`. Is it 
because there is some additional transformation on the source data before 
ingesting into the Hudi table, and you need Hudi to be able to infer the 
target schema? If so, can you also share what the transformer is doing in this 
case?
   3. Is it possible to disable dynamic allocation for a few hours or a day and 
then validate the records processed for that day? 
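
   To illustrate point 1 of the configuration notes, here is a simplified model 
of why duplicate record keys reduce the row count. It is not Hudi's 
implementation: it simply keeps, for each `id`, the record with the largest 
`_lsn`. The function name `dedup_by_precombine` and the sample records are 
hypothetical.

```python
def dedup_by_precombine(records, key_field="id", precombine_field="_lsn"):
    """Keep one record per key: the one with the largest precombine value.

    Simplified model for illustration only; ties keep the first record seen.
    """
    latest = {}
    for record in records:
        key = record[key_field]
        current = latest.get(key)
        if current is None or record[precombine_field] > current[precombine_field]:
            latest[key] = record
    return list(latest.values())


if __name__ == "__main__":
    # Made-up source rows: id 1 appears twice with different _lsn values.
    source = [
        {"id": 1, "_lsn": 10},
        {"id": 1, "_lsn": 12},
        {"id": 2, "_lsn": 11},
    ]
    deduped = dedup_by_precombine(source)
    print(len(source), "source rows ->", len(deduped), "rows after dedup")
```

   If `id` is not unique in the source, a plain `count(*)` on the raw table will 
be higher than on the Hudi table even when nothing was lost, so comparing 
distinct `id` counts may be the fairer check.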
   

