codope commented on issue #7757: URL: https://github.com/apache/hudi/issues/7757#issuecomment-1406062367
@pushpavanthar Thanks for sharing the details.

A few notes on the validation:
1. The filters for `raw_table` and `hudi_table` are not the same.
2. There could be an hourly mismatch because dynamic allocation is enabled: executors can get lost, causing a deltastreamer round to fail. In that case the checkpoint is not updated, so the records get picked up in the subsequent deltastreamer round. Until then, though, the Hudi table can show fewer records than the raw table. It depends on when the validation query ran and how far the deltastreamer is lagging behind the source table (essentially the lag between processing time and event time).
3. For `hudi_table`, it would be helpful to collect the `_hoodie_commit_time` values for a particular `created_at_dt` value. Then we can look into the timeline around that commit time for further debugging.
4. I understand you are running in continuous mode and the source data is being continuously updated. But if you know for sure that yesterday's data has been processed, can you validate the record count across both tables with a simple filter on `created_at`, like `select count(*) from table where created_at = '<yesterday date>'`?

A few notes on the configuration:
1. The record key is `id` and the precombine field is `_lsn`. I am assuming all records in the source table have a unique `id`; otherwise Hudi will dedup the records based on `_lsn`, and the Hudi table could end up with fewer records in absolute terms.
2. I see that a schema registry is being used as the schema provider, but `--schemaprovider-class` is set to `NullTargetSchemaRegistryProvider`. Is that because there is some additional transformation on the source data before ingesting into the Hudi table, and you need Hudi to infer the target schema? If so, can you also share what the transformer is doing in this case?
3. Is it possible to disable dynamic allocation for a few hours or a day and then validate the records processed for that day?
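
For reference, here is a minimal sketch of the checks in validation notes 3 and 4: count the records for one fully processed day in both tables, then list the `_hoodie_commit_time` values that wrote that day's records. It uses an in-memory SQLite database purely as a stand-in so the snippet runs on its own; against the real tables the same two queries would go through Spark SQL or whichever engine you already use. The table layouts, the `created_at_dt` date column on both tables, and all rows below are made-up assumptions, not your actual data.

```python
import sqlite3

ALLOWED_TABLES = {"raw_table", "hudi_table"}


def count_for_day(conn, table, day):
    """Count the rows of `table` whose created_at_dt equals `day`."""
    # Table names cannot be bound as SQL parameters, so restrict them to a known set.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table}")
    row = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE created_at_dt = ?", (day,)
    ).fetchone()
    return row[0]


def commit_times_for_day(conn, day):
    """Return the distinct Hudi commit times that wrote records for `day`, sorted."""
    rows = conn.execute(
        "SELECT DISTINCT _hoodie_commit_time FROM hudi_table "
        "WHERE created_at_dt = ? ORDER BY _hoodie_commit_time",
        (day,),
    ).fetchall()
    return [r[0] for r in rows]


def build_demo_db():
    """Create hypothetical stand-ins for the two tables with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_table (id INTEGER, created_at_dt TEXT)")
    conn.execute(
        "CREATE TABLE hudi_table "
        "(id INTEGER, created_at_dt TEXT, _hoodie_commit_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_table VALUES (?, ?)",
        [(1, "2023-01-25"), (2, "2023-01-25"), (3, "2023-01-25"), (4, "2023-01-26")],
    )
    # Record id=3 is deliberately missing here to mimic a lagging ingestion round.
    conn.executemany(
        "INSERT INTO hudi_table VALUES (?, ?, ?)",
        [
            (1, "2023-01-25", "20230125101500000"),
            (2, "2023-01-25", "20230125111500000"),
            (4, "2023-01-26", "20230126091500000"),
        ],
    )
    return conn


conn = build_demo_db()
day = "2023-01-25"
raw_count = count_for_day(conn, "raw_table", day)
hudi_count = count_for_day(conn, "hudi_table", day)
commit_times = commit_times_for_day(conn, day)

print(f"{day}: raw={raw_count} hudi={hudi_count} missing={raw_count - hudi_count}")
print("commit times:", commit_times)
```

If the counts still differ for a day that should be fully ingested, the listed commit times are the instants to look at on the timeline.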
