pushpavanthar commented on issue #7757:
URL: https://github.com/apache/hudi/issues/7757#issuecomment-1406298898

   Thanks for looking into this issue @codope. Below is a brief explanation of the points you mentioned; hopefully it sheds more light on our setup.
   1. The data in the raw table is written by the s3-sink connector, which rolls files every 15 minutes and partitions by a date derived from the Kafka metadata timestamp. I'm checking the count of unique primary keys per **created_at hour** (at minimum, the number of create records should match) for the last 3 days, excluding the current hour (to avoid inconsistencies in the current hour due to the differing nature of the two pipelines). I still scan a buffer of 7+ days to account for outliers when comparing the last 3 days of data.
   2. We have provided sufficient resources for this pipeline and are constantly monitoring for lag; we haven't noticed anything strange with the application or the cluster. Similar to your observation, I suspect the `hoodie.deltastreamer.source.kafka.enable.commit.offset: true` config, which lets Kafka consumer groups manage offsets. There might be a situation where consumer offsets are committed to Kafka and a failure later in the cycle triggers a rollback; the next `deltasync` cycle would then pick up from the next set of offsets and miss the entire batch of older records.
   I'll try running a few pipelines with this config disabled.
   3. For now I've replayed the events to correct the inconsistencies, since they impact our reports. I have seen similar issues in the past on other tables and will do this analysis when I come across the issue again.
   4. The count of unique records across the entire table is lower in the Hudi table than in the raw table. Hence, to dig deeper, I ran the hourly comparison query.
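   For point 2, disabling that config means Hudi tracks offsets via its own commit checkpoints instead of committing them back to the Kafka consumer group, so a rollback can't strand a batch. A sketch of the properties change (everything else unchanged):

```properties
# Stop committing offsets to the Kafka consumer group; rely on the
# checkpoint stored in Hudi commit metadata instead.
hoodie.deltastreamer.source.kafka.enable.commit.offset=false
```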
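   The hourly validation in point 1 boils down to the following logic. Here's a minimal Python sketch with made-up records; in practice we run this as a query against both the raw and Hudi tables and diff the results, and the `id`/`created_at` field names are just illustrative:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def hourly_unique_counts(records, days_back=3):
    """Count distinct primary keys per created_at hour over the last
    `days_back` days, excluding the current (still in-flight) hour."""
    now = datetime(2023, 1, 27, 12, 30)  # fixed "now" for the example
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    cutoff = current_hour - timedelta(days=days_back)
    buckets = defaultdict(set)
    for r in records:
        hour = r["created_at"].replace(minute=0, second=0, microsecond=0)
        if cutoff <= hour < current_hour:  # skip the incomplete hour
            buckets[hour].add(r["id"])
    return {h: len(ids) for h, ids in buckets.items()}

raw = [
    {"id": 1, "created_at": datetime(2023, 1, 27, 10, 5)},
    {"id": 2, "created_at": datetime(2023, 1, 27, 10, 40)},
    {"id": 2, "created_at": datetime(2023, 1, 27, 10, 41)},  # duplicate key, deduped
    {"id": 3, "created_at": datetime(2023, 1, 27, 12, 10)},  # current hour, excluded
]
print(hourly_unique_counts(raw))
```

   Running the same aggregation on the raw and Hudi tables and comparing the per-hour counts is how the mismatch shows up.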
   
   Regarding the notes on configurations:
   1. Verified that all records have unique `id`s. The hourly distinct counts of `id`s on the raw tables match the source DB but do not match the Hudi table.
   2. We apply a transformation that drops `__op` and `__source_ts_ms` and explicitly sets `_hoodie_is_deleted` to false, to make sure we also retain deleted records.
   3. I will try out disabling dynamic allocation.
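   And for note 3, disabling dynamic allocation amounts to pinning the executor count, e.g. (the instance count below is just a placeholder to tune per workload):

```properties
# Fix the executor count instead of letting Spark scale it dynamically.
spark.dynamicAllocation.enabled=false
spark.executor.instances=10
```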
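   For reference, the transformation in note 2 is logically equivalent to this small Python sketch (our actual pipeline does this in a Spark transformer; the sample record below is made up):

```python
def transform(record):
    """Drop the Debezium metadata fields and force _hoodie_is_deleted
    to False so delete events are retained as regular rows."""
    out = {k: v for k, v in record.items() if k not in ("__op", "__source_ts_ms")}
    out["_hoodie_is_deleted"] = False
    return out

evt = {"id": 7, "name": "x", "__op": "d", "__source_ts_ms": 1674800000000}
print(transform(evt))  # metadata fields gone, delete flag pinned to False
```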
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
