abhijeetkushe opened a new issue #2850: URL: https://github.com/apache/hudi/issues/2850
**Describe the problem you faced**

We have a HoodieDeltaStreamer application deployed on EMR that reads objects from the source bucket s3://<landing_bucket>/<event_type>/<year>/<mm>/<dd>, which is populated by a Kinesis Firehose located in a different account, and writes to a destination Hudi table at s3://<target_bucket>/hudi/<event_type>_cow_1/. We have been noticing a number of missing records since the application was deployed in continuous mode on 03/01/2021. When we investigated, we found that HoodieDeltaStreamer was skipping files in the landing bucket that had been created a few seconds before the DeltaStreamer run. I have created an AWS support ticket for this, but I wanted to know whether this is a known issue with HoodieDeltaStreamer and whether you can propose solutions such as [SQS-S3](https://docs.databricks.com/spark/latest/structured-streaming/sqs.html) that could help address it. I describe the problem in more detail below, with the Hudi logs and a table of the S3 files.

| File name | Last Modified Time | Size | Storage class |
| -- | -- | -- | -- |
| ctct-tdp-p2-send-5-2021-04-06-12-28-54-70155892-e563-4a15-b1b8-70b77063ff3b | April 6, 2021, 08:33:57 (UTC-04:00) | 12.3 MB | Standard |
| ctct-tdp-p2-send-5-2021-04-06-12-28-54-d295d90f-0db8-4b42-b0b5-dcce28bbcee6 | April 6, 2021, 08:33:57 (UTC-04:00) | 12.4 MB | Standard |
| ctct-tdp-p2-send-5-2021-04-06-12-29-49-c92cd09d-4dc3-4e38-b77c-f9c6553cf882 | April 6, 2021, 08:34:51 (UTC-04:00) | 15.8 MB | Standard |
| ctct-tdp-p2-send-5-2021-04-06-12-30-27-fbe50352-3933-4758-b511-800388c04027 | April 6, 2021, 08:35:29 (UTC-04:00) | 16.9 MB | Standard |
| ctct-tdp-p2-send-5-2021-04-06-12-30-43-66bc4ad4-36da-40e1-819f-cdd33b3ecd91 | April 6, 2021, 08:35:49 (UTC-04:00) | 17.8 MB | Standard |
| ctct-tdp-p2-send-5-2021-04-06-12-30-47-12af0a73-2123-4a75-911e-0fda1f12bc2c | April 6, 2021, 08:35:50 (UTC-04:00) | 17.7 MB | Standard |
| ctct-tdp-p2-send-5-2021-04-06-12-30-47-f6330ab5-3d18-4518-9e30-1c83a27875d1 | April 6, 2021, 08:35:50 (UTC-04:00) | 17.7 MB | Standard |
| ctct-tdp-p2-send-5-2021-04-06-12-30-48-f32fea58-2ee2-4aa9-9a30-33a0999f2d17 | April 6, 2021, 08:35:50 (UTC-04:00) | 17.7 MB | Standard |
| ctct-tdp-p2-send-5-2021-04-06-12-31-11-059e0cb0-6e2c-4476-8b1b-661a3f3f3e0f | April 6, 2021, 08:36:13 (UTC-04:00) | 18.3 MB | Standard |
| ctct-tdp-p2-send-5-2021-04-06-12-31-26-dbfce7fd-f732-42d8-bc87-49d1b670ab66 | April 6, 2021, 08:36:28 (UTC-04:00) | 18.4 MB | Standard |
| ctct-tdp-p2-send-5-2021-04-06-12-31-26-31c85360-6c75-4ea6-8478-4d5fbfc5213c | April 6, 2021, 08:36:29 (UTC-04:00) | 18.5 MB | Standard |

The file ctct-tdp-p2-send-5-2021-04-06-12-30-43-66bc4ad4-36da-40e1-819f-cdd33b3ecd91 was skipped by Hudi.

The Hudi logs in EMR are in UTC.

2021-04-06 12:35:53 INFO DeltaSync:294 - Checkpoint to resume from : Option{val=1617712051000}

1617712051000 => Tuesday, April 6, 2021 8:27:31 (UTC-04:00) / Tuesday, April 6, 2021 12:27:31 UTC

This means that at 2021-04-06 12:35:53 UTC / 2021-04-06 08:35:53 (UTC-04:00), HoodieDeltaStreamer was trying to read files that had landed from Tuesday, April 6, 2021 8:27:31 AM (UTC-04:00) onwards.

2021-04-06 12:35:55 INFO FileInputFormat:228 - Total input files to process : 58

This statement means 58 files were read in this run, so you would expect every file that landed before 2021-04-06 12:35:53 UTC / 2021-04-06 08:35:53 (UTC-04:00) to be among them.

The next run started from the checkpoint below:

2021-04-06 12:44:36 INFO DeltaSync:294 - Checkpoint to resume from : Option{val=1617712550000}

1617712550000 => Tuesday, April 6, 2021 8:35:50 (UTC-04:00) / Tuesday, April 6, 2021 12:35:50 UTC

Based on the above 2 checkpoints, you would expect all files that landed prior to Tuesday, April 6, 2021 8:35:50 (UTC-04:00) / Tuesday, April 6, 2021 12:35:50 UTC to be picked up in this run. That expectation does hold for the 3 files below, which all landed on April 6, 2021, 08:35:50 (UTC-04:00) / April 6, 2021, 12:35:50 UTC and were read by HoodieDeltaStreamer:

* s3://<landing_bucket>/send/2021/04/06/12/ctct-tdp-p2-send-5-2021-04-06-12-30-47-12af0a73-2123-4a75-911e-0fda1f12bc2c
* s3://<landing_bucket>/send/2021/04/06/12/ctct-tdp-p2-send-5-2021-04-06-12-30-47-f6330ab5-3d18-4518-9e30-1c83a27875d1
* s3://<landing_bucket>/send/2021/04/06/12/ctct-tdp-p2-send-5-2021-04-06-12-30-48-f32fea58-2ee2-4aa9-9a30-33a0999f2d17

But the file ctct-tdp-p2-send-5-2021-04-06-12-30-43-66bc4ad4-36da-40e1-819f-cdd33b3ecd91, which landed on April 6, 2021, 08:35:49 (UTC-04:00) / April 6, 2021, 12:35:49 UTC, was not read.

I have created an AWS support ticket to get details of this error, since AWS announced in December last year that S3 is now strongly consistent: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/

**To Reproduce**

It might be difficult to simulate the rate at which files land, as well as their size.

Steps to reproduce the behavior:

1. A Kinesis Firehose that produces send events with a buffer size of 100 MB or a buffer time of 1 hour
2. An EMR HoodieDeltaStreamer that runs in continuous mode on s3://<landing_bucket>/send/<yyyy>/<mm>/<dd>. Each day a new EMR cluster gets created.

**Expected behavior**

No file gets skipped.

**Environment Description**

* Hudi version : 0.6.0
* Spark version : 2.4.6
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
* EMR instance type: m5.4xlarge
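
The checkpoint arithmetic in the analysis above can be double-checked with a short Python sketch. It converts the two logged checkpoint values to UTC and then lists which of the four files around the gap have a last-modified time later than a given checkpoint. The helper names (`checkpoint_to_utc`, `files_after`) are hypothetical, and the selection rule used here (a file is eligible only if its modification time is strictly later than the checkpoint) is an assumption made for illustration, not something confirmed by the logs.

```python
from datetime import datetime, timezone

# Checkpoint values copied from the two DeltaSync log lines (epoch milliseconds).
FIRST_CHECKPOINT = 1617712051000
SECOND_CHECKPOINT = 1617712550000

# Last-modified times from the S3 table, converted from UTC-04:00 to UTC.
FILES = {
    "ctct-tdp-p2-send-5-2021-04-06-12-30-43-66bc4ad4-36da-40e1-819f-cdd33b3ecd91":
        datetime(2021, 4, 6, 12, 35, 49, tzinfo=timezone.utc),
    "ctct-tdp-p2-send-5-2021-04-06-12-30-47-12af0a73-2123-4a75-911e-0fda1f12bc2c":
        datetime(2021, 4, 6, 12, 35, 50, tzinfo=timezone.utc),
    "ctct-tdp-p2-send-5-2021-04-06-12-30-47-f6330ab5-3d18-4518-9e30-1c83a27875d1":
        datetime(2021, 4, 6, 12, 35, 50, tzinfo=timezone.utc),
    "ctct-tdp-p2-send-5-2021-04-06-12-30-48-f32fea58-2ee2-4aa9-9a30-33a0999f2d17":
        datetime(2021, 4, 6, 12, 35, 50, tzinfo=timezone.utc),
}


def checkpoint_to_utc(checkpoint_ms: int) -> datetime:
    """Convert an epoch-millisecond checkpoint to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(checkpoint_ms / 1000, tz=timezone.utc)


def files_after(checkpoint_ms: int, files: dict) -> list:
    """Return the names of files modified strictly after the checkpoint.

    ASSUMPTION: eligibility is decided by modification time alone; this is
    an illustrative rule, not taken from the Hudi source.
    """
    cutoff = checkpoint_to_utc(checkpoint_ms)
    return sorted(name for name, mtime in files.items() if mtime > cutoff)


if __name__ == "__main__":
    print(checkpoint_to_utc(FIRST_CHECKPOINT).isoformat())   # 2021-04-06T12:27:31+00:00
    print(checkpoint_to_utc(SECOND_CHECKPOINT).isoformat())  # 2021-04-06T12:35:50+00:00
    print(len(files_after(FIRST_CHECKPOINT, FILES)))         # 4
    print(files_after(SECOND_CHECKPOINT, FILES))             # []
```

Under that assumption, all four files are eligible for the run that resumed from the first checkpoint, and none of them is eligible once the checkpoint has advanced to 12:35:50 UTC. So a file stamped 12:35:49 UTC that was missed in the first run would not be picked up by any later run.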
