abhijeetkushe opened a new issue #2850:
URL: https://github.com/apache/hudi/issues/2850


   **Describe the problem you faced**
   We have a HoodieDeltaStreamer application deployed on EMR. It reads objects
from the source bucket s3://<landing_bucket>/<event_type>/<year>/<mm>/<dd>,
which is populated by a Kinesis Firehose located in a different account, and
writes to the destination Hudi table
s3://<target_bucket>/hudi/<event_type>_cow_1/. We have been noticing a number
of missing records since the application was deployed in continuous mode on
03/01/2021. When we investigated, we found that HoodieDeltaStreamer was
skipping files in the landing bucket that had been created a few seconds
before the time DeltaStreamer ran. I have created an AWS support ticket for
this, but I wanted to know whether this is a known issue with
HoodieDeltaStreamer and whether you can propose solutions such as
[SQS-S3](https://docs.databricks.com/spark/latest/structured-streaming/sqs.html)
that could help address it. I describe the problem in more detail below, with
the Hudi logs and a table of the S3 files.
   
   | File name | Last Modified Time | Size | Storage class |
   | -- | -- | -- | -- |
   | ctct-tdp-p2-send-5-2021-04-06-12-28-54-70155892-e563-4a15-b1b8-70b77063ff3b | April 6, 2021, 08:33:57 (UTC-04:00) | 12.3 MB | Standard |
   | ctct-tdp-p2-send-5-2021-04-06-12-28-54-d295d90f-0db8-4b42-b0b5-dcce28bbcee6 | April 6, 2021, 08:33:57 (UTC-04:00) | 12.4 MB | Standard |
   | ctct-tdp-p2-send-5-2021-04-06-12-29-49-c92cd09d-4dc3-4e38-b77c-f9c6553cf882 | April 6, 2021, 08:34:51 (UTC-04:00) | 15.8 MB | Standard |
   | ctct-tdp-p2-send-5-2021-04-06-12-30-27-fbe50352-3933-4758-b511-800388c04027 | April 6, 2021, 08:35:29 (UTC-04:00) | 16.9 MB | Standard |
   | ctct-tdp-p2-send-5-2021-04-06-12-30-43-66bc4ad4-36da-40e1-819f-cdd33b3ecd91 | April 6, 2021, 08:35:49 (UTC-04:00) | 17.8 MB | Standard |
   | ctct-tdp-p2-send-5-2021-04-06-12-30-47-12af0a73-2123-4a75-911e-0fda1f12bc2c | April 6, 2021, 08:35:50 (UTC-04:00) | 17.7 MB | Standard |
   | ctct-tdp-p2-send-5-2021-04-06-12-30-47-f6330ab5-3d18-4518-9e30-1c83a27875d1 | April 6, 2021, 08:35:50 (UTC-04:00) | 17.7 MB | Standard |
   | ctct-tdp-p2-send-5-2021-04-06-12-30-48-f32fea58-2ee2-4aa9-9a30-33a0999f2d17 | April 6, 2021, 08:35:50 (UTC-04:00) | 17.7 MB | Standard |
   | ctct-tdp-p2-send-5-2021-04-06-12-31-11-059e0cb0-6e2c-4476-8b1b-661a3f3f3e0f | April 6, 2021, 08:36:13 (UTC-04:00) | 18.3 MB | Standard |
   | ctct-tdp-p2-send-5-2021-04-06-12-31-26-dbfce7fd-f732-42d8-bc87-49d1b670ab66 | April 6, 2021, 08:36:28 (UTC-04:00) | 18.4 MB | Standard |
   | ctct-tdp-p2-send-5-2021-04-06-12-31-26-31c85360-6c75-4ea6-8478-4d5fbfc5213c | April 6, 2021, 08:36:29 (UTC-04:00) | 18.5 MB | Standard |
   
   The file
ctct-tdp-p2-send-5-2021-04-06-12-30-43-66bc4ad4-36da-40e1-819f-cdd33b3ecd91 was
skipped by Hudi.
   
   The Hudi logs on EMR are in UTC.
   
   2021-04-06 12:35:53 INFO  DeltaSync:294 - Checkpoint to resume from : Option{val=1617712051000}
   1617712051000 => Tuesday, April 6, 2021 8:27:31 (UTC-04:00) / Tuesday, April 6, 2021 12:27:31 UTC
   
   This means that at 2021-04-06 12:35:53 UTC / 2021-04-06 08:35:53 (UTC-04:00),
HoodieDeltaStreamer started reading files that landed from Tuesday, April 6,
2021 8:27:31 AM (UTC-04:00) onwards.
   
   2021-04-06 12:35:55 INFO  FileInputFormat:228 - Total input files to process : 58
   This log line means that 58 files were read in this run, so you would expect
every file that landed before 2021-04-06 12:35:53 UTC / 2021-04-06 08:35:53
(UTC-04:00) to be among them.
   
   The next run started from the below checkpoint
   
   2021-04-06 12:44:36 INFO  DeltaSync:294 - Checkpoint to resume from : 
Option{val=1617712550000} 
   
   1617712550000 => Tuesday, April 6, 2021 8:35:50 (UTC-04:00) / Tuesday, April 
6, 2021 12:35:50 UTC 
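
   The checkpoint values in the two log lines above are epoch milliseconds. As a
quick sanity check, the minimal Python sketch below reproduces both
conversions. The helper name `checkpoint_to_utc` is illustrative and is not
part of Hudi.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-04:00 offset, matching the timestamps shown in the S3 console.
EDT = timezone(timedelta(hours=-4))


def checkpoint_to_utc(checkpoint_ms: int) -> datetime:
    """Convert an epoch-millisecond checkpoint value to an aware UTC datetime."""
    return datetime.fromtimestamp(checkpoint_ms / 1000, tz=timezone.utc)


# The two checkpoint values taken from the DeltaSync log lines.
for checkpoint in (1617712051000, 1617712550000):
    utc = checkpoint_to_utc(checkpoint)
    print(checkpoint, utc.isoformat(), utc.astimezone(EDT).isoformat())
# 1617712051000 2021-04-06T12:27:31+00:00 2021-04-06T08:27:31-04:00
# 1617712550000 2021-04-06T12:35:50+00:00 2021-04-06T08:35:50-04:00
```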
   
   Based on the above two checkpoints, you would expect every file that landed
up to Tuesday, April 6, 2021 8:35:50 (UTC-04:00) / Tuesday, April 6, 2021
12:35:50 UTC to have been picked up by the run that started at 12:35:53 UTC.
   That expectation does hold for the three files below, which all landed at
April 6, 2021, 08:35:50 (UTC-04:00) / April 6, 2021, 12:35:50 UTC and were read
by HoodieDeltaStreamer:
   
   
s3://<landing_bucket>/send/2021/04/06/12/ctct-tdp-p2-send-5-2021-04-06-12-30-47-12af0a73-2123-4a75-911e-0fda1f12bc2c,
 
   
s3://<landing_bucket>/send/2021/04/06/12/ctct-tdp-p2-send-5-2021-04-06-12-30-47-f6330ab5-3d18-4518-9e30-1c83a27875d1,
   
s3://<landing_bucket>/send/2021/04/06/12/ctct-tdp-p2-send-5-2021-04-06-12-30-48-f32fea58-2ee2-4aa9-9a30-33a0999f2d17
 
   
   But the file 
ctct-tdp-p2-send-5-2021-04-06-12-30-43-66bc4ad4-36da-40e1-819f-cdd33b3ecd91 
which landed on April 6, 2021, 08:35:49 (UTC-04:00)  / April 6, 2021, 12:35:49 
UTC was not read 
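
   One possible explanation, offered only as a hypothesis, is a race between the
S3 listing and a checkpoint based on modification time. The sketch below is a
simplified model and not Hudi's actual code: it assumes the source selects
files whose modification time is strictly greater than the checkpoint and then
advances the checkpoint to the largest modification time it selected. The
names `S3Object` and `select_new_files`, and the assumption that the
`12-30-43` file was not yet visible in the first listing, are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class S3Object:
    key: str
    last_modified_ms: int  # epoch milliseconds


def select_new_files(listing: List[S3Object],
                     checkpoint_ms: int) -> Tuple[List[S3Object], int]:
    """Assumed selection rule: keep files strictly newer than the checkpoint,
    then move the checkpoint to the newest modification time selected."""
    selected = sorted(
        (obj for obj in listing if obj.last_modified_ms > checkpoint_ms),
        key=lambda obj: obj.last_modified_ms,
    )
    next_checkpoint = max(
        (obj.last_modified_ms for obj in selected), default=checkpoint_ms
    )
    return selected, next_checkpoint


# Modification times from the table, converted to epoch milliseconds (UTC).
on_time = [
    S3Object("ctct-tdp-p2-send-5-2021-04-06-12-30-47-12af0a73", 1617712550000),
    S3Object("ctct-tdp-p2-send-5-2021-04-06-12-30-47-f6330ab5", 1617712550000),
    S3Object("ctct-tdp-p2-send-5-2021-04-06-12-30-48-f32fea58", 1617712550000),
]
late = S3Object("ctct-tdp-p2-send-5-2021-04-06-12-30-43-66bc4ad4", 1617712549000)

# Run 1 (hypothetical): the 12:35:49 file is not yet visible in the listing.
run1, checkpoint = select_new_files(on_time, 1617712051000)

# Run 2: the file is now listed, but its timestamp is below the checkpoint.
run2, checkpoint_after = select_new_files(on_time + [late], checkpoint)

print(checkpoint)    # 1617712550000, the checkpoint seen in the second log line
print(late in run1)  # False
print(late in run2)  # False: the file is never selected
```

   If this model matches what happened, any file that becomes visible with a
modification time at or below the stored checkpoint would never be selected by
a later run.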
   
   I have created an AWS support ticket to get details about this error, since
AWS announced in December of last year that S3 is now strongly consistent:
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
   
   **To Reproduce**
   This may be difficult to reproduce, because it depends on the rate at which
files land as well as on their size.
   Steps to reproduce the behavior:
   
   1. A Kinesis Firehose that delivers send events with a buffer size of 100 MB
or a buffer time of 1 hour.
   2. An EMR HoodieDeltaStreamer job running in continuous mode on
s3://<landing_bucket>/send/<yyyy>/<mm>/<dd>. A new EMR cluster is created each
day.
   
   **Expected behavior**
   
   No file gets skipped
   
   **Environment Description**
   
   * Hudi version : 0.6.0
   
   * Spark version : 2.4.6
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   * EMR instance type: m5.4xlarge 
   
   

