[ 
https://issues.apache.org/jira/browse/HUDI-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405805#comment-17405805
 ] 

ASF GitHub Bot commented on HUDI-2309:
--------------------------------------

vinothchandar commented on a change in pull request #3526:
URL: https://github.com/apache/hudi/pull/3526#discussion_r697407383



##########
File path: website/blog/2021-08-23-s3-events-source.md
##########
@@ -0,0 +1,117 @@
+---
+title: "Reliable ingestion from AWS S3 using Hudi"
+excerpt: "From listing to log-based approach, a reliable way of ingesting data from AWS S3 into Hudi."
+author: codope
+category: blog
+---
+
+In this post we will talk about a new deltastreamer source which reliably and efficiently processes new data files as they arrive in AWS S3.
+
+<!--truncate-->
+
+## Motivation
+
+As of today, to ingest data from S3 into Hudi, users leverage DFS source whose [path selector](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java) would identify the source files modified since the last checkpoint based on max modification time. 

Review comment:
       For links, can we link to the 0.9.0 branch, or use a permalink?

+The problem with this approach is that modification time precision in S3 is only up to seconds. It may be possible that many files (beyond what the configurable source limit allows) were modified in that second, and some files might be skipped. 
+For more details, please refer to [HUDI-1723](https://issues.apache.org/jira/browse/HUDI-1723). 
+While the workaround is to ignore the source limit and keep reading, the problem motivated us to redesign so that users can reliably ingest from S3.
+
+## Design
+
+For use-cases where seconds granularity does not suffice, we have a new source in deltastreamer using a log-based approach. 
+The new [S3 events source](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java) relies on change notification and incremental processing to ingest from S3. 
+The architecture is as shown in the figure below.
+
+![Different components in the design](/assets/images/blog/s3_events_source_design.png)
+
+In this approach, users need to [enable S3 event notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html). 
+There will be two deltastreamers as detailed below. 
+
+1. [S3EventsSource](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java): Create Hudi S3 metadata table. This source leverages AWS SNS and SQS services that subscribe to file events from the source bucket.

Review comment:
       Can we add links for SQS/SNS?






> Document deltastreamer source for AWS S3
> ----------------------------------------
>
>                 Key: HUDI-2309
>                 URL: https://issues.apache.org/jira/browse/HUDI-2309
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Docs
>            Reporter: Sagar Sumit
>            Assignee: Sagar Sumit
>            Priority: Major
>              Labels: pull-request-available
>



