[
https://issues.apache.org/jira/browse/HUDI-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405710#comment-17405710
]
ASF GitHub Bot commented on HUDI-2309:
--------------------------------------
codope commented on a change in pull request #3526:
URL: https://github.com/apache/hudi/pull/3526#discussion_r697295920
##########
File path: website/blog/2021-08-23-s3-events-source.md
##########
@@ -0,0 +1,111 @@
+---
+title: "Reliable ingestion from AWS S3 using Hudi"
+excerpt: "From listing to log-based approach, a reliable way of ingesting data
from AWS S3 into Hudi."
+author: codope
+category: blog
+---
+
+In this post we will talk about a new deltastreamer source which reliably and
efficiently processes new data files as they arrive in AWS S3.
+
+## Motivation
+
+To ingest from S3, Hudi users leverage the DFS source, whose [path
selector](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java)
identifies the source files modified since the last checkpoint based on
max modification time.
+The problem with this approach is that modification time precision is up to
seconds in S3. It may be possible that there were many files (beyond what the
configurable source limit allows) modified in that second and some files might
be skipped.
+This issue happened in production. For more details, please refer to
[HUDI-1723](https://issues.apache.org/jira/browse/HUDI-1723).
+While the workaround was to ignore the source limit and keep reading, the
problem motivated us to redesign so that users can reliably ingest from S3.
+
+## Design
+
+We wanted to move away from listing to log-based approach.
Review comment:
Reworded.
> Document deltastreamer source for AWS S3
> ----------------------------------------
>
> Key: HUDI-2309
> URL: https://issues.apache.org/jira/browse/HUDI-2309
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Docs
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Major
> Labels: pull-request-available
>