[ https://issues.apache.org/jira/browse/BAHIR-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004574#comment-17004574 ]

ASF subversion and git services commented on BAHIR-213:
-------------------------------------------------------

Commit d036820c0efa1b2e9b8021506164b67582352dff in bahir's branch 
refs/heads/master from abhishekd0907
[ https://gitbox.apache.org/repos/asf?p=bahir.git;h=d036820 ]

[BAHIR-213] Faster S3 file Source for Structured Streaming with SQS (#91)

Using FileStreamSource to read files from an S3 bucket has problems
both in terms of cost and latency:

Latency: Listing all the files in S3 buckets every micro-batch can be both
slow and resource-intensive.

Costs: Making List API requests to S3 every micro-batch can be costly.

The solution is to use Amazon Simple Queue Service (SQS), which lets
you find new files written to an S3 bucket without the need to list all the
files every micro-batch.

S3 buckets can be configured to send a notification to an Amazon SQS queue
on Object Create / Object Delete events. For details, see the AWS documentation:
Configuring S3 Event Notifications
(https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html)
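
As a rough sketch of that wiring (the bucket name, account ID, and queue name
below are placeholders), the notification can be attached with the AWS CLI;
note the queue's access policy must also allow S3 to send messages to it:

    # notification.json: route ObjectCreated events to the SQS queue
    {
      "QueueConfigurations": [
        {
          "QueueArn": "arn:aws:sqs:us-east-1:123456789012:my-queue",
          "Events": ["s3:ObjectCreated:*"]
        }
      ]
    }

    # attach the configuration to the bucket
    aws s3api put-bucket-notification-configuration \
        --bucket my-bucket \
        --notification-configuration file://notification.json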

Spark can leverage this to find new files written to an S3 bucket by reading
notifications from the SQS queue instead of listing files every micro-batch.
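
Each SQS message body then carries an S3 event notification along the lines
of the following (abbreviated, with made-up bucket and key names), from which
the bucket and object key of the new file can be extracted:

    {
      "Records": [
        {
          "eventVersion": "2.1",
          "eventSource": "aws:s3",
          "eventName": "ObjectCreated:Put",
          "s3": {
            "bucket": { "name": "my-bucket" },
            "object": { "key": "data/part-0001.json", "size": 1024 }
          }
        }
      ]
    }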

This PR adds a new SQSSource that uses an Amazon SQS queue to find
new files every micro-batch.

Usage

val inputDf = spark.readStream
   .format("s3-sqs")
   .schema(schema)
   .option("fileFormat", "json")
   .option("sqsUrl", "https://QUEUE_URL")
   .option("region", "us-east-1")
   .load()
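
From there the stream is consumed like any other Structured Streaming source;
a minimal continuation for illustration (output and checkpoint paths are
placeholders):

    val query = inputDf.writeStream
       .format("parquet")
       .option("checkpointLocation", "/path/to/checkpoint")
       .start("/path/to/output")

    query.awaitTermination()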


> Faster S3 file Source for Structured Streaming with SQS
> -------------------------------------------------------
>
>                 Key: BAHIR-213
>                 URL: https://issues.apache.org/jira/browse/BAHIR-213
>             Project: Bahir
>          Issue Type: New Feature
>          Components: Spark Structured Streaming Connectors
>    Affects Versions: Spark-2.4.0
>            Reporter: Abhishek Dixit
>            Priority: Major
>
> Using FileStreamSource to read files from an S3 bucket has problems both in 
> terms of cost and latency:
>  * *Latency:* Listing all the files in S3 buckets every micro-batch can be 
> both slow and resource-intensive.
>  * *Costs:* Making List API requests to S3 every micro-batch can be costly.
>
> The solution is to use Amazon Simple Queue Service (SQS), which lets you find 
> new files written to an S3 bucket without the need to list all the files every 
> micro-batch.
> S3 buckets can be configured to send a notification to an Amazon SQS queue on 
> Object Create / Object Delete events. For details see the AWS documentation: 
> [Configuring S3 Event Notifications|https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html]
>  
> Spark can leverage this to find new files written to an S3 bucket by reading 
> notifications from the SQS queue instead of listing files every micro-batch.
> I hope to contribute the changes proposed in 
> [this pull request|https://github.com/apache/spark/pull/24934] to Apache Bahir, 
> as suggested by [gaborgsomogyi|https://github.com/gaborgsomogyi] 
> [here|https://github.com/apache/spark/pull/24934#issuecomment-511389130].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
