Abhishek Dixit created BAHIR-213:
------------------------------------

             Summary: Faster S3 file Source for Structured Streaming with SQS
                 Key: BAHIR-213
                 URL: https://issues.apache.org/jira/browse/BAHIR-213
             Project: Bahir
          Issue Type: New Feature
          Components: Spark Structured Streaming Connectors
    Affects Versions: Spark-2.3.0, Spark-2.4.0
            Reporter: Abhishek Dixit


Using FileStreamSource to read files from an S3 bucket has problems in terms 
of both cost and latency:
 * *Latency:* Listing all the files in an S3 bucket every microbatch can be 
both slow and resource intensive.
 * *Costs:* Making List API requests to S3 every microbatch can be costly.

The proposed solution is to use Amazon Simple Queue Service (SQS), which lets 
you find new files written to an S3 bucket without listing all the files 
every microbatch.

S3 buckets can be configured to send notifications to an Amazon SQS queue on 
Object Create / Object Delete events. For details, see the AWS documentation: 
[Configuring S3 Event Notifications|https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html]
 

Spark can leverage this to find new files written to an S3 bucket by reading 
notifications from the SQS queue instead of listing files every microbatch.
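
As a rough illustration of the idea (not the code in the PR), the sketch below extracts newly created file paths from the body of one S3 event notification as it would arrive in an SQS message. The function name, the s3a:// path scheme, and the sample bucket and key are assumptions made for this example. Actually receiving messages from SQS would need an AWS SDK client and is left out.

```python
import json
from typing import List
from urllib.parse import unquote_plus


def new_files_from_notification(body: str) -> List[str]:
    """Return paths of objects created, according to one S3 event message body.

    Hypothetical helper for illustration; the s3a:// scheme is an assumption.
    """
    message = json.loads(body)
    paths = []
    # Bodies without "Records" (for example the s3:TestEvent that S3 sends
    # when notifications are first configured) yield no paths.
    for record in message.get("Records", []):
        # Skip anything that is not an object-creation event.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in notifications (a space arrives as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        paths.append("s3a://{}/{}".format(bucket, key))
    return paths


# Made-up sample body with one create event and one delete event.
sample_body = json.dumps({
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "data/part+0001.json", "size": 1024},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "data/old.json"},
            },
        },
    ]
})

if __name__ == "__main__":
    for path in new_files_from_notification(sample_body):
        print(path)
```

With this approach, each microbatch only needs to process the messages received since the previous batch, so the work scales with the number of new files rather than with the total number of files in the bucket.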

 

I hope to contribute [this PR|https://github.com/apache/spark/pull/24934] to 
Apache Bahir, as suggested by @[gaborgsomogyi|https://github.com/gaborgsomogyi] 
[here|https://github.com/apache/spark/pull/24934#issuecomment-511389130].

 

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
