[jira] [Commented] (BAHIR-213) Faster S3 file Source for Structured Streaming with SQS

ASF GitHub Bot (Jira) Wed, 04 Dec 2019 05:43:36 -0800


    [ 
https://issues.apache.org/jira/browse/BAHIR-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987877#comment-16987877
 ]


ASF GitHub Bot commented on BAHIR-213:
--------------------------------------

steveloughran commented on issue #91: [BAHIR-213] Faster S3 file Source for 
Structured Streaming with SQS
URL: https://github.com/apache/bahir/pull/91#issuecomment-561650867
 
 
   
   I'm not directly ignoring you; some problems are just stopping me from 
doing much coding right now. I had hoped to do a PoC of what this would look 
like against hadoop-3.2.1, and so drive whatever changes needed to be done 
there to help this (e.g. delegation tokens to support SQS), plus some tests.
   
   But it's not going to happen this year, sorry.
   
   Similarly, I'm cutting back on approximately all my reviews. Anything 
involving typing basically.
   
   I do think it's important, and I also think somebody needs to look at Spark 
streaming checkpointing: against S3, put-with-overwrite works the way rename 
doesn't. Just not going to be me.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Faster S3 file Source for Structured Streaming with SQS
> -------------------------------------------------------
>
>                 Key: BAHIR-213
>                 URL: https://issues.apache.org/jira/browse/BAHIR-213
>             Project: Bahir
>          Issue Type: New Feature
>          Components: Spark Structured Streaming Connectors
>    Affects Versions: Spark-2.4.0
>            Reporter: Abhishek Dixit
>            Priority: Major
>
> Using FileStreamSource to read files from an S3 bucket has problems both in 
> terms of cost and latency:
>  * *Latency:* Listing all the files in S3 buckets every microbatch can be 
> both slow and resource intensive.
>  * *Costs:* Making List API requests to S3 every microbatch can be costly.
> The solution is to use Amazon Simple Queue Service (SQS), which lets you find 
> new files written to an S3 bucket without listing all the files every 
> microbatch.
> S3 buckets can be configured to send notifications to an Amazon SQS queue on 
> Object Create / Object Delete events. For details, see the AWS documentation here 
> [Configuring S3 Event 
> Notifications|https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html]
>  
> Spark can leverage this to find new files written to an S3 bucket by reading 
> notifications from the SQS queue instead of listing files every microbatch.
> I hope to contribute the changes proposed in [this pull 
> request|https://github.com/apache/spark/pull/24934] to Apache Bahir, as 
> suggested by [gaborgsomogyi|https://github.com/gaborgsomogyi] 
> [here|https://github.com/apache/spark/pull/24934#issuecomment-511389130]
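
The notification-driven discovery described in the issue can be sketched roughly as follows. This is not the code from the pull request, only a minimal illustration of the idea. It assumes the SQS message body is the S3 event notification JSON delivered directly to the queue (a message routed through SNS would be wrapped in another envelope). The function name `new_files_from_sqs_body`, the sample bucket and keys, and the choice of the `s3a://` scheme are all hypothetical.

```python
import json
from typing import List
from urllib.parse import unquote_plus


def new_files_from_sqs_body(body: str) -> List[str]:
    """Return paths of newly created objects named in one S3 event notification.

    Hypothetical helper: the name and the s3a:// scheme are assumptions.
    """
    message = json.loads(body)
    paths = []
    # Messages without a "Records" list (e.g. the S3 test event) yield nothing.
    for record in message.get("Records", []):
        # Keep only object-creation events; skip deletes and anything else.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in notifications (a space arrives as "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        paths.append(f"s3a://{bucket}/{key}")
    return paths


# Made-up sample body with one create event and one delete event.
sample_body = json.dumps({
    "Records": [
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "logs/2019/part+0001.json", "size": 1024}}},
        {"eventName": "ObjectRemoved:Delete",
         "s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "logs/2019/old.json"}}},
    ]
})

print(new_files_from_sqs_body(sample_body))
```

In a streaming source, each microbatch would drain pending messages from the queue and feed the resulting paths to the file reader, so the cost scales with the number of new files rather than with the size of the bucket listing.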



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
