[ https://issues.apache.org/jira/browse/BAHIR-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895999#comment-16895999 ]

Steve Loughran commented on BAHIR-213:
--------------------------------------

BTW, because of the delay between a change in S3 and the matching event being 
processed, there's a risk of further changes happening in the store before the 
stream handler sees the event. For example, a delete race:

1. POST path
2. event #1 queued
3. DELETE path
4. event #2 queued
5. event #1 received
6. FNFE when querying the file (see the downgrade sketch below)
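
A minimal sketch of that downgrade against the Hadoop FileSystem API 
(illustrative only, not code from the PR):

{code:scala}
// Illustrative sketch: treat a missing file as a downgrade, not a failure,
// since the object may already be deleted by the time the event is handled.
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

def statusIfPresent(fs: FileSystem, path: Path): Option[FileStatus] =
  try {
    Some(fs.getFileStatus(path))
  } catch {
    // the DELETE won the race: skip the file rather than fail the batch
    case _: FileNotFoundException => None
  }
{code}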

Also: double update

1. POST path
2. event #1 queued
3. POST path 
4. event #2 queued
5. event #1 received
6. contents of path are already at state (3)
7. event #2 received even though the state hasn't changed (see the dedup sketch below)
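
One way to cope with the redundant event: make processing idempotent by 
remembering the eTag actually read for each key. A sketch, where currentEtag 
and process are placeholder hooks, not part of the proposed source:

{code:scala}
// Sketch: skip an event when the object is already at the state we processed.
import scala.collection.mutable

class EtagDeduper(currentEtag: String => Option[String],
                  process: String => Unit) {
  private val lastProcessed = mutable.Map.empty[String, String]

  def onEvent(key: String): Unit =
    currentEtag(key) match {
      case None =>
        // object already deleted (the race above): downgrade, don't fail
        ()
      case Some(etag) if lastProcessed.get(key).contains(etag) =>
        // event #2 for a change event #1 already picked up: ignore it
        ()
      case Some(etag) =>
        lastProcessed(key) = etag
        process(key)
    }
}
{code}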

There are also two other issues:
* the risk of events arriving out of order
* the risk of a previous state of the file (contents or a tombstone) being seen 
when processing event #1

What does that mean? I think it means you need to handle:
* the file potentially missing when you receive the event...but you still need 
to handle the possibility that a tombstone was cached before the POST #1 
operation, so you may want to spin for a bit awaiting its arrival (see the 
retry sketch below)
* the file details observed while processing the event differing from those in 
the event data
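
The "spin a bit" part could be a bounded retry, again only a sketch against 
the Hadoop FileSystem API:

{code:scala}
// Sketch: wait a bounded time for an object the event says should exist,
// riding out a cached tombstone or delayed visibility, then give up cleanly.
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

def awaitFile(fs: FileSystem, path: Path,
              attempts: Int = 5, delayMs: Long = 500L): Option[FileStatus] = {
  var remaining = attempts
  while (remaining > 0) {
    try {
      return Some(fs.getFileStatus(path))
    } catch {
      case _: FileNotFoundException =>
        remaining -= 1
        if (remaining > 0) Thread.sleep(delayMs)
    }
  }
  None // still absent after the grace period: treat as missing
}
{code}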

The best thing to do here is to demand that every file uploaded MUST have a 
unique name, while making sure that the new stream source is resilient to 
changes (i.e. it downgrades if the source file isn't there...), without 
offering any guarantees of correctness.
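
For the unique-name requirement, something as simple as a UUID in the key on 
the writer side would do; purely illustrative:

{code:scala}
// Sketch: never reuse an object key, so each key maps to exactly one immutable
// blob and the reorder/double-update races above cannot conflate versions.
import java.util.UUID

def uniqueKey(prefix: String, fileName: String): String =
  s"$prefix/${UUID.randomUUID()}-$fileName"

// e.g. uniqueKey("incoming", "part-0000.json.gz")
{code}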






> Faster S3 file Source for Structured Streaming with SQS
> -------------------------------------------------------
>
>                 Key: BAHIR-213
>                 URL: https://issues.apache.org/jira/browse/BAHIR-213
>             Project: Bahir
>          Issue Type: New Feature
>          Components: Spark Structured Streaming Connectors
>    Affects Versions: Spark-2.4.0
>            Reporter: Abhishek Dixit
>            Priority: Major
>
> Using FileStreamSource to read files from an S3 bucket has problems both in 
> terms of costs and latency:
> * *Latency:* Listing all the files in S3 buckets every microbatch can be 
> both slow and resource-intensive.
>  * *Costs:* Making List API requests to S3 every microbatch can be costly.
> The solution is to use Amazon Simple Queue Service (SQS), which lets you find 
> new files written to an S3 bucket without the need to list all the files 
> every microbatch.
> S3 buckets can be configured to send notifications to an Amazon SQS queue on 
> Object Create / Object Delete events. For details, see the AWS documentation: 
> [Configuring S3 Event 
> Notifications|https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html]
>  
> Spark can leverage this to find new files written to an S3 bucket by reading 
> notifications from the SQS queue instead of listing files every microbatch.
> I hope to contribute the changes proposed in [this pull 
> request|https://github.com/apache/spark/pull/24934] to Apache Bahir, as 
> suggested by [gaborgsomogyi|https://github.com/gaborgsomogyi] 
> [here|https://github.com/apache/spark/pull/24934#issuecomment-511389130].
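
For reference, a minimal sketch of the polling loop described above, using the 
AWS SDK for Java v1 (queueUrl and handleNotification are placeholders, not the 
API of the proposed source):

{code:scala}
// Sketch: long-poll SQS for S3 event notifications instead of issuing
// S3 LIST requests every microbatch.
import scala.collection.JavaConverters._
import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest

def pollOnce(queueUrl: String)(handleNotification: String => Unit): Unit = {
  val sqs = AmazonSQSClientBuilder.defaultClient()
  val request = new ReceiveMessageRequest(queueUrl)
    .withMaxNumberOfMessages(10)
    .withWaitTimeSeconds(20) // long polling: cheaper than listing the bucket
  for (m <- sqs.receiveMessage(request).getMessages.asScala) {
    handleNotification(m.getBody) // JSON body carries the S3 object key(s)
    sqs.deleteMessage(queueUrl, m.getReceiptHandle)
  }
}
{code}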



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
