Re: WAL on S3

2015-09-23 Thread Steve Loughran
On 23 Sep 2015, at 14:56, Michal Čizmazia <mici...@gmail.com> wrote: To get around the fact that flush does not work in S3, my custom WAL implementation stores a separate S3 object for each WriteAheadLog.write call. Do you see any gotchas with t
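
For illustration only, a minimal sketch of what such a one-object-per-write WAL could look like. The class name, the handle type, the path scheme, and the constructor arguments below are assumptions of this sketch, not the implementation discussed in the thread; it uses the Hadoop FileSystem API over an s3a:// directory rather than the AWS SDK, and the wiring into Spark's pluggable-WAL mechanism is not shown.

import java.nio.ByteBuffer
import java.util.UUID
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.streaming.util.{WriteAheadLog, WriteAheadLogRecordHandle}

// Hypothetical handle: just remembers which S3 object holds the record.
case class S3ObjectHandle(path: String) extends WriteAheadLogRecordHandle

class OneObjectPerWriteWAL(logDir: String, hadoopConf: Configuration) extends WriteAheadLog {

  private val fs = new Path(logDir).getFileSystem(hadoopConf)

  // Each write creates (and immediately closes) a separate object, so the
  // record becomes visible without ever relying on flush semantics.
  override def write(record: ByteBuffer, time: Long): WriteAheadLogRecordHandle = {
    val path = new Path(logDir, s"$time-${UUID.randomUUID()}")
    val out = fs.create(path)
    try {
      val bytes = new Array[Byte](record.remaining())
      record.get(bytes)
      out.write(bytes)
    } finally {
      out.close()   // the object only appears in S3 once the stream is closed
    }
    S3ObjectHandle(path.toString)
  }

  override def read(handle: WriteAheadLogRecordHandle): ByteBuffer = {
    val in = fs.open(new Path(handle.asInstanceOf[S3ObjectHandle].path))
    val out = new java.io.ByteArrayOutputStream()
    try {
      val buf = new Array[Byte](8192)
      var n = in.read(buf)
      while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
    } finally in.close()
    ByteBuffer.wrap(out.toByteArray)
  }

  override def readAll(): java.util.Iterator[ByteBuffer] =
    fs.listStatus(new Path(logDir)).iterator
      .map(s => read(S3ObjectHandle(s.getPath.toString)))
      .asJava

  // Object names start with the write time, so old records can be dropped by name.
  override def clean(threshTime: Long, waitForCompletion: Boolean): Unit =
    fs.listStatus(new Path(logDir))
      .filter(_.getPath.getName.takeWhile(_ != '-').toLong < threshTime)
      .foreach(s => fs.delete(s.getPath, false))

  override def close(): Unit = ()
}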

Re: WAL on S3

2015-09-23 Thread Michal Čizmazia
On 23 September 2015 at 13:12, Steve Loughran <ste...@hortonworks.com> wrote: > > On 23 Sep 2015, at 14:56, Michal Čizmazia <mici...@gmail.com> wrote: > > To get around the fact that flush does not work in S3, my custom WAL > implementatio

Re: WAL on S3

2015-09-23 Thread Tathagata Das
Responses inline. On Tue, Sep 22, 2015 at 8:35 PM, Michal Čizmazia wrote: > Can checkpoints be stored to S3 (via S3/S3A Hadoop URL)? > > Yes. Because checkpoints are single files by themselves, and do not require flush semantics to work. So S3 is fine. > Trying to answer
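
For illustration, a minimal sketch of checkpointing straight to S3 while the WAL is handled separately. The bucket name, app name, and batch interval below are placeholders, not values from the thread.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "s3a://my-bucket/spark/checkpoints"   // hypothetical bucket

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("wal-on-s3-example")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)   // checkpoint metadata is written as single files
  // ... set up input streams and transformations here ...
  ssc
}

// On restart, recover from the checkpoint if one exists, otherwise create a new context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)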

Re: WAL on S3

2015-09-23 Thread Steve Loughran
On 23 Sep 2015, at 07:10, Tathagata Das wrote: Responses inline. On Tue, Sep 22, 2015 at 8:35 PM, Michal Čizmazia wrote: Can checkpoints be stored to S3 (via S3/S3A Hadoop URL)? Yes. Because

Re: WAL on S3

2015-09-22 Thread Tathagata Das
You can keep the checkpoints in the Hadoop-compatible file system and the WAL somewhere else using your custom WAL implementation. Yes, cleaning things up gets complicated, as it is not as easy as deleting the checkpoint directory - you will have to clean up the checkpoint directory as well as

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
Can checkpoints be stored to S3 (via S3/S3A Hadoop URL)? Trying to answer this question, I looked into Checkpoint.getCheckpointFiles [1]. It does findFirstIn, which would probably call the S3 LIST operation. S3 LIST is prone to eventual consistency [2]. What would happen when

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
I am trying to use the pluggable WAL, but it can be used only with checkpointing turned on. Thus I still need to have a Hadoop-compatible file system. Is there something like pluggable checkpointing? Or can the WAL be used without checkpointing? What happens when the WAL is available but the checkpoint

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
My understanding of the pluggable WAL was that it eliminates the need for having a Hadoop-compatible file system [1]. What is the use of the pluggable WAL when it can only be used together with checkpointing, which still requires a Hadoop-compatible file system? [1]:

Re: WAL on S3

2015-09-22 Thread Tathagata Das
1. Currently, the WAL can be used only with checkpointing turned on, because it does not make sense to recover from the WAL if there is no checkpoint information to recover from. 2. Since the current implementation saves the WAL in the checkpoint directory, they share the same fate -- if checkpoint

Re: WAL on S3

2015-09-18 Thread Tathagata Das
I don't think it would work with multipart upload either. The file is not visible until the multipart upload is explicitly closed. So even if each write were a part upload, none of the parts would be visible until the multipart upload is closed. TD On Fri, Sep 18, 2015 at 1:55 AM, Steve Loughran
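
A short sketch of why per-part uploads do not help, using the AWS Java SDK v1; the bucket, key, and record contents are placeholders. Uploaded parts only become a readable object after completeMultipartUpload, so a reader still cannot see "flushed" data while the upload is open.

import java.io.ByteArrayInputStream
import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model._

val s3 = new AmazonS3Client()
val (bucket, key) = ("my-bucket", "wal/segment-0")   // hypothetical names

val init = s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key))

val record = "one WAL record".getBytes("UTF-8")
val part = s3.uploadPart(new UploadPartRequest()
  .withBucketName(bucket).withKey(key)
  .withUploadId(init.getUploadId)
  .withPartNumber(1)
  .withInputStream(new ByteArrayInputStream(record))
  .withPartSize(record.length.toLong))
// At this point the part is stored, but a GET or LIST on the key still shows nothing.

s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
  bucket, key, init.getUploadId, List(part.getPartETag).asJava))
// Only now does the object (and therefore the data) become visible to readers.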

Re: WAL on S3

2015-09-18 Thread Steve Loughran
> On 17 Sep 2015, at 21:40, Tathagata Das wrote: > > Actually, the current WAL implementation (as of Spark 1.5) does not work with > S3 because S3 does not support flushing. Basically, the current > implementation assumes that after write + flush, the data is immediately

Re: WAL on S3

2015-09-17 Thread Ted Yu
I assume you don't use Kinesis. Are you running Spark 1.5.0? If you must use S3, is switching to Kinesis possible? Cheers On Thu, Sep 17, 2015 at 1:09 PM, Michal Čizmazia wrote: > How can I make Write Ahead Logs work with S3? Any pointers welcome! > > It seems to be a known

Re: WAL on S3

2015-09-17 Thread Tathagata Das
Actually, the current WAL implementation (as of Spark 1.5) does not work with S3 because S3 does not support flushing. Basically, the current implementation assumes that after write + flush, the data is immediately durable, and readable if the system crashes without closing the WAL file. This does

WAL on S3

2015-09-17 Thread Michal Čizmazia
How can I make Write Ahead Logs work with S3? Any pointers welcome! It seems to be a known issue: https://issues.apache.org/jira/browse/SPARK-9215 I am getting this exception when reading the write ahead log: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:

Re: WAL on S3

2015-09-17 Thread Michal Čizmazia
Could you please explain how to use the pluggable WAL? After I implement the WriteAheadLog abstract class, how can I use it? I want to use it with a custom reliable receiver. I am using Spark 1.4.1. Thanks! On 17 September 2015 at 16:40, Tathagata Das wrote: > Actually, the

Re: WAL on S3

2015-09-17 Thread Tathagata Das
You could override the spark conf called "spark.streaming.receiver.writeAheadLog.class" with the class name. https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/util/WriteAheadLogUtils.scala#L30 On Thu, Sep 17, 2015 at 2:04 PM, Michal Čizmazia
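
For illustration, a minimal sketch of setting that conf; the WAL class name below is hypothetical (it refers to the sketch earlier in this thread), and the enable flag is shown on the assumption that the receiver-side WAL also needs to be switched on.

val conf = new org.apache.spark.SparkConf()
  .setAppName("custom-wal")
  // turn on the receiver write ahead log
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  // point Spark at the custom WriteAheadLog implementation
  .set("spark.streaming.receiver.writeAheadLog.class",
       "com.example.streaming.OneObjectPerWriteWAL")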