RE: StructuredStreaming status

2016-10-19 Thread assaf.mendelson
There is one issue I was thinking of. If I understand correctly, structured streaming basically groups by a bucket for time in sliding window (of the step). My problem is that in some cases (e.g. distinct count and any other case where the buffer is relatively large) this would mean copying the
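
A minimal sketch of the concern raised above, assuming each event is assigned to every sliding window that contains it and that each window keeps its own buffer. The helper names (window_starts, distinct_per_window) and the integer timestamps are hypothetical and chosen for illustration; this is not Spark's implementation.

```python
from collections import defaultdict


def window_starts(ts, length, slide):
    """Return the start times of every sliding window that contains ts.

    A window starting at s covers the half-open interval [s, s + length).
    Window starts are assumed to be multiples of `slide`.
    """
    starts = []
    s = ts - (ts % slide)  # latest window start at or before ts
    while s > ts - length:
        starts.append(s)
        s -= slide
    return sorted(starts)


def distinct_per_window(events, length, slide):
    """Count distinct values per window, keeping one set buffer per window."""
    buffers = defaultdict(set)
    for ts, value in events:
        # Each event lands in length / slide buffers, so a large buffer
        # (such as a distinct-count set) is duplicated across windows.
        for start in window_starts(ts, length, slide):
            buffers[start].add(value)
    return {start: len(vals) for start, vals in sorted(buffers.items())}


if __name__ == "__main__":
    events = [(5, "a"), (12, "b"), (25, "a")]
    print(window_starts(25, length=30, slide=10))
    print(distinct_per_window(events, length=30, slide=10))
```

With a 30-unit window sliding every 10 units, every event is stored three times, which is why a large per-window buffer becomes expensive as the window-to-slide ratio grows.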

Re: StructuredStreaming status

2016-10-19 Thread Abhishek R. Singh
It's not so much about latency, actually. The bigger rub for me is that the state has to be reshuffled every micro/mini-batch (unless I am misunderstanding the Spark 2.0 state model). The operator model avoids this by preserving state locality. Event time processing and state purging are

Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
Both Spark Streaming and Structured Streaming preserve locality for operator state actually. They only reshuffle state if a cluster node fails or if the load becomes heavily imbalanced and it's better to launch a task on another node and load the state remotely. Matei > On Oct 19, 2016, at

Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
Yeah, as Shivaram pointed out, there have been research projects that looked at it. Also, Structured Streaming was explicitly designed to not make microbatching part of the API or part of the output behavior (tying triggers to it). However, when people begin working on that is a function of

Re: StructuredStreaming status

2016-10-19 Thread Cody Koeninger
I don't think it's just about what to target - if you could target 1ms batches, without harming 1 second or 1 minute batches why wouldn't you? I think it's about having a clear strategy and dedicating resources to it. If scheduling batches at an order of magnitude or two lower latency is the

Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-19 Thread Reynold Xin
For the contributing guide I think it makes more sense to put it in apache/spark github, since that's where contributors start. I'd also link to it from the website ... On Tue, Oct 18, 2016 at 10:03 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > +1 - Given that our website is

Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
I'm also curious whether there are concerns other than latency with the way stuff executes in Structured Streaming (now that the time steps don't have to act as triggers), as well as what latency people want for various apps. The stateful operator designs for streaming systems aren't inherently

Re: StructuredStreaming status

2016-10-19 Thread Ofir Manor
Thanks a lot, Michael! I really appreciate your sharing. Logistically, I suggest finding a way to tag all Structured Streaming JIRAs, so it wouldn't be so hard to find them for anyone wanting to participate, and also having something like the ML roadmap JIRA. Regarding your list, evicting space

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-19 Thread Michael Armbrust
On Sun, Oct 16, 2016 at 3:50 AM, wrote: > Think of it as jsonl instead of a json file. > Point people at this if they need an official looking spec: > http://jsonlines.org/ > That link is awesome. I think it would be great if someone could open a PR to add this to our
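
To illustrate the "jsonl" point from this thread: a rough sketch of the JSON Lines convention, where each line is parsed on its own as a complete JSON value. It uses only the Python standard library; the function name read_json_lines and the sample records are made up for illustration, and this is not how Spark itself parses the input.

```python
import json


def read_json_lines(text):
    """Parse JSON Lines text: one complete JSON value per non-empty line."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Each line must be valid JSON on its own; a pretty-printed object
        # that spans several lines raises json.JSONDecodeError here.
        records.append(json.loads(line))
    return records


if __name__ == "__main__":
    good = '{"name": "a", "n": 1}\n{"name": "b", "n": 2}\n'
    print(read_json_lines(good))

    pretty = '{\n  "name": "a",\n  "n": 1\n}\n'
    try:
        read_json_lines(pretty)
    except json.JSONDecodeError as err:
        print("not JSON Lines:", err)
```

Because every line stands alone, a large file can be split at line boundaries and each piece parsed independently.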

Re: StructuredStreaming status

2016-10-19 Thread Amit Sela
On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > At the AMPLab we've been working on a research project that looks at > just the scheduling latencies and on techniques to get lower > scheduling latency. It moves away from the micro-batch model, but >

Re: StructuredStreaming status

2016-10-19 Thread Shivaram Venkataraman
At the AMPLab we've been working on a research project that looks at just the scheduling latencies and on techniques to get lower scheduling latency. It moves away from the micro-batch model, but reuses the fault tolerance etc. in Spark. However we haven't yet figured out all the parts in

Re: StructuredStreaming status

2016-10-19 Thread Amit Sela
I've been working on the Apache Beam Spark runner which is (in this context) basically running a streaming model that focuses on event-time and correctness with Spark, and as I see it (even in Spark 1.6.x) the micro-batches are really just added latency, which will work out for some users, and not

Re: StructuredStreaming status

2016-10-19 Thread Michael Armbrust
I know people are seriously thinking about latency. So far that has not been the limiting factor in the users I've been working with. On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger wrote: > Is anyone seriously thinking about alternatives to microbatches? > > On Wed, Oct

Re: StructuredStreaming status

2016-10-19 Thread Cody Koeninger
Is anyone seriously thinking about alternatives to microbatches? On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust wrote: > Anything that is actively being designed should be in JIRA, and it seems > like you found most of it. In general, release windows can be found on

Re: StructuredStreaming status

2016-10-19 Thread Michael Armbrust
Anything that is actively being designed should be in JIRA, and it seems like you found most of it. In general, release windows can be found on the wiki. 2.1 has a lot of stability fixes as well as the Kafka support you mentioned.

Re: [VOTE] Release Apache Spark 1.6.3 (RC1)

2016-10-19 Thread Sean Owen
Yeah I see that too. I'll work on back-porting it. The release otherwise looks good to me, but let's keep testing please to identify anything else in the meantime. On Wed, Oct 19, 2016 at 8:58 AM Pete Robbins wrote: > We see a regression since 1.6.2. I think this PR needs

Re: [VOTE] Release Apache Spark 1.6.3 (RC1)

2016-10-19 Thread Pete Robbins
We see a regression since 1.6.2. I think this PR needs to be backported https://github.com/apache/spark/pull/13784 which resolves SPARK-16078. The PR that causes the issue (for SPARK-15613) was reverted just before 1.6.2 release then re-applied afterwards but this fix was only backported to 2.0.