Re: StructuredStreaming status

2016-10-20 Thread Michael Armbrust
>
> On a personal note, I'm quite surprised that this is all the progress in
> Structured Streaming over the last three months since 2.0 was released. I
> was under the impression that this was one of the biggest things that the
> Spark community actively works on, but that is clearly not the case, given
> that most of the activity is a couple of (very important) JIRAs from the
> last several weeks. Not really sure how to parse that yet...
> I think having some clearer, prioritized roadmap going forward will be a
> good first step to recalibrate expectations for 2.2 and for graduating from an
> alpha state.
>

I totally agree we should spend more time making sure the roadmap is clear
to everyone, but I disagree with this characterization.  There is a lot of
work happening in Structured Streaming. In this next release (2.1 as well
as 2.0.1 and 2.0.2) it has been more about stability and scalability rather
than user visible features.  We are running it for real on production jobs
and working to make it rock solid (everyone can help here!). Just look at the
list of commits.

Regarding the timeline to graduation, I think it's instructive to look at
what happened with Spark SQL.

 - Spark 1.0 - added to Spark
 - Spark 1.1 - basic apis, and stability
 - Spark 1.2 - stabilization of Data Source APIs for plugging in external
sources
 - Spark 1.3 - GA
 - Spark 1.4-1.5 - Tungsten
 - Spark 1.6 - Fully-codegened / memory managed
 - Spark 2.0 - Whole stage codegen, experimental streaming support

We probably won't follow that exactly, and we clearly are not done yet.
However, I think the trajectory is good.

> But Streaming Query sources are still designed with microbatches in mind;
> can this be removed, leaving offset tracking to the executors?


It certainly could be, but what Matei is saying is that user code should be
able to seamlessly upgrade.  A lot of early focus and thought was towards
this goal.  However, these kinds of concerns are exactly why I think it is
premature to expose these internal APIs to end users. Let's build several
Sources and Sinks internally, and figure out what works and what doesn't.
Spark SQL had JSON, Hive, Parquet, and RDDs before we opened up the APIs.
This experience allowed us to keep the Data Source API stable into 2.x and
build a large library of connectors.
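To make the "designed with microbatches in mind" point concrete, here is a minimal, dependency-free sketch of the polling contract behind the internal streaming Source API being discussed. The names (`MemorySource`, `run_once`, integer offsets) are invented for illustration; the real Scala trait lives inside Spark and exchanges DataFrames and `Offset` objects, but the driver-side poll-then-fetch loop is the same idea:

```python
# Toy model of the micro-batch source contract: the driver polls for the
# latest available offset, then asks the source for every record in the
# half-open range (last committed offset, latest offset].

class MemorySource:
    """Toy source backed by a growing in-memory buffer."""
    def __init__(self):
        self._buf = []

    def add(self, record):
        self._buf.append(record)

    def get_offset(self):
        """Latest available offset, or None if no data has arrived yet."""
        return len(self._buf) - 1 if self._buf else None

    def get_batch(self, start, end):
        """Records in the offset range (start, end]."""
        lo = 0 if start is None else start + 1
        return self._buf[lo:end + 1]


def run_once(source, last):
    """One driver-side scheduling round -- the centralized polling step that
    ties the current design to micro-batches. Returns (new_offset, batch)."""
    end = source.get_offset()
    if end is None or (last is not None and end <= last):
        return last, []
    return end, source.get_batch(last, end)
```

The question in the thread is essentially whether this driver-side `run_once` loop (and its offset bookkeeping) could be pushed down to the executors instead.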


Re: StructuredStreaming status

2016-10-20 Thread Amit Sela
On Thu, Oct 20, 2016 at 7:40 AM Matei Zaharia 
wrote:

> Yeah, as Shivaram pointed out, there have been research projects that
> looked at it. Also, Structured Streaming was explicitly designed to not
> make microbatching part of the API or part of the output behavior (tying
> triggers to it).
>
But Streaming Query sources are still designed with microbatches in mind;
can this be removed, leaving offset tracking to the executors?

> However, when people begin working on that is a function of demand
> relative to other features. I don't think we can commit to one plan before
> exploring more options, but basically there is Shivaram's project, which
> adds a few new concepts to the scheduler, and there's the option to reduce
> control plane latency in the current system, which hasn't been heavily
> optimized yet but should be doable (lots of systems can handle 10,000s of
> RPCs per second).
>
> Matei
>
> On Oct 19, 2016, at 9:20 PM, Cody Koeninger  wrote:
>
> I don't think it's just about what to target - if you could target 1ms
> batches, without harming 1 second or 1 minute batches why wouldn't you?
> I think it's about having a clear strategy and dedicating resources to it.
> If scheduling batches at an order of magnitude or two lower latency is the
> strategy, and that's actually feasible, that's great. But I haven't seen
> that clear direction, and this is by no means a recent issue.
>
> On Oct 19, 2016 7:36 PM, "Matei Zaharia"  wrote:
>
> I'm also curious whether there are concerns other than latency with the
> way stuff executes in Structured Streaming (now that the time steps don't
> have to act as triggers), as well as what latency people want for various
> apps.
>
> The stateful operator designs for streaming systems aren't inherently
> "better" than micro-batching -- they lose a lot of stuff that is possible
> in Spark, such as load balancing work dynamically across nodes, speculative
> execution for stragglers, scaling clusters up and down elastically, etc.
> Moreover, Spark itself could execute the current model with much lower
> latency. The question is just what combinations of latency, throughput,
> fault recovery, etc to target.
>
> Matei
>
> On Oct 19, 2016, at 2:18 PM, Amit Sela  wrote:
>
>
>
> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
> At the AMPLab we've been working on a research project that looks at
> just the scheduling latencies and on techniques to get lower
> scheduling latency. It moves away from the micro-batch model, but
> reuses the fault tolerance etc. in Spark. However we haven't yet
> figured out all the parts of integrating this with the rest of
> structured streaming. I'll try to post a design doc / SIP about this
> soon.
>
> On a related note - are there other problems users face with
> micro-batch other than latency?
>
> I think that the fact that they serve as an output trigger is a problem,
> but Structured Streaming seems to resolve this now.
>
>
> Thanks
> Shivaram
>
> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>  wrote:
> > I know people are seriously thinking about latency.  So far that has not
> > been the limiting factor in the users I've been working with.
> >
> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger 
> wrote:
> >>
> >> Is anyone seriously thinking about alternatives to microbatches?
> >>
> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
> >>  wrote:
> >> > Anything that is actively being designed should be in JIRA, and it
> seems
> >> > like you found most of it.  In general, release windows can be found
> on
> >> > the
> >> > wiki.
> >> >
> >> > 2.1 has a lot of stability fixes as well as the kafka support you
> >> > mentioned.
> >> > It may also include some of the following.
> >> >
> >> > The items I'd like to start thinking about next are:
> >> >  - Evicting state from the store based on event time watermarks
> >> >  - Sessionization (grouping together related events by key /
> eventTime)
> >> >  - Improvements to the query planner (remove some of the restrictions
> on
> >> > what queries can be run).
> >> >
> >> > This is roughly in order based on what I've been hearing users hit the
> >> > most.
> >> > Would love more feedback on what is blocking real use cases.
> >> >
> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor 
> >> > wrote:
> >> >>
> >> >> Hi,
> >> >> I hope it is the right forum.
> >> >> I am looking for some information of what to expect from
> >> >> StructuredStreaming in its next releases to help me choose when /
> where
> >> >> to
> >> >> start using it more seriously (or where to invest in workarounds and
> >> >> where
> >> >> to wait). I couldn't find a good place where such planning discussed
> >> >> for 2.1
> >> >> (like, for example ML and SPARK-15581).
> >> >> I'm aware of the 2.0 documented limits
> >> >>
> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),

RE: StructuredStreaming status

2016-10-20 Thread assaf.mendelson
My thoughts were of handling just the “current” state of the sliding window 
(i.e. the “last” window). The idea is that at least in cases which I 
encountered, the sliding window is used to “forget” irrelevant information and 
therefore when a step goes out of date for the “current” window it becomes 
irrelevant.
I agree that this use case is just an example and will also have issues if 
there is a combination of windows. My main issue was that if we need to have a 
relatively large buffer (such as full distinct count) then the memory overhead 
of this can be very high.

As for the example of the map you gave, if I understand correctly how this 
would occur behind the scenes, this just provides the map, but the memory cost 
of having multiple versions of the data remains. As I said, my issue is with the 
high memory overhead.

Consider a simple example: I do a sliding window of 1 day with a 1 minute step. 
There are 1440 minutes per day, which means the groupBy multiplies the cost of 
every aggregation by 1440. For something such as a count or sum 
this might not be a big issue, but if we have an array of, say, 100 elements then 
this can quickly become very costly.

As I said, it is just an idea for optimization for specific use cases.
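The arithmetic above can be written out as a tiny cost model. This is a dependency-free Python sketch under the assumption discussed in this thread (each overlapping window keeps its own copy of the per-key aggregation buffer); the buffer sizes are illustrative, not measured:

```python
# Rough per-key state cost when every overlapping window copies its buffer.

def overlapping_windows(window_minutes, slide_minutes):
    return window_minutes // slide_minutes

def state_bytes_per_key(buffer_bytes, window_minutes, slide_minutes):
    return buffer_bytes * overlapping_windows(window_minutes, slide_minutes)

windows = overlapping_windows(24 * 60, 1)      # 1-day window, 1-minute step -> 1440
cheap = state_bytes_per_key(16, 24 * 60, 1)    # small sum/count buffer: ~23 KB/key
costly = state_bytes_per_key(800, 24 * 60, 1)  # 100-element array (~800 B): ~1.1 MB/key
```

So the same 1440x multiplier is negligible for a 16-byte counter but turns a modest 800-byte buffer into megabytes of state per key, which is the overhead being raised here.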


From: Michael Armbrust [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n1952...@n3.nabble.com]
Sent: Thursday, October 20, 2016 11:16 AM
To: Mendelson, Assaf
Subject: Re: StructuredStreaming status

let’s say we would have implemented distinct count by saving a map with the key 
being the distinct value and the value being the last time we saw this value. 
This would mean that we wouldn’t really need to save all the steps in the 
middle and copy the data, we could only save the last portion.

I don't think you can calculate count distinct in each event time window 
correctly using this map if there is late data, which is one of the key 
problems we are trying to solve with this API.  If you are only tracking the 
last time you saw this value, how do you know if a late data item was already 
accounted for in any given window that is earlier than this "last time"?

We would currently need to track the items seen in each window (though much 
less space is required for approx count distinct).  However, the state eviction 
I mentioned above should also let you give us a boundary on how late data can 
be, and thus how many windows we need to retain state for.  You should also be 
able to group by processing time instead of event time if you want something 
closer to the semantics of DStreams.

Finally, you can already construct the map you describe using structured 
streaming and use its result to output statistics at each trigger window:

df.groupBy($"value")
  .agg(max($"eventTime") as 'lastSeen)
  .writeStream
  .outputMode("complete")
  .trigger(ProcessingTime("5 minutes"))
  .foreach( /* your ForeachWriter */ )







Re: StructuredStreaming status

2016-10-20 Thread Michael Armbrust
>
> let’s say we would have implemented distinct count by saving a map with
> the key being the distinct value and the value being the last time we saw
> this value. This would mean that we wouldn’t really need to save all the
> steps in the middle and copy the data, we could only save the last portion.
>

I don't think you can calculate count distinct in each event time window
correctly using this map if there is late data, which is one of the key
problems we are trying to solve with this API.  If you are only tracking
the last time you saw this value, how do you know if a late data item was
already accounted for in any given window that is earlier than this "last
time"?

We would currently need to track the items seen in each window (though much
less space is required for approx count distinct).  However, the state
eviction I mentioned above should also let you give us a boundary on how
late data can be, and thus how many windows we need to retain state for.  You
should also be able to group by processing time instead of event time if
you want something closer to the semantics of DStreams.

Finally, you can already construct the map you describe using structured
streaming and use its result to output statistics at each trigger window:

df.groupBy($"value")
  .agg(max($"eventTime") as 'lastSeen)
  .writeStream
  .outputMode("complete")
  .trigger(ProcessingTime("5 minutes"))
  .foreach( /* your ForeachWriter */ )
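To make the late-data objection above concrete, here is a toy, dependency-free Python illustration of the two state layouts (hypothetical 10-minute tumbling windows and invented helper names; events are `(value, event_time)` pairs):

```python
# Why a "value -> last seen time" map cannot answer per-window distinct
# counts once late data is allowed, while per-window sets can.

def window_of(event_time):
    return event_time // 10

def last_seen(events):
    """Layout 1 (the proposal): value -> last event time it was seen.
    Compact, but it cannot tell whether a late arrival was already
    counted in an earlier window."""
    out = {}
    for value, t in events:
        out[value] = max(out.get(value, t), t)
    return out

def per_window_distinct(events):
    """Layout 2 (what correctness requires): window -> set of values seen."""
    out = {}
    for value, t in events:
        out.setdefault(window_of(t), set()).add(value)
    return out

events = [
    ("a", 5),   # window 0
    ("a", 25),  # window 2
    ("a", 7),   # LATE arrival for window 0, after "a" was seen at t=25
]
```

`last_seen` reports only that "a" was last seen at t=25, so when the late event at t=7 arrives it has no way to decide whether that is a new distinct value for window 0 or a duplicate. The per-window sets answer directly: window 0 already contains "a".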


RE: StructuredStreaming status

2016-10-19 Thread assaf.mendelson
There is one issue I was thinking of.
If I understand correctly, structured streaming basically groups by a time 
bucket for each step of the sliding window. My problem is that in some cases 
(e.g. distinct count, or any other case where the buffer is relatively large) 
this would mean copying the buffer for each step. This can have a very large 
memory overhead.
There are other solutions for this. For example, let's say we would have 
implemented distinct count by saving a map with the key being the distinct 
value and the value being the last time we saw this value. This would mean that 
we wouldn't really need to save all the steps in the middle and copy the data, 
we could only save the last portion.
This is just an idea for optimization though, certainly nothing of high 
priority.


From: Matei Zaharia [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19513...@n3.nabble.com]
Sent: Thursday, October 20, 2016 3:42 AM
To: Mendelson, Assaf
Subject: Re: StructuredStreaming status

I'm also curious whether there are concerns other than latency with the way 
stuff executes in Structured Streaming (now that the time steps don't have to 
act as triggers), as well as what latency people want for various apps.

The stateful operator designs for streaming systems aren't inherently "better" 
than micro-batching -- they lose a lot of stuff that is possible in Spark, such 
as load balancing work dynamically across nodes, speculative execution for 
stragglers, scaling clusters up and down elastically, etc. Moreover, Spark 
itself could execute the current model with much lower latency. The question is 
just what combinations of latency, throughput, fault recovery, etc to target.

Matei

On Oct 19, 2016, at 2:18 PM, Amit Sela <[hidden 
email]> wrote:


On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <[hidden 
email]> wrote:
At the AMPLab we've been working on a research project that looks at
just the scheduling latencies and on techniques to get lower
scheduling latency. It moves away from the micro-batch model, but
reuses the fault tolerance etc. in Spark. However we haven't yet
figured out all the parts of integrating this with the rest of
structured streaming. I'll try to post a design doc / SIP about this
soon.

On a related note - are there other problems users face with
micro-batch other than latency?
I think that the fact that they serve as an output trigger is a problem, but 
Structured Streaming seems to resolve this now.

Thanks
Shivaram

On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
<[hidden email]> wrote:
> I know people are seriously thinking about latency.  So far that has not
> been the limiting factor in the users I've been working with.
>
> On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <[hidden 
> email]> wrote:
>>
>> Is anyone seriously thinking about alternatives to microbatches?
>>
>> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>> <[hidden email]> wrote:
>> > Anything that is actively being designed should be in JIRA, and it seems
>> > like you found most of it.  In general, release windows can be found on
>> > the
>> > wiki.
>> >
>> > 2.1 has a lot of stability fixes as well as the kafka support you
>> > mentioned.
>> > It may also include some of the following.
>> >
>> > The items I'd like to start thinking about next are:
>> >  - Evicting state from the store based on event time watermarks
>> >  - Sessionization (grouping together related events by key / eventTime)
>> >  - Improvements to the query planner (remove some of the restrictions on
>> > what queries can be run).
>> >
>> > This is roughly in order based on what I've been hearing users hit the
>> > most.
>> > Would love more feedback on what is blocking real use cases.
>> >
>> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <[hidden 
>> > email]>
>> > wrote:
>> >>
>> >> Hi,
>> >> I hope it is the right forum.
>> >> I am looking for some information of what to expect from
>> >> StructuredStreaming in its next releases to help me choose when / where
>> >> to
>> >> start using it more seriously (or where to invest in workarounds and
>> >> where
>> >> to wait). I couldn't find a good place where such planning discussed
>> >> for 2.1
>> >> (like, for example ML and SPARK-15581).
>> >> I'm aware of the 2.0 documented limits
>> >>
>> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>> >> like no support for multiple aggregations level

Re: StructuredStreaming status

2016-10-19 Thread Abhishek R. Singh
It's not so much about latency, actually. The bigger rub for me is that the state 
has to be reshuffled every micro/mini-batch (unless I am not understanding the 
Spark 2.0 state model right).

Operator model avoids it by preserving state locality. Event time processing 
and state purging are the other essentials (which are thankfully getting 
addressed).

Any guidance on (timelines for) expected exit from alpha state would also be 
greatly appreciated.

-Abhishek-

> On Oct 19, 2016, at 5:36 PM, Matei Zaharia  wrote:
> 
> I'm also curious whether there are concerns other than latency with the way 
> stuff executes in Structured Streaming (now that the time steps don't have to 
> act as triggers), as well as what latency people want for various apps.
> 
> The stateful operator designs for streaming systems aren't inherently 
> "better" than micro-batching -- they lose a lot of stuff that is possible in 
> Spark, such as load balancing work dynamically across nodes, speculative 
> execution for stragglers, scaling clusters up and down elastically, etc. 
> Moreover, Spark itself could execute the current model with much lower 
> latency. The question is just what combinations of latency, throughput, fault 
> recovery, etc to target.
> 
> Matei
> 
>> On Oct 19, 2016, at 2:18 PM, Amit Sela wrote:
>> 
>> 
>> 
>> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman
>> <shiva...@eecs.berkeley.edu> wrote:
>> At the AMPLab we've been working on a research project that looks at
>> just the scheduling latencies and on techniques to get lower
>> scheduling latency. It moves away from the micro-batch model, but
>> reuses the fault tolerance etc. in Spark. However we haven't yet
>> figured out all the parts of integrating this with the rest of
>> structured streaming. I'll try to post a design doc / SIP about this
>> soon.
>>
>> On a related note - are there other problems users face with
>> micro-batch other than latency?
>> I think that the fact that they serve as an output trigger is a problem, but 
>> Structured Streaming seems to resolve this now.  
>> 
>> Thanks
>> Shivaram
>> 
>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>> <mich...@databricks.com> wrote:
>> > I know people are seriously thinking about latency.  So far that has not
>> > been the limiting factor in the users I've been working with.
>> >
>> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger wrote:
>> >>
>> >> Is anyone seriously thinking about alternatives to microbatches?
>> >>
>> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>> >> <mich...@databricks.com> wrote:
>> >> > Anything that is actively being designed should be in JIRA, and it seems
>> >> > like you found most of it.  In general, release windows can be found on
>> >> > the
>> >> > wiki.
>> >> >
>> >> > 2.1 has a lot of stability fixes as well as the kafka support you
>> >> > mentioned.
>> >> > It may also include some of the following.
>> >> >
>> >> > The items I'd like to start thinking about next are:
>> >> >  - Evicting state from the store based on event time watermarks
>> >> >  - Sessionization (grouping together related events by key / eventTime)
>> >> >  - Improvements to the query planner (remove some of the restrictions on
>> >> > what queries can be run).
>> >> >
>> >> > This is roughly in order based on what I've been hearing users hit the
>> >> > most.
>> >> > Would love more feedback on what is blocking real use cases.
>> >> >
>> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor
>> >> > wrote:
>> >> >>
>> >> >> Hi,
>> >> >> I hope it is the right forum.
>> >> >> I am looking for some information of what to expect from
>> >> >> StructuredStreaming in its next releases to help me choose when / where
>> >> >> to
>> >> >> start using it more seriously (or where to invest in workarounds and
>> >> >> where
>> >> >> to wait). I couldn't find a good place where such planning discussed
>> >> >> for 2.1
>> >> >> (like, for example ML and SPARK-15581).
>> >> >> I'm aware of the 2.0 documented limits
>> >> >>
>> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>> >> >> like no support for multiple aggregations levels, joins are strictly to
>> >> >> a
>> >> >> static dataset (no SCD or stream-stream) etc, limited sources / sinks
>> >> >> (like
>> >> >> no sink for interactive queries) etc etc
>> >> >> I'm also aware of some changes that have landed in master, like the new
>> >> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
>> >> >> metrics in SPARK-17731, and some improvements for the file source.
>> >> >> If I remember correctly, the discussion on Spark release cadence
>> >> >> concluded
>> >> >> with a preference to a four-month cyc

Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
Both Spark Streaming and Structured Streaming preserve locality for operator 
state actually. They only reshuffle state if a cluster node fails or if the 
load becomes heavily imbalanced and it's better to launch a task on another 
node and load the state remotely.

Matei
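The placement policy Matei describes can be sketched in a few lines. This is a toy, dependency-free Python model with invented names (not Spark's actual scheduler code): prefer the host that already holds a partition's state, and load the state remotely only when that host has died or is heavily overloaded.

```python
# Toy state-locality-aware placement: each state partition remembers its
# last host; schedule there unless it is gone or overloaded, in which case
# the state must be loaded remotely on another host.

def place_partition(last_host, hosts, max_load):
    """hosts: {name: (alive, load)}. Returns (chosen_host, state_moved)."""
    if last_host in hosts:
        alive, load = hosts[last_host]
        if alive and load < max_load:
            return last_host, False  # state stays local
    # Fall back to the least-loaded live host; state is reshuffled/loaded.
    live = {name: load for name, (alive, load) in hosts.items() if alive}
    chosen = min(live, key=live.get)
    return chosen, True
```

Under this policy the common steady-state case keeps state local, and reshuffling happens only on failure or imbalance, which is the behavior described for both Spark Streaming and Structured Streaming.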

> On Oct 19, 2016, at 9:38 PM, Abhishek R. Singh wrote:
> 
> It's not so much about latency, actually. The bigger rub for me is that the 
> state has to be reshuffled every micro/mini-batch (unless I am not 
> understanding the Spark 2.0 state model right).
> 
> Operator model avoids it by preserving state locality. Event time processing 
> and state purging are the other essentials (which are thankfully getting 
> addressed).
> 
> Any guidance on (timelines for) expected exit from alpha state would also be 
> greatly appreciated.
> 
> -Abhishek-
> 
>> On Oct 19, 2016, at 5:36 PM, Matei Zaharia wrote:
>> 
>> I'm also curious whether there are concerns other than latency with the way 
>> stuff executes in Structured Streaming (now that the time steps don't have 
>> to act as triggers), as well as what latency people want for various apps.
>> 
>> The stateful operator designs for streaming systems aren't inherently 
>> "better" than micro-batching -- they lose a lot of stuff that is possible in 
>> Spark, such as load balancing work dynamically across nodes, speculative 
>> execution for stragglers, scaling clusters up and down elastically, etc. 
>> Moreover, Spark itself could execute the current model with much lower 
>> latency. The question is just what combinations of latency, throughput, 
>> fault recovery, etc to target.
>> 
>> Matei
>> 
>>> On Oct 19, 2016, at 2:18 PM, Amit Sela wrote:
>>> 
>>> 
>>> 
>>> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman
>>> <shiva...@eecs.berkeley.edu> wrote:
>>> At the AMPLab we've been working on a research project that looks at
>>> just the scheduling latencies and on techniques to get lower
>>> scheduling latency. It moves away from the micro-batch model, but
>>> reuses the fault tolerance etc. in Spark. However we haven't yet
>>> figured out all the parts of integrating this with the rest of
>>> structured streaming. I'll try to post a design doc / SIP about this
>>> soon.
>>>
>>> On a related note - are there other problems users face with
>>> micro-batch other than latency?
>>> I think that the fact that they serve as an output trigger is a problem, 
>>> but Structured Streaming seems to resolve this now.  
>>> 
>>> Thanks
>>> Shivaram
>>> 
>>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>>> <mich...@databricks.com> wrote:
>>> > I know people are seriously thinking about latency.  So far that has not
>>> > been the limiting factor in the users I've been working with.
>>> >
>>> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger wrote:
>>> >>
>>> >> Is anyone seriously thinking about alternatives to microbatches?
>>> >>
>>> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>>> >> <mich...@databricks.com> wrote:
>>> >> > Anything that is actively being designed should be in JIRA, and it 
>>> >> > seems
>>> >> > like you found most of it.  In general, release windows can be found on
>>> >> > the
>>> >> > wiki.
>>> >> >
>>> >> > 2.1 has a lot of stability fixes as well as the kafka support you
>>> >> > mentioned.
>>> >> > It may also include some of the following.
>>> >> >
>>> >> > The items I'd like to start thinking about next are:
>>> >> >  - Evicting state from the store based on event time watermarks
>>> >> >  - Sessionization (grouping together related events by key / eventTime)
>>> >> >  - Improvements to the query planner (remove some of the restrictions 
>>> >> > on
>>> >> > what queries can be run).
>>> >> >
>>> >> > This is roughly in order based on what I've been hearing users hit the
>>> >> > most.
>>> >> > Would love more feedback on what is blocking real use cases.
>>> >> >
>>> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor
>>> >> > wrote:
>>> >> >>
>>> >> >> Hi,
>>> >> >> I hope it is the right forum.
>>> >> >> I am looking for some information of what to expect from
>>> >> >> StructuredStreaming in its next releases to help me choose when / 
>>> >> >> where
>>> >> >> to
>>> >> >> start using it more seriously (or where to invest in workarounds and
>>> >> >> where
>>> >> >> to wait). I couldn't find a good place where such planning discussed
>>> >> >> for 2.1
>>> >> >> (like, for example ML and SPARK-15581).
>>> >> >> I'm aware of the 2.0 documented limits
>>> >> >>
>>> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>>> >> >> like no support for multiple aggregations levels, joins are strictly 
>>> >> >> to

Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
Yeah, as Shivaram pointed out, there have been research projects that looked at 
it. Also, Structured Streaming was explicitly designed to not make 
microbatching part of the API or part of the output behavior (tying triggers to 
it). However, when people begin working on that is a function of demand 
relative to other features. I don't think we can commit to one plan before 
exploring more options, but basically there is Shivaram's project, which adds a 
few new concepts to the scheduler, and there's the option to reduce control 
plane latency in the current system, which hasn't been heavily optimized yet 
but should be doable (lots of systems can handle 10,000s of RPCs per second).

Matei

> On Oct 19, 2016, at 9:20 PM, Cody Koeninger  wrote:
> 
> I don't think it's just about what to target - if you could target 1ms 
> batches, without harming 1 second or 1 minute batches why wouldn't you?
> I think it's about having a clear strategy and dedicating resources to it. If 
> scheduling batches at an order of magnitude or two lower latency is the 
> strategy, and that's actually feasible, that's great. But I haven't seen that 
> clear direction, and this is by no means a recent issue.
> 
> 
> On Oct 19, 2016 7:36 PM, "Matei Zaharia" wrote:
> I'm also curious whether there are concerns other than latency with the way 
> stuff executes in Structured Streaming (now that the time steps don't have to 
> act as triggers), as well as what latency people want for various apps.
> 
> The stateful operator designs for streaming systems aren't inherently 
> "better" than micro-batching -- they lose a lot of stuff that is possible in 
> Spark, such as load balancing work dynamically across nodes, speculative 
> execution for stragglers, scaling clusters up and down elastically, etc. 
> Moreover, Spark itself could execute the current model with much lower 
> latency. The question is just what combinations of latency, throughput, fault 
> recovery, etc to target.
> 
> Matei
> 
>> On Oct 19, 2016, at 2:18 PM, Amit Sela wrote:
>> 
>> 
>> 
>> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman
>> <shiva...@eecs.berkeley.edu> wrote:
>> At the AMPLab we've been working on a research project that looks at
>> just the scheduling latencies and on techniques to get lower
>> scheduling latency. It moves away from the micro-batch model, but
>> reuses the fault tolerance etc. in Spark. However we haven't yet
>> figured out all the parts of integrating this with the rest of
>> structured streaming. I'll try to post a design doc / SIP about this
>> soon.
>>
>> On a related note - are there other problems users face with
>> micro-batch other than latency?
>> I think that the fact that they serve as an output trigger is a problem, but 
>> Structured Streaming seems to resolve this now.  
>> 
>> Thanks
>> Shivaram
>> 
>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>> <mich...@databricks.com> wrote:
>> > I know people are seriously thinking about latency.  So far that has not
>> > been the limiting factor in the users I've been working with.
>> >
>> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger wrote:
>> >>
>> >> Is anyone seriously thinking about alternatives to microbatches?
>> >>
>> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>> >> <mich...@databricks.com> wrote:
>> >> > Anything that is actively being designed should be in JIRA, and it seems
>> >> > like you found most of it.  In general, release windows can be found on
>> >> > the
>> >> > wiki.
>> >> >
>> >> > 2.1 has a lot of stability fixes as well as the kafka support you
>> >> > mentioned.
>> >> > It may also include some of the following.
>> >> >
>> >> > The items I'd like to start thinking about next are:
>> >> >  - Evicting state from the store based on event time watermarks
>> >> >  - Sessionization (grouping together related events by key / eventTime)
>> >> >  - Improvements to the query planner (remove some of the restrictions on
>> >> > what queries can be run).
>> >> >
>> >> > This is roughly in order based on what I've been hearing users hit the
>> >> > most.
>> >> > Would love more feedback on what is blocking real use cases.
>> >> >
>> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor
>> >> > wrote:
>> >> >>
>> >> >> Hi,
>> >> >> I hope it is the right forum.
>> >> >> I am looking for some information of what to expect from
>> >> >> StructuredStreaming in its next releases to help me choose when / where
>> >> >> to
>> >> >> start using it more seriously (or where to invest in workarounds and
>> >> >> where
>> >> >> to wait). I couldn't find a good place where such planning discussed
>> >> >> for 2.1
>> >> >> (like, for example ML and SPARK-15581).
>> >> >> I'm aware of the 2.0 documented limits
>> >> >>
>> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guid

Re: StructuredStreaming status

2016-10-19 Thread Cody Koeninger
I don't think it's just about what to target - if you could target 1ms
batches without harming 1-second or 1-minute batches, why wouldn't you?
I think it's about having a clear strategy and dedicating resources to it.
If scheduling batches at an order of magnitude or two lower latency is the
strategy, and that's actually feasible, that's great. But I haven't seen
that clear direction, and this is by no means a recent issue.

On Oct 19, 2016 7:36 PM, "Matei Zaharia"  wrote:

> I'm also curious whether there are concerns other than latency with the
> way stuff executes in Structured Streaming (now that the time steps don't
> have to act as triggers), as well as what latency people want for various
> apps.
>
> The stateful operator designs for streaming systems aren't inherently
> "better" than micro-batching -- they lose a lot of stuff that is possible
> in Spark, such as load balancing work dynamically across nodes, speculative
> execution for stragglers, scaling clusters up and down elastically, etc.
> Moreover, Spark itself could execute the current model with much lower
> latency. The question is just what combinations of latency, throughput,
> fault recovery, etc to target.
>
> Matei
>
> On Oct 19, 2016, at 2:18 PM, Amit Sela  wrote:
>
>
>
> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> At the AMPLab we've been working on a research project that looks at
>> just the scheduling latencies and on techniques to get lower
>> scheduling latency. It moves away from the micro-batch model, but
>> reuses the fault tolerance etc. in Spark. However we haven't yet
>> figure out all the parts in integrating this with the rest of
>> structured streaming. I'll try to post a design doc / SIP about this
>> soon.
>>
>> On a related note - are there other problems users face with
>> micro-batch other than latency ?
>>
> I think that the fact that they serve as an output trigger is a problem,
> but Structured Streaming seems to resolve this now.
>
>>
>> Thanks
>> Shivaram
>>
>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>>  wrote:
>> > I know people are seriously thinking about latency.  So far that has not
>> > been the limiting factor in the users I've been working with.
>> >
>> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger 
>> wrote:
>> >>
>> >> Is anyone seriously thinking about alternatives to microbatches?
>> >>
>> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>> >>  wrote:
>> >> > Anything that is actively being designed should be in JIRA, and it
>> seems
>> >> > like you found most of it.  In general, release windows can be found
>> on
>> >> > the
>> >> > wiki.
>> >> >
>> >> > 2.1 has a lot of stability fixes as well as the kafka support you
>> >> > mentioned.
>> >> > It may also include some of the following.
>> >> >
>> >> > The items I'd like to start thinking about next are:
>> >> >  - Evicting state from the store based on event time watermarks
>> >> >  - Sessionization (grouping together related events by key /
>> eventTime)
>> >> >  - Improvements to the query planner (remove some of the
>> restrictions on
>> >> > what queries can be run).
>> >> >
>> >> > This is roughly in order based on what I've been hearing users hit
>> the
>> >> > most.
>> >> > Would love more feedback on what is blocking real use cases.
>> >> >
>> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor 
>> >> > wrote:
>> >> >>
>> >> >> Hi,
>> >> >> I hope it is the right forum.
>> >> >> I am looking for some information of what to expect from
>> >> >> StructuredStreaming in its next releases to help me choose when /
>> where
>> >> >> to
>> >> >> start using it more seriously (or where to invest in workarounds and
>> >> >> where
>> >> >> to wait). I couldn't find a good place where such planning discussed
>> >> >> for 2.1
>> >> >> (like, for example ML and SPARK-15581).
>> >> >> I'm aware of the 2.0 documented limits
>> >> >>
>> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-
>> programming-guide.html#unsupported-operations),
>> >> >> like no support for multiple aggregations levels, joins are
>> strictly to
>> >> >> a
>> >> >> static dataset (no SCD or stream-stream) etc, limited sources /
>> sinks
>> >> >> (like
>> >> >> no sink for interactive queries) etc etc
>> >> >> I'm also aware of some changes that have landed in master, like the
>> new
>> >> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406,
>> the
>> >> >> metrics in SPARK-17731, and some improvements for the file source.
>> >> >> If I remember correctly, the discussion on Spark release cadence
>> >> >> concluded
>> >> >> with a preference to a four-month cycles, with likely code freeze
>> >> >> pretty
>> >> >> soon (end of October). So I believe the scope for 2.1 should likely
>> >> >> quite
>> >> >> clear to some, and that 2.2 planning should likely be starting about
>> >> >> now.
>> >> >> Any visibility / sharing will be highly appreciated!
>> >> >> thanks in advance,
>> >>

Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
I'm also curious whether there are concerns other than latency with the way 
stuff executes in Structured Streaming (now that the time steps don't have to 
act as triggers), as well as what latency people want for various apps.

The stateful operator designs for streaming systems aren't inherently "better" 
than micro-batching -- they lose a lot of stuff that is possible in Spark, such 
as load balancing work dynamically across nodes, speculative execution for 
stragglers, scaling clusters up and down elastically, etc. Moreover, Spark 
itself could execute the current model with much lower latency. The question is 
just what combinations of latency, throughput, fault recovery, etc to target.

Matei

> On Oct 19, 2016, at 2:18 PM, Amit Sela  wrote:
> 
> 
> 
> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
> At the AMPLab we've been working on a research project that looks at
> just the scheduling latencies and on techniques to get lower
> scheduling latency. It moves away from the micro-batch model, but
> reuses the fault tolerance etc. in Spark. However we haven't yet
> figure out all the parts in integrating this with the rest of
> structured streaming. I'll try to post a design doc / SIP about this
> soon.
> 
> On a related note - are there other problems users face with
> micro-batch other than latency ?
> I think that the fact that they serve as an output trigger is a problem, but 
> Structured Streaming seems to resolve this now.  
> 
> Thanks
> Shivaram
> 
> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
> <mich...@databricks.com> wrote:
> > I know people are seriously thinking about latency.  So far that has not
> > been the limiting factor in the users I've been working with.
> >
> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger  wrote:
> >>
> >> Is anyone seriously thinking about alternatives to microbatches?
> >>
> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
> >> <mich...@databricks.com> wrote:
> >> > Anything that is actively being designed should be in JIRA, and it seems
> >> > like you found most of it.  In general, release windows can be found on
> >> > the
> >> > wiki.
> >> >
> >> > 2.1 has a lot of stability fixes as well as the kafka support you
> >> > mentioned.
> >> > It may also include some of the following.
> >> >
> >> > The items I'd like to start thinking about next are:
> >> >  - Evicting state from the store based on event time watermarks
> >> >  - Sessionization (grouping together related events by key / eventTime)
> >> >  - Improvements to the query planner (remove some of the restrictions on
> >> > what queries can be run).
> >> >
> >> > This is roughly in order based on what I've been hearing users hit the
> >> > most.
> >> > Would love more feedback on what is blocking real use cases.
> >> >
> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor 
> >> > wrote:
> >> >>
> >> >> Hi,
> >> >> I hope it is the right forum.
> >> >> I am looking for some information of what to expect from
> >> >> StructuredStreaming in its next releases to help me choose when / where
> >> >> to
> >> >> start using it more seriously (or where to invest in workarounds and
> >> >> where
> >> >> to wait). I couldn't find a good place where such planning discussed
> >> >> for 2.1
> >> >> (like, for example ML and SPARK-15581).
> >> >> I'm aware of the 2.0 documented limits
> >> >>
> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
> >> >> like no support for multiple aggregations levels, joins are strictly to
> >> >> a
> >> >> static dataset (no SCD or stream-stream) etc, limited sources / sinks
> >> >> (like
> >> >> no sink for interactive queries) etc etc
> >> >> I'm also aware of some changes that have landed in master, like the new
> >> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
> >> >> metrics in SPARK-17731, and some improvements for the file source.
> >> >> If I remember correctly, the discussion on Spark release cadence
> >> >> concluded
> >> >> with a preference to a four-month cycles, with likely code freeze
> >> >> pretty
> >> >> soon (end of October). So I believe the scope for 2.1 should likely
> >> >> quite
> >> >> clear to some, and that 2.2 planning should likely be starting about
> >> >> now.
> >> >> Any visibility / sharing will be highly appreciated!
> >> >> thanks in advance,
> >> >>
> >> >> Ofir Manor
> >> >>
> >> >> Co-Founder & CTO | Equalum
> >> >>
> >> >> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
> >> >
> >> >
> >
> >
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: StructuredStreaming status

2016-10-19 Thread Ofir Manor
Thanks a lot Michael! I really appreciate your sharing.
Logistically, I suggest finding a way to tag all Structured Streaming
JIRAs, so it wouldn't be so hard to look for them for anyone wanting to
participate, and also having something like the ML roadmap JIRA.
Regarding your list, evicting state seems very important. If I understand
correctly, currently state grows forever (when using windows), so it is
impractical to run a long-running streaming job with decent state. It would
be great if the user could bound the state by event time (it is also very
natural).
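To make the bounded-state idea concrete, here is a toy, Spark-independent
simulation of watermark-based eviction (all names here, like
run_windowed_counts, are made up for illustration - this is not any Spark
API): counts are kept per (key, window), and once the event-time watermark
passes a window's end, that window's state is dropped, so state stays
bounded no matter how long the job runs.

```python
from collections import defaultdict

def run_windowed_counts(batches, window_size, allowed_lateness):
    """Toy micro-batch loop: count events per (key, window) and evict
    windows that fall entirely behind the event-time watermark.
    `batches` is a list of micro-batches, each a list of (key, event_time)."""
    state = defaultdict(int)          # (key, window_start) -> count
    max_event_time = 0
    for batch in batches:
        for key, t in batch:
            window_start = (t // window_size) * window_size
            state[(key, window_start)] += 1
            max_event_time = max(max_event_time, t)
        # Watermark: assume no events older than this will still arrive.
        watermark = max_event_time - allowed_lateness
        closed = [k for k in state if k[1] + window_size <= watermark]
        for k in closed:
            del state[k]              # window closed -> evict its state
    return dict(state)

batches = [[("a", 1), ("a", 3)], [("a", 12), ("b", 14)], [("a", 25)]]
final = run_windowed_counts(batches, window_size=10, allowed_lateness=5)
```

By the last batch the watermark has passed the first two windows, so only
the window starting at 20 remains in state.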
I personally see sessionization as lower priority (it seems like a niche
requirement). To me, supporting only a single stream of events that can
only be joined to static datasets makes building anything but the simplest
of short-running streaming jobs problematic (all interesting datasets
change over time). Also, the promise of interactive queries on top of a
computed, live dataset likely has wider appeal (it has been presented since
early this year as one of the goals of Structured Streaming). It would also
help to make the sources and sinks API nicer for third-party developers, to
encourage adoption and plugins, or to beef up the list of built-in
exactly-once sources and sinks (maybe also have a pluggable state store, as
I've seen some wanting, which may better enable interactive queries).
In addition, I think you should really identify what needs to be done to
make this API stable and focus on that. I think that for adoption, you'll
need to be clear on the full list of gaps / gotchas, and clearly
communicate the project priorities / target timeline (again, just like ML
does it), hopefully after some community discussion...

On a personal note, I'm quite surprised that this is all the progress in
Structured Streaming over the last three months since 2.0 was released. I
was under the impression that this was one of the biggest things that the
Spark community actively works on, but that is clearly not the case, given
that most of the activity is a couple of (very important) JIRAs from the
last several weeks. Not really sure how to parse that yet...
I think having some clearer, prioritized roadmap going forward will be a
good first step to recalibrate expectations for 2.2 and for graduating from
an alpha state. But especially, I think you guys seriously need to figure
out what's the bottleneck here (lack of a dedicated owner? lack of
committers focusing on it?) and just fix it (recruit new committers to work
on it?) to have a competitive streaming offering in a few quarters.

Just my two cents,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, Oct 19, 2016 at 10:45 PM, Michael Armbrust 
wrote:

> Anything that is actively being designed should be in JIRA, and it seems
> like you found most of it.  In general, release windows can be found on the
> wiki.
>
> 2.1 has a lot of stability fixes as well as the kafka support you
> mentioned.  It may also include some of the following.
>
> The items I'd like to start thinking about next are:
>  - Evicting state from the store based on event time watermarks
>  - Sessionization (grouping together related events by key / eventTime)
>  - Improvements to the query planner (remove some of the restrictions on
> what queries can be run).
>
> This is roughly in order based on what I've been hearing users hit the
> most.  Would love more feedback on what is blocking real use cases.
>
> On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor  wrote:
>
>> Hi,
>> I hope it is the right forum.
>> I am looking for some information of what to expect from
>> StructuredStreaming in its next releases to help me choose when / where to
>> start using it more seriously (or where to invest in workarounds and where
>> to wait). I couldn't find a good place where such planning discussed for
>> 2.1  (like, for example ML and SPARK-15581).
>> I'm aware of the 2.0 documented limits (http://spark.apache.org/docs/
>> 2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>> like no support for multiple aggregations levels, joins are strictly to a
>> static dataset (no SCD or stream-stream) etc, limited sources / sinks (like
>> no sink for interactive queries) etc etc
>> I'm also aware of some changes that have landed in master, like the new
>> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
>> metrics in SPARK-17731, and some improvements for the file source.
>> If I remember correctly, the discussion on Spark release cadence
>> concluded with a preference to a four-month cycles, with likely code freeze
>> pretty soon (end of October). So I believe the scope for 2.1 should likely
>> quite clear to some, and that 2.2 planning should likely be starting about
>> now.
>> Any visibility / sharing will be highly appreciated!
>> thanks in advance,
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Emai

Re: StructuredStreaming status

2016-10-19 Thread Amit Sela
On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> At the AMPLab we've been working on a research project that looks at
> just the scheduling latencies and on techniques to get lower
> scheduling latency. It moves away from the micro-batch model, but
> reuses the fault tolerance etc. in Spark. However we haven't yet
> figure out all the parts in integrating this with the rest of
> structured streaming. I'll try to post a design doc / SIP about this
> soon.
>
> On a related note - are there other problems users face with
> micro-batch other than latency ?
>
I think that the fact that they serve as an output trigger is a problem,
but Structured Streaming seems to resolve this now.

>
> Thanks
> Shivaram
>
> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>  wrote:
> > I know people are seriously thinking about latency.  So far that has not
> > been the limiting factor in the users I've been working with.
> >
> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger 
> wrote:
> >>
> >> Is anyone seriously thinking about alternatives to microbatches?
> >>
> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
> >>  wrote:
> >> > Anything that is actively being designed should be in JIRA, and it
> seems
> >> > like you found most of it.  In general, release windows can be found
> on
> >> > the
> >> > wiki.
> >> >
> >> > 2.1 has a lot of stability fixes as well as the kafka support you
> >> > mentioned.
> >> > It may also include some of the following.
> >> >
> >> > The items I'd like to start thinking about next are:
> >> >  - Evicting state from the store based on event time watermarks
> >> >  - Sessionization (grouping together related events by key /
> eventTime)
> >> >  - Improvements to the query planner (remove some of the restrictions
> on
> >> > what queries can be run).
> >> >
> >> > This is roughly in order based on what I've been hearing users hit the
> >> > most.
> >> > Would love more feedback on what is blocking real use cases.
> >> >
> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor 
> >> > wrote:
> >> >>
> >> >> Hi,
> >> >> I hope it is the right forum.
> >> >> I am looking for some information of what to expect from
> >> >> StructuredStreaming in its next releases to help me choose when /
> where
> >> >> to
> >> >> start using it more seriously (or where to invest in workarounds and
> >> >> where
> >> >> to wait). I couldn't find a good place where such planning discussed
> >> >> for 2.1
> >> >> (like, for example ML and SPARK-15581).
> >> >> I'm aware of the 2.0 documented limits
> >> >>
> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
> >> >> like no support for multiple aggregations levels, joins are strictly
> to
> >> >> a
> >> >> static dataset (no SCD or stream-stream) etc, limited sources / sinks
> >> >> (like
> >> >> no sink for interactive queries) etc etc
> >> >> I'm also aware of some changes that have landed in master, like the
> new
> >> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
> >> >> metrics in SPARK-17731, and some improvements for the file source.
> >> >> If I remember correctly, the discussion on Spark release cadence
> >> >> concluded
> >> >> with a preference to a four-month cycles, with likely code freeze
> >> >> pretty
> >> >> soon (end of October). So I believe the scope for 2.1 should likely
> >> >> quite
> >> >> clear to some, and that 2.2 planning should likely be starting about
> >> >> now.
> >> >> Any visibility / sharing will be highly appreciated!
> >> >> thanks in advance,
> >> >>
> >> >> Ofir Manor
> >> >>
> >> >> Co-Founder & CTO | Equalum
> >> >>
> >> >> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
> >> >
> >> >
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: StructuredStreaming status

2016-10-19 Thread Shivaram Venkataraman
At the AMPLab we've been working on a research project that looks at
just the scheduling latencies and on techniques to get lower
scheduling latency. It moves away from the micro-batch model, but
reuses the fault tolerance etc. in Spark. However, we haven't yet
figured out all the parts of integrating this with the rest of
structured streaming. I'll try to post a design doc / SIP about this
soon.

On a related note - are there other problems users face with
micro-batch other than latency ?

Thanks
Shivaram

On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
 wrote:
> I know people are seriously thinking about latency.  So far that has not
> been the limiting factor in the users I've been working with.
>
> On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger  wrote:
>>
>> Is anyone seriously thinking about alternatives to microbatches?
>>
>> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>>  wrote:
>> > Anything that is actively being designed should be in JIRA, and it seems
>> > like you found most of it.  In general, release windows can be found on
>> > the
>> > wiki.
>> >
>> > 2.1 has a lot of stability fixes as well as the kafka support you
>> > mentioned.
>> > It may also include some of the following.
>> >
>> > The items I'd like to start thinking about next are:
>> >  - Evicting state from the store based on event time watermarks
>> >  - Sessionization (grouping together related events by key / eventTime)
>> >  - Improvements to the query planner (remove some of the restrictions on
>> > what queries can be run).
>> >
>> > This is roughly in order based on what I've been hearing users hit the
>> > most.
>> > Would love more feedback on what is blocking real use cases.
>> >
>> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor 
>> > wrote:
>> >>
>> >> Hi,
>> >> I hope it is the right forum.
>> >> I am looking for some information of what to expect from
>> >> StructuredStreaming in its next releases to help me choose when / where
>> >> to
>> >> start using it more seriously (or where to invest in workarounds and
>> >> where
>> >> to wait). I couldn't find a good place where such planning discussed
>> >> for 2.1
>> >> (like, for example ML and SPARK-15581).
>> >> I'm aware of the 2.0 documented limits
>> >>
>> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>> >> like no support for multiple aggregations levels, joins are strictly to
>> >> a
>> >> static dataset (no SCD or stream-stream) etc, limited sources / sinks
>> >> (like
>> >> no sink for interactive queries) etc etc
>> >> I'm also aware of some changes that have landed in master, like the new
>> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
>> >> metrics in SPARK-17731, and some improvements for the file source.
>> >> If I remember correctly, the discussion on Spark release cadence
>> >> concluded
>> >> with a preference to a four-month cycles, with likely code freeze
>> >> pretty
>> >> soon (end of October). So I believe the scope for 2.1 should likely
>> >> quite
>> >> clear to some, and that 2.2 planning should likely be starting about
>> >> now.
>> >> Any visibility / sharing will be highly appreciated!
>> >> thanks in advance,
>> >>
>> >> Ofir Manor
>> >>
>> >> Co-Founder & CTO | Equalum
>> >>
>> >> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>> >
>> >
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: StructuredStreaming status

2016-10-19 Thread Amit Sela
I've been working on the Apache Beam Spark runner, which is (in this
context) basically running a streaming model that focuses on event time and
correctness with Spark, and as I see it (even in Spark 1.6.x) the
micro-batches are really just added latency, which will work out for some
users and not for others, and that's OK. Structured Streaming triggers make
it even better with computing on trigger (other systems compute constantly
but output only on trigger, so not much difference there).
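To illustrate the compute-continuously-but-output-on-trigger point, here is
a toy sketch in plain Python (not Beam or Spark code - run_with_trigger and
its arguments are invented names): state is updated on every event, but a
result is emitted only when the trigger interval elapses.

```python
def run_with_trigger(stream, trigger_interval):
    """Toy output-on-trigger semantics: `stream` is a list of
    (arrival_time, value) pairs; a running total is updated on every
    event, but (trigger_time, total) is emitted only on trigger."""
    outputs = []
    total = 0
    next_fire = trigger_interval
    for arrival_time, value in stream:
        total += value                    # state updates continuously
        if arrival_time >= next_fire:     # ...but output only on trigger
            outputs.append((next_fire, total))
            next_fire += trigger_interval
    return outputs

stream = [(1, 10), (2, 5), (5, 1), (9, 4), (11, 2)]
out = run_with_trigger(stream, trigger_interval=5)
```

With a 5-unit trigger, the running total is published twice, even though it
was updated five times.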

I'm actually curious about a couple of things:

   - State store API - having a state API available is extremely useful for
   streaming on many fronts:
  - Available for sources (and sinks?) to avoid immortalizing
  micro-batch reads (simply let tasks pick up where they left off in the
  previous micro-batch).
  - Can help get rid of resuming from checkpoint - you can simply restart;
  that goes for upgrading Spark jobs, as well as for wrapping accumulators
  and broadcasts in getOrCreate methods (and if not resuming from a
  checkpoint you can avoid wrapping your DAG construction in getOrCreate
  as well).
  - The fact that it aims to be pluggable enables building platforms
  (not just frameworks) with Spark.
  - Finally, it is basically the basis for any stateful computation
  Spark will support.
   - Evicting state, as Michael pointed out - currently, using for
   example overlapping windows grows the Dataset really quickly.
   - Encoders API - where does it stand? Will developers/users be able to
   define a custom schema for, say, a generically typed class? Will it get
   along with inner classes, static classes, etc.?
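For illustration only, here is a minimal sketch of what a pluggable state
store could look like - a hypothetical interface, not Spark's actual
internal StateStore API: versioned key-value state with per-batch atomic
commit, so a restarted job can resume from the last committed version
instead of replaying a checkpoint.

```python
class InMemoryStateStore:
    """Hypothetical pluggable state store: get/put within a batch,
    plus commit/abort so each micro-batch applies atomically."""
    def __init__(self):
        self._committed = {}     # last committed snapshot
        self._pending = {}       # updates from the in-flight batch
        self.version = 0

    def get(self, key, default=None):
        # Pending updates shadow the committed snapshot.
        return self._pending.get(key, self._committed.get(key, default))

    def put(self, key, value):
        self._pending[key] = value

    def commit(self):
        # Atomically publish the batch's updates as a new version.
        self._committed.update(self._pending)
        self._pending.clear()
        self.version += 1
        return self.version

    def abort(self):
        # Discard the failed batch's updates; committed state is untouched.
        self._pending.clear()

store = InMemoryStateStore()
store.put("offset:partition-0", 42)
v = store.commit()               # batch 1 succeeds
store.put("offset:partition-0", 99)
store.abort()                    # batch 2 fails; 42 survives
```

A source could keep its last read position in such a store, so a restart
simply resumes from the committed offset.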

Thanks,
Amit

On Wed, Oct 19, 2016 at 11:30 PM Michael Armbrust 
wrote:

> I know people are seriously thinking about latency.  So far that has not
> been the limiting factor in the users I've been working with.
>
> On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger 
> wrote:
>
> Is anyone seriously thinking about alternatives to microbatches?
>
> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>  wrote:
> > Anything that is actively being designed should be in JIRA, and it seems
> > like you found most of it.  In general, release windows can be found on
> the
> > wiki.
> >
> > 2.1 has a lot of stability fixes as well as the kafka support you
> mentioned.
> > It may also include some of the following.
> >
> > The items I'd like to start thinking about next are:
> >  - Evicting state from the store based on event time watermarks
> >  - Sessionization (grouping together related events by key / eventTime)
> >  - Improvements to the query planner (remove some of the restrictions on
> > what queries can be run).
> >
> > This is roughly in order based on what I've been hearing users hit the
> most.
> > Would love more feedback on what is blocking real use cases.
> >
> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor 
> wrote:
> >>
> >> Hi,
> >> I hope it is the right forum.
> >> I am looking for some information of what to expect from
> >> StructuredStreaming in its next releases to help me choose when / where
> to
> >> start using it more seriously (or where to invest in workarounds and
> where
> >> to wait). I couldn't find a good place where such planning discussed
> for 2.1
> >> (like, for example ML and SPARK-15581).
> >> I'm aware of the 2.0 documented limits
> >> (
> http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations
> ),
> >> like no support for multiple aggregations levels, joins are strictly to
> a
> >> static dataset (no SCD or stream-stream) etc, limited sources / sinks
> (like
> >> no sink for interactive queries) etc etc
> >> I'm also aware of some changes that have landed in master, like the new
> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
> >> metrics in SPARK-17731, and some improvements for the file source.
> >> If I remember correctly, the discussion on Spark release cadence
> concluded
> >> with a preference to a four-month cycles, with likely code freeze pretty
> >> soon (end of October). So I believe the scope for 2.1 should likely
> quite
> >> clear to some, and that 2.2 planning should likely be starting about
> now.
> >> Any visibility / sharing will be highly appreciated!
> >> thanks in advance,
> >>
> >> Ofir Manor
> >>
> >> Co-Founder & CTO | Equalum
> >>
> >> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
> >
> >
>
>
>


Re: StructuredStreaming status

2016-10-19 Thread Michael Armbrust
I know people are seriously thinking about latency.  So far that has not
been the limiting factor in the users I've been working with.

On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger  wrote:

> Is anyone seriously thinking about alternatives to microbatches?
>
> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>  wrote:
> > Anything that is actively being designed should be in JIRA, and it seems
> > like you found most of it.  In general, release windows can be found on
> the
> > wiki.
> >
> > 2.1 has a lot of stability fixes as well as the kafka support you
> mentioned.
> > It may also include some of the following.
> >
> > The items I'd like to start thinking about next are:
> >  - Evicting state from the store based on event time watermarks
> >  - Sessionization (grouping together related events by key / eventTime)
> >  - Improvements to the query planner (remove some of the restrictions on
> > what queries can be run).
> >
> > This is roughly in order based on what I've been hearing users hit the
> most.
> > Would love more feedback on what is blocking real use cases.
> >
> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor 
> wrote:
> >>
> >> Hi,
> >> I hope it is the right forum.
> >> I am looking for some information of what to expect from
> >> StructuredStreaming in its next releases to help me choose when / where
> to
> >> start using it more seriously (or where to invest in workarounds and
> where
> >> to wait). I couldn't find a good place where such planning discussed
> for 2.1
> >> (like, for example ML and SPARK-15581).
> >> I'm aware of the 2.0 documented limits
> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-
> programming-guide.html#unsupported-operations),
> >> like no support for multiple aggregations levels, joins are strictly to
> a
> >> static dataset (no SCD or stream-stream) etc, limited sources / sinks
> (like
> >> no sink for interactive queries) etc etc
> >> I'm also aware of some changes that have landed in master, like the new
> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
> >> metrics in SPARK-17731, and some improvements for the file source.
> >> If I remember correctly, the discussion on Spark release cadence
> concluded
> >> with a preference to a four-month cycles, with likely code freeze pretty
> >> soon (end of October). So I believe the scope for 2.1 should likely
> quite
> >> clear to some, and that 2.2 planning should likely be starting about
> now.
> >> Any visibility / sharing will be highly appreciated!
> >> thanks in advance,
> >>
> >> Ofir Manor
> >>
> >> Co-Founder & CTO | Equalum
> >>
> >> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
> >
> >
>


Re: StructuredStreaming status

2016-10-19 Thread Cody Koeninger
Is anyone seriously thinking about alternatives to microbatches?

On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
 wrote:
> Anything that is actively being designed should be in JIRA, and it seems
> like you found most of it.  In general, release windows can be found on the
> wiki.
>
> 2.1 has a lot of stability fixes as well as the kafka support you mentioned.
> It may also include some of the following.
>
> The items I'd like to start thinking about next are:
>  - Evicting state from the store based on event time watermarks
>  - Sessionization (grouping together related events by key / eventTime)
>  - Improvements to the query planner (remove some of the restrictions on
> what queries can be run).
>
> This is roughly in order based on what I've been hearing users hit the most.
> Would love more feedback on what is blocking real use cases.
>
> On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor  wrote:
>>
>> Hi,
>> I hope it is the right forum.
>> I am looking for some information of what to expect from
>> StructuredStreaming in its next releases to help me choose when / where to
>> start using it more seriously (or where to invest in workarounds and where
>> to wait). I couldn't find a good place where such planning discussed for 2.1
>> (like, for example ML and SPARK-15581).
>> I'm aware of the 2.0 documented limits
>> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>> like no support for multiple aggregations levels, joins are strictly to a
>> static dataset (no SCD or stream-stream) etc, limited sources / sinks (like
>> no sink for interactive queries) etc etc
>> I'm also aware of some changes that have landed in master, like the new
>> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
>> metrics in SPARK-17731, and some improvements for the file source.
>> If I remember correctly, the discussion on Spark release cadence concluded
>> with a preference to a four-month cycles, with likely code freeze pretty
>> soon (end of October). So I believe the scope for 2.1 should likely quite
>> clear to some, and that 2.2 planning should likely be starting about now.
>> Any visibility / sharing will be highly appreciated!
>> thanks in advance,
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
>




Re: StructuredStreaming status

2016-10-19 Thread Michael Armbrust
Anything that is actively being designed should be in JIRA, and it seems
like you found most of it.  In general, release windows can be found on the
wiki.

2.1 has a lot of stability fixes as well as the kafka support you
mentioned.  It may also include some of the following.

The items I'd like to start thinking about next are:
 - Evicting state from the store based on event time watermarks
 - Sessionization (grouping together related events by key / eventTime)
 - Improvements to the query planner (remove some of the restrictions on
what queries can be run).

This is roughly in order based on what I've been hearing users hit the
most.  Would love more feedback on what is blocking real use cases.
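To make the first roadmap item concrete, here is a toy sketch of watermark-based state eviction: keep running per-window counts, and drop any window whose end falls behind the current event-time watermark (the max event time seen, minus an allowed lateness). This is purely illustrative of the idea under discussion — the class name, window size, and lateness bound are made up here, not Spark's eventual design or API.

```python
class WindowedCounter:
    """Toy tumbling-window counter with watermark-driven state eviction."""

    def __init__(self, window_size: int, allowed_lateness: int):
        self.window_size = window_size          # seconds per window
        self.allowed_lateness = allowed_lateness
        self.state = {}                         # window_start -> count
        self.max_event_time = 0

    def process(self, event_time: int) -> None:
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness

        start = (event_time // self.window_size) * self.window_size
        if start + self.window_size > watermark:   # not too late: count it
            self.state[start] = self.state.get(start, 0) + 1

        # Evict windows that can no longer receive non-late events,
        # so state does not grow without bound.
        for w in [w for w in self.state if w + self.window_size <= watermark]:
            del self.state[w]


c = WindowedCounter(window_size=300, allowed_lateness=600)
for t in [0, 300, 2000]:
    c.process(t)
print(c.state)   # only the [1800, 2100) window survives: {1800: 1}
```

The event at t=2000 advances the watermark to 1400, so the state for the two earlier windows is evicted — which is exactly the unbounded-state problem the watermark item is meant to solve.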
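The sessionization item can likewise be sketched in miniature: split one key's sorted event times into sessions whenever the gap between consecutive events exceeds an inactivity timeout. Again, the function and the 30-minute gap are illustrative assumptions, not the API that Spark would ship.

```python
def sessionize(times, gap: int = 30 * 60):
    """Split sorted event times into sessions by an inactivity gap (seconds)."""
    sessions, current = [], []
    for t in sorted(times):
        if current and t - current[-1] > gap:   # gap exceeded: close session
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions


print(sessionize([0, 100, 200, 5000, 5100]))
# two sessions: [[0, 100, 200], [5000, 5100]]
```

In a streaming setting this grouping has to be maintained incrementally per key, which is why it pairs naturally with the watermark work: a session can only be finalized once the watermark guarantees no more events can join it.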

On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor  wrote:

> Hi,
> I hope it is the right forum.
> I am looking for some information about what to expect from
> StructuredStreaming in its next releases, to help me choose when / where to
> start using it more seriously (or where to invest in workarounds and where
> to wait). I couldn't find a good place where such planning is discussed for
> 2.1 (like, for example, ML and SPARK-15581).
> I'm aware of the 2.0 documented limits
> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
> like no support for multiple aggregation levels, joins restricted to a
> static dataset (no SCD or stream-stream joins), limited sources / sinks
> (e.g. no sink for interactive queries), etc.
> I'm also aware of some changes that have landed in master, like the new
> Kafka 0.10 source (and its ongoing improvements) in SPARK-15406, the
> metrics in SPARK-17731, and some improvements for the file source.
> If I remember correctly, the discussion on Spark release cadence concluded
> with a preference for four-month cycles, with a code freeze likely coming
> soon (end of October). So I believe the scope for 2.1 should already be
> quite clear to some, and that 2.2 planning should be starting about now.
> Any visibility / sharing will be highly appreciated!
> thanks in advance,
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>