Thanks Michael for explaining the activity on SS as well as giving your opinion on some items!
Replying inline.

On Wed, Oct 31, 2018 at 5:44 AM, Michael Armbrust <mich...@databricks.com> wrote:

> Thanks for bringing up some possible future directions for streaming. Here
> are some thoughts:
>
> - I personally view all of the activity on Spark SQL also as activity on
> Structured Streaming. The great thing about building streaming on catalyst
> / tungsten is that continued improvement to these components improves
> streaming use cases as well.

While I agree with you (in terms of performance and improvements to built-in functions), as I enumerated, the streaming area has its own features which require separate effort. It would also be great if someone resumed putting major effort into continuous mode (a Spark-specific project): I guess we were waiting on barrier execution. I'm happy to help by reviewing the design doc or by taking up implementation of part(s) of it.

> - I think the biggest ongoing project is DataSourceV2, whose goal is to
> provide a stable / performant API for streaming and batch data sources to
> plug in. I think connectivity to many different systems is one of the most
> powerful aspects of Spark, and right now there is no stable public API for
> streaming. A lot of committer / PMC time is being spent here at the moment.

100% agree that DSv2 should be stabilized sooner rather than later, and I understand major efforts are going there.

> - As you mention, 2.4.0 significantly improves the built-in connectivity
> for Kafka, giving us the ability to read exactly once from a topic being
> written to by transactional producers. I think projects to extend this
> guarantee to the Kafka sink and also to improve authentication with Kafka
> are a great idea (and it seems like there is a lot of review activity on
> the latter).

Actually, I was spending time designing the former, and realized it would have to give up either scalability or transactionality to respect Spark's exactly-once contract.
(Most storages don't support transactions spanning multiple connections, so a transaction can't be coordinated across tasks. They also don't support moving data without resending it.) That's why I sent a mail in a different thread about loosening the contract. I think it is related to DSv2 and needs to be considered while discussing DSv2, since the issue affects not only Kafka but most external storages.

> You bring up some other possible projects, like session window support.
> This is an interesting project, but as far as I can tell there is
> still a lot of work that would need to be done before this feature could be
> merged. We'd need to understand how it works with update mode, amongst
> other things. Additionally, a 3000+ line patch is really time consuming to
> review. This, coupled with the fact that all the users I have
> interacted with need "session windows + some custom business logic"
> (usually implemented with flatMapGroupsWithState), means that I'm more
> inclined to direct limited review bandwidth to incremental improvements in
> that feature than to something large/new. This is not to say that this
> feature isn't useful / shouldn't be merged, just a bit of explanation as to
> why there might be less activity here than you would hope.

Yeah, while I would like to get more feedback on the session window work (without feedback I need to explore all possible paths by myself), I didn't intend to draw attention to session windows in this mail thread. The rationale for this thread is to draw attention to a broader area: feature support in streaming overall. Anyway, thanks for explaining! For individual contributors, determining whether a proposal is (softly) rejected or not is very important when deciding on further investigation, and this helped a lot in understanding the current status.

> Similarly, multiple aggregations are an often requested feature.
> However,
> fundamentally, this is going to be a fairly large investment (I think we'd
> need to combine the unsupported operation checker and the query planner, and
> also create a high-performance (i.e. whole-stage code-gened) aggregation
> operator that understands negation).

Agree. Just curious: could you explain what you mean by "negation"? Does it mean applying retractions to aggregated results?

> Thanks again for starting the discussion, and looking forward to hearing
> about what features are most requested!
>
> On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim <kabh...@gmail.com> wrote:
>
>> Adding more: again, it doesn't mean they're feasible to do. Just a kind
>> of brainstorming.
>>
>> * SPARK-20568: Delete files after processing in structured streaming
>>   * There hasn't been consensus on supporting this: there were voices
>>     for both YES and NO.
>> * Support multiple levels of aggregation in structured streaming
>>   * There are plenty of questions on Stack Overflow about this. While I
>>     don't think it makes sense in structured streaming if it requires an
>>     additional shuffle, there might be another case: group by keys, apply
>>     aggregation, then apply aggregation on the aggregated result (the
>>     grouping keys don't change).
>>
>> On Mon, Oct 22, 2018 at 12:25 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>
>>> Yeah, the main intention of this thread is to collect interest on a
>>> possible feature list for structured streaming. From what I can see in
>>> the Spark community, most of the discussions as well as contributions
>>> are for SQL, and I'd wish to see similar activeness / effort on
>>> structured streaming.
>>> (Unfortunately there's less effort spent reviewing others' work - design
>>> docs as well as pull requests - most effort looks like it goes into
>>> people's own work.)
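One plausible reading of "negation" here (an assumption on my part, not something the thread confirms): the aggregation operator would consume retractions, so that a downstream aggregate over first-level results can be corrected when one of those results changes. A minimal, Spark-free Python sketch of the idea, with purely illustrative names:

```python
from collections import defaultdict

class TwoLevelAgg:
    """Two-level aggregation with retraction ("negation").

    First level:  key -> running sum.
    Second level: count of keys per current sum value, kept
    consistent by retracting a key's old first-level result
    before adding its new one.
    """

    def __init__(self):
        self.per_key = {}                     # first level
        self.count_by_sum = defaultdict(int)  # second level

    def update(self, key, value):
        if key in self.per_key:
            old = self.per_key[key]
            # Negation: retract the stale first-level result downstream
            # instead of recomputing the second level from scratch.
            self.count_by_sum[old] -= 1
            if self.count_by_sum[old] == 0:
                del self.count_by_sum[old]
        new = self.per_key.get(key, 0) + value
        self.per_key[key] = new
        self.count_by_sum[new] += 1

agg = TwoLevelAgg()
agg.update("a", 3)
agg.update("b", 3)
agg.update("a", 2)  # retracts a=3 downstream, adds a=5
print(dict(agg.count_by_sum))  # {3: 1, 5: 1}
```

For a plain single-level SUM, retraction collapses to just adding the delta; negation only really matters once an aggregate feeds another aggregate, which is exactly the multiple-aggregation case discussed above.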
>>>
>>> I respect the role of PMC members, so the final decision would be up to
>>> them, but contributors as well as end users could show interest and
>>> discuss requirements on an SPIP, which could be good background to
>>> persuade PMC members.
>>>
>>> Before going deep, I guess we could use this thread to discuss possible
>>> use cases, and if we would like to move forward on an individual item we
>>> could initiate (or resurrect) its discussion thread.
>>>
>>> For queryable state, at least there seems to be no workaround in Spark
>>> that provides a similar thing, especially as state gets bigger. I may
>>> have some concerns on the details, but I'll add my thoughts on the
>>> discussion thread.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> On Mon, Oct 22, 2018 at 1:15 AM, Stavros Kontopoulos <
>>> stavros.kontopou...@lightbend.com> wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> I just tried to start the discussion on the dev list a long time ago.
>>>> I enumerated some use cases as Michael proposed here
>>>> <http://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCACTd3c_snT=y4r9vod+ebty1fdgtqsxzgjgubox-k8araur...@mail.gmail.com%3E>.
>>>> The discussion didn't go further.
>>>>
>>>> If people find it useful, we should start discussing it in detail again.
>>>>
>>>> Stavros
>>>>
>>>> On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <kabh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Stavros, if my memory is right, you were trying to drive queryable
>>>>> state, right?
>>>>>
>>>>> Could you summarize the progress and the reason why it stopped?
>>>>>
>>>>> On Sun, Oct 21, 2018 at 10:27 PM, Stavros Kontopoulos <
>>>>> stavros.kontopou...@lightbend.com> wrote:
>>>>>
>>>>>> That is a very interesting list, thanks. I could create a design doc
>>>>>> as a starting point for discussion if this is a feature we would
>>>>>> like to have.
>>>>>>
>>>>>> Regards,
>>>>>> Stavros
>>>>>>
>>>>>> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qcsd2...@163.com> wrote:
>>>>>>
>>>>>>> Thanks for raising them.
>>>>>>>
>>>>>>> FYI, I believe this open issue could also be considered:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-24630
>>>>>>>
>>>>>>> A new ability to express Structured Streaming in pure SQL.
>>>>>>>
>>>>>>> --
>>>>>>> Sent from:
>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
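For readers following the session-window part of the thread: independent of whatever API Spark ends up with, the core of gap-based sessionization is to extend the current session while events arrive within the gap, and open a new one otherwise. A minimal, Spark-free Python sketch (function name and shapes are illustrative only):

```python
from typing import List, Tuple

def sessionize(timestamps: List[int], gap: int) -> List[Tuple[int, int]]:
    """Group event timestamps into sessions.

    A new session starts whenever the gap between consecutive
    events reaches `gap`; each session is (start, end), where end
    is the last event's timestamp plus the gap (the window closes
    only after the inactivity gap has elapsed).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Event falls inside the current open session: extend it.
            start, _ = sessions[-1]
            sessions[-1] = (start, ts + gap)
        else:
            # Gap exceeded (or first event): open a new session.
            sessions.append((ts, ts + gap))
    return sessions

# Events at t=1,2,3 then t=10,11 with a gap of 5 form two sessions.
print(sessionize([1, 2, 3, 10, 11], gap=5))  # [(1, 8), (10, 16)]
```

Per key, this is essentially the state-update logic one would hand-roll today inside flatMapGroupsWithState, with the gap playing the role of the session timeout.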