Re: [spark structured streaming runner] merge to master?

Alexey Romanenko Wed, 30 Oct 2019 11:13:00 -0700

Yes, agreed: two jars included in an uber jar will work in a similar way. Though 
having 3 jars still looks quite confusing to me.


> On 29 Oct 2019, at 23:54, Kenneth Knowles <k...@apache.org> wrote:
> 
> Is it just as easy to have two jars and build an uber jar with both included? 
> Then the runner can still be toggled with a flag.
> 
> Kenn
> 
> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <aromanenko....@gmail.com> wrote:
> Hmm, I don’t think that jar size should play a big role compared to the 
> whole size of the shaded jar of a user's job. Moreover, I think it will be quite 
> confusing for users to choose which jar to use if we have 3 different 
> ones for similar purposes. Though, let’s see what others think.
> 
>> On 29 Oct 2019, at 15:32, Etienne Chauchot <echauc...@apache.org> wrote:
>> 
>> Hi Alexey, 
>> Thanks for your opinion !
>> 
>> Comments inline
>> 
>> Etienne
>> On 28/10/2019 17:34, Alexey Romanenko wrote:
>>> Let me share some of my thoughts on this.
>>>>>     - shall we filter out the package name from the release? 
>>> Until the new runner is ready to be used in production (or at least for 
>>> beta testing, in which case users should be clearly warned), I believe we 
>>> need to filter out its classes from the published jar to avoid 
>>> confusion.
>> Yes that is what I think also
>>>>>     - should we release 2 jars: one for the old and one for the new ? 
>>>>>     - should we release 3 jars: one for the old, one for the new and one 
>>>>> for both ?
>>>>> 
>>> Once the new runner is released, I think we should provide only one 
>>> single jar and allow users to switch between the different Spark runners with 
>>> a CLI option.
>> I would vote for 3 jars: one for new, one for old, and one for both. Indeed, 
>> in some cases, users are looking very closely at the size of jars. This 
>> solution meets all use cases
>>>>>     - should we create a special entry to the capability matrix ?
>>>>> 
>>> 
>>> Sure, since it has its own unique characteristics and implementation, but 
>>> again, only once the new runner is "officially released".
>> +1
>>> 
>>> 
>>>> On 28 Oct 2019, at 10:27, Etienne Chauchot <echauc...@apache.org> wrote:
>>>> 
>>>> Hi guys,
>>>> 
>>>> Any opinions on point 2 (communication to users) ?
>>>> 
>>>> Etienne
>>>> On 24/10/2019 15:44, Etienne Chauchot wrote:
>>>>> Hi guys,
>>>>> 
>>>>> I'm glad to announce that the PR for the merge to master of the new 
>>>>> runner based on Spark Structured Streaming framework is submitted:
>>>>> 
>>>>> https://github.com/apache/beam/pull/9866
>>>>> 
>>>>> 1. Regarding the status of the runner: 
>>>>> -the runner passes 93% of the validates runner tests in batch mode.
>>>>> 
>>>>> -Streaming mode is barely started (waiting for the multi-aggregations 
>>>>> support in spark Structured Streaming framework from the Spark community)
>>>>> 
>>>>> -Runner can execute Nexmark
>>>>> 
>>>>> -Some things are not wired up yet
>>>>> 
>>>>>   -Beam Schemas not wired with Spark Schemas
>>>>> 
>>>>>   -Optional features of the model not implemented: state api, timer api, 
>>>>> splittable doFn api, …
>>>>> 
>>>>> 2. Regarding the communication to users:
>>>>> 
>>>>> - for reasons explained by Ismael: the runner is in the same module as 
>>>>> the "older" one. But it is in a different sub-package and both runners 
>>>>> share the same build.  
>>>>> - How should we communicate to users: 
>>>>>     - shall we filter out the package name from the release? 
>>>>>     - should we release 2 jars: one for the old and one for the new ? 
>>>>>     - should we release 3 jars: one for the old, one for the new and one 
>>>>> for both ?
>>>>> 
>>>>>     - should we create a special entry to the capability matrix ?
>>>>> 
>>>>> WDYT ?
>>>>> Best
>>>>> 
>>>>> Etienne
>>>>> 
>>>>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>>>>>> +1 to merge.
>>>>>> 
>>>>>> It is worth keeping things in master with an explicitly marked status. It 
>>>>>> will make the effort more visible to users and make it easier to get feedback.
>>>>>> 
>>>>>> --Mikhail
>>>>>> 
>>>>>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <echauc...@apache.org> wrote:
>>>>>> Hi guys,
>>>>>> 
>>>>>> The new spark runner now supports beam coders and passes 93% of the 
>>>>>> batch validates runner tests (+4%). I think it is time to merge it to 
>>>>>> master. I will submit a PR in the coming days.
>>>>>> 
>>>>>> Next steps: support schemas and thus better leverage the Catalyst optimizer 
>>>>>> (among other things, optimizations based on data), and port the performance 
>>>>>> optimizations that were done in the current runner.
>>>>>> Best
>>>>>> Etienne
>>>>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>>>>>>> +1 for merging : )
>>>>>>> 
>>>>>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>> Sounds like a good plan to me. 
>>>>>>> 
>>>>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <echauc...@apache.org> wrote:
>>>>>>> Comments inline
>>>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>>>>> +1
>>>>>>>> 
>>>>>>>> The earlier we get to master the better, to encourage not only code
>>>>>>>> contributions but, just as important, early user feedback.
>>>>>>>> 
>>>>>>>>> Question is: do we keep the "old" spark runner for a while or not (or 
>>>>>>>>> just keep on previous version/tag on git) ?
>>>>>>>> It is still too early to even start discussing when to remove the
>>>>>>>> classical runner given that the new runner is still a WIP. However the
>>>>>>>> overall goal is that this runner becomes the de-facto one once the VR
>>>>>>>> tests and the performance become at least equal to the classical
>>>>>>>> runner, in the meantime the best for users is that they co-exist,
>>>>>>>> let’s not forget that the other runner has been already battle tested
>>>>>>>> for more than 3 years and has had lots of improvements in the last
>>>>>>>> year.
>>>>>>> +1 on what Ismael says: no removal anytime soon. 
>>>>>>> The plan I had in mind at first (which I showed at ApacheCon) was 
>>>>>>> this, but I'm proposing to move the first gray label to before the red 
>>>>>>> box. 
>>>>>>> <beogijnhpieapoll.png>
>>>>>>> 
>>>>>>> 
>>>>>>>>> I don't think the number of commits should be an issue--we shouldn't
>>>>>>>>> just squash years worth of history away. (OTOH, if this is a case of
>>>>>>>>> this branch containing lots of little, irrelevant commits that would
>>>>>>>>> have normally been squashed away in the normal review process we do
>>>>>>>>> for the main branch, then, yes, some cleanup could be nice.)
>>>>>>>> About the commits we should encourage a clear history but we have also
>>>>>>>> to remove useless commits that are still present in the branch,
>>>>>>>> commits of the “Fix errorprone” / “Cleaning” kind and even commits
>>>>>>>> that make a better narrative sense together should be probably
>>>>>>>> squashed, because they do not bring much to the history. It is not
>>>>>>>> about more or less commits it is about its relevance as Robert
>>>>>>>> mentions.
>>>>>>>> 
>>>>>>>>> I think our experiences with things that go to master early have been 
>>>>>>>>> very good. So I am in favor ASAP. We can exclude it from releases 
>>>>>>>>> easily until it is ready for end users.
>>>>>>>>> I have the same question as Robert - how much is modifications and 
>>>>>>>>> how much is new? I notice it is in a subdirectory of the 
>>>>>>>>> beam-runners-spark module.
>>>>>>>> In its current form we cannot exclude it but this relates to the other
>>>>>>>> question, so better to explain a bit of history: The new runner used
>>>>>>>> to live in its own module and subdirectory because it is a full blank
>>>>>>>> page rewrite and the decision was not to use any of the classical
>>>>>>>> runner classes to not be constrained by its evolution.
>>>>>>>> 
>>>>>>>> However the reason to put it back in the same module as a subdirectory
>>>>>>>> was to encourage early use, in more detail: The way you deploy spark
>>>>>>>> jobs today is usually by packaging and staging an uber jar (~200MB of
>>>>>>>> pure dependency joy) that contains the user pipeline classes, the
>>>>>>>> spark runner module and its dependencies. If we have two spark runners
>>>>>>>> in separate modules the user would need to repackage and redeploy
>>>>>>>> their pipelines every time they want to switch from the classical
>>>>>>>> Spark runner to the structured streaming runner which is painful and
>>>>>>>> time and space consuming compared with the one module approach where
>>>>>>>> they just change the name of the runner class and that’s it. The idea
>>>>>>>> here is to make it easy for users to test the new runner, but at the same
>>>>>>>> time to make it easy to go back to the classical runner in case of any
>>>>>>>> issue.
>>>>>>>> 
>>>>>>>> Ismaël
>>>>>>>> 
>>>>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>> +1
>>>>>>>>> 
>>>>>>>>> I think our experiences with things that go to master early have been 
>>>>>>>>> very good. So I am in favor ASAP. We can exclude it from releases 
>>>>>>>>> easily until it is ready for end users.
>>>>>>>>> 
>>>>>>>>> I have the same question as Robert - how much is modifications and 
>>>>>>>>> how much is new? I notice it is in a subdirectory of the 
>>>>>>>>> beam-runners-spark module.
>>>>>>>>> 
>>>>>>>>> I did not see any major changes to dependencies but I will also ask 
>>>>>>>>> if it has major version differences so that you might want a separate 
>>>>>>>>> artifact?
>>>>>>>>> 
>>>>>>>>> Kenn
>>>>>>>>> 
>>>>>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <echauc...@apache.org> wrote:
>>>>>>>>>>> Hi guys,
>>>>>>>>>>> 
>>>>>>>>>>> You probably know that for several months there has been work on
>>>>>>>>>>> developing a new Spark runner based on the Spark Structured Streaming
>>>>>>>>>>> framework. This work is located in a feature branch here:
>>>>>>>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>>>>>> 
>>>>>>>>>>> To attract more contributors and get some user feedback, we think 
>>>>>>>>>>> it is
>>>>>>>>>>> time to merge it to master. Before doing so, some steps need to be 
>>>>>>>>>>> achieved:
>>>>>>>>>>> 
>>>>>>>>>>> - finish the work on Spark Encoders (which allow calling Beam coders)
>>>>>>>>>>> because, right now, the runner is in an unstable state (some
>>>>>>>>>>> transforms use the new way of doing ser/de and some use the old one,
>>>>>>>>>>> making a pipeline inconsistent with respect to serialization)
>>>>>>>>>>> 
>>>>>>>>>>> - clean history: The history contains commits from November 2018, so
>>>>>>>>>>> there is a good amount of work, thus a substantial number of commits.
>>>>>>>>>>> They were already squashed but not from September 2019
>>>>>>>>>> I don't think the number of commits should be an issue--we shouldn't
>>>>>>>>>> just squash years worth of history away. (OTOH, if this is a case of
>>>>>>>>>> this branch containing lots of little, irrelevant commits that would
>>>>>>>>>> have normally been squashed away in the normal review process we do
>>>>>>>>>> for the main branch, then, yes, some cleanup could be nice.)
>>>>>>>>>> 
>>>>>>>>>>> Regarding status:
>>>>>>>>>>> 
>>>>>>>>>>> - the runner passes 89% of the validates runner tests in batch 
>>>>>>>>>>> mode. We
>>>>>>>>>>> hope to pass more with the new Encoders
>>>>>>>>>>> 
>>>>>>>>>>> - Streaming mode is barely started (waiting for the 
>>>>>>>>>>> multi-aggregations
>>>>>>>>>>> support in spark SS framework from the Spark community)
>>>>>>>>>>> 
>>>>>>>>>>> - Runner can execute Nexmark
>>>>>>>>>>> 
>>>>>>>>>>> - Some things are not wired up yet
>>>>>>>>>>> 
>>>>>>>>>>>      - Beam Schemas not wired with Spark Schemas
>>>>>>>>>>> 
>>>>>>>>>>>      - Optional features of the model not implemented:  state api, 
>>>>>>>>>>> timer
>>>>>>>>>>> api, splittable doFn api, …
>>>>>>>>>>> 
>>>>>>>>>>> WDYT, can we merge it to master once the 2 steps are done ?
>>>>>>>>>> I think that as long as it sits parallel to the existing runner, and
>>>>>>>>>> is clearly marked with its status, it makes sense to me. How many
>>>>>>>>>> changes does it make to the existing codebase (as opposed to adding new
>>>>>>>>>> code)?
>>> 
>
