Let me share some of my thoughts on this.

>> - shall we filter out the package name from the release?

Until the new runner is ready to be used in production (or, at least, to be used for beta testing, in which case users should be clearly warned about that), I believe we need to filter out its classes from the published jar to avoid confusion.

>> - should we release 2 jars: one for the old and one for the new?
>> - should we release 3 jars: one for the old, one for the new and one for both?

Once the new runner is released, I think we need to provide only one single jar and allow users to switch between the different Spark runners with a CLI option.

>> - should we create a special entry in the capability matrix?
Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".

> On 28 Oct 2019, at 10:27, Etienne Chauchot <[email protected]> wrote:
>
> Hi guys,
>
> Any opinions on point 2, the communication to users?
>
> Etienne
>
> On 24/10/2019 15:44, Etienne Chauchot wrote:
>> Hi guys,
>>
>> I'm glad to announce that the PR for the merge to master of the new runner based on the Spark Structured Streaming framework is submitted:
>>
>> https://github.com/apache/beam/pull/9866
>>
>> 1. Regarding the status of the runner:
>>
>> - The runner passes 93% of the validates runner tests in batch mode.
>> - Streaming mode is barely started (waiting for the multi-aggregations support in the Spark Structured Streaming framework from the Spark community).
>> - The runner can execute Nexmark.
>> - Some things are not wired up yet:
>>   - Beam schemas are not wired with Spark schemas.
>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>>
>> 2. Regarding the communication to users:
>>
>> - For the reasons explained by Ismaël, the runner is in the same module as the "older" one, but it is in a different sub-package and both runners share the same build.
>> - How should we communicate to users:
>>   - shall we filter out the package name from the release?
>>   - should we release 2 jars: one for the old and one for the new?
>>   - should we release 3 jars: one for the old, one for the new and one for both?
>>   - should we create a special entry in the capability matrix?
>>
>> WDYT?
>>
>> Best
>> Etienne
>>
>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>>> +1 to merge.
>>>
>>> It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and easier to get feedback upon.
>>>
>>> --Mikhail
>>>
>>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <[email protected]> wrote:
>>> Hi guys,
>>>
>>> The new Spark runner now supports Beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.
>>>
>>> Next steps: support schemas and thus better leverage the Catalyst optimizer (among other things, optimizations based on data), and port the performance optimizations that were done in the current runner.
>>>
>>> Best
>>> Etienne
>>>
>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>>>> +1 for merging :)
>>>>
>>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <[email protected]> wrote:
>>>> Sounds like a good plan to me.
>>>>
>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <[email protected]> wrote:
>>>> Comments inline
>>>>
>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>> +1
>>>>>
>>>>> The earlier we get to master the better, to encourage not only code contributions but, just as important, to get early user feedback.
>>>>>
>>>>>> Question is: do we keep the "old" Spark runner for a while or not (or just keep the previous version/tag on git)?
>>>>>
>>>>> It is still too early to even start discussing when to remove the classical runner, given that the new runner is still a WIP. However, the overall goal is that this runner becomes the de-facto one once the VR tests and the performance become at least equal to the classical runner's. In the meantime, the best for users is that they co-exist; let's not forget that the other runner has already been battle tested for more than 3 years and has had lots of improvements in the last year.
>>>>
>>>> +1 on what Ismaël says: no removal soon.
>>>> The plan I had in mind at first (that I showed at ApacheCon) was this, but I'm proposing moving the first gray label to before the red box.
>>>> [image attachment: beogijnhpieapoll.png]
>>>>
>>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
>>>>>
>>>>> About the commits: we should encourage a clear history, but we also have to remove useless commits that are still present in the branch, commits of the "Fix errorprone" / "Cleaning" kind; even commits that make better narrative sense together should probably be squashed, because they do not bring much to the history. It is not about more or fewer commits, it is about their relevance, as Robert mentions.
>>>>>
>>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>
>>>>> In its current form we cannot exclude it, but this relates to the other question, so better to explain a bit of history: the new runner used to live in its own module and subdirectory because it is a full blank-page rewrite, and the decision was not to use any of the classical runner's classes, so as not to be constrained by its evolution.
>>>>>
>>>>> However, the reason to put it back in the same module, as a subdirectory, was to encourage early use. In more detail: the way you deploy Spark jobs today is usually by packaging and staging an uber jar (~200MB of pure dependency joy) that contains the user pipeline classes, the Spark runner module and its dependencies.
>>>>> If we have two Spark runners in separate modules, the user would need to repackage and redeploy their pipelines every time they want to switch from the classical Spark runner to the Structured Streaming runner, which is painful and time- and space-consuming compared with the one-module approach, where they just change the name of the runner class and that's it. The idea here is to make it easy for users to test the new runner, but at the same time to make it easy to come back to the classical runner in case of any issue.
>>>>>
>>>>> Ismaël
>>>>>
>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <[email protected]> wrote:
>>>>>> +1
>>>>>>
>>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>>
>>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>>
>>>>>> I did not see any major changes to dependencies, but I will also ask if it has major version differences, such that you might want a separate artifact?
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]> wrote:
>>>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]> wrote:
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> You probably know that there has been, for several months, work developing a new Spark runner based on the Spark Structured Streaming framework.
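The one-module switch Ismaël describes above can be illustrated with a hypothetical deployment. In the sketch below, the jar name and main class are placeholders, and the option names assume Beam's `--runner` convention for selecting a runner at launch time:

```shell
# Same staged uber jar for both runs; only the pipeline option changes.
spark-submit --class com.example.MyBeamPipeline target/pipeline-bundled.jar \
  --runner=SparkRunner                      # classical RDD-based runner

# Trying the new runner: no repackaging, no redeploying.
spark-submit --class com.example.MyBeamPipeline target/pipeline-bundled.jar \
  --runner=SparkStructuredStreamingRunner
```

If anything goes wrong with the new runner, switching back is the same one-option change.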
>>>>>>>> This work is located in a feature branch here:
>>>>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>>>
>>>>>>>> To attract more contributors and get some user feedback, we think it is time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>>>>
>>>>>>>> - finish the work on Spark Encoders (which allow calling Beam coders), because right now the runner is in an unstable state (some transforms use the new way of doing ser/de and some use the old one, making a pipeline incoherent with regard to serialization)
>>>>>>>>
>>>>>>>> - clean history: the history contains commits since November 2018, so there is a good amount of work, and thus a consequent number of commits. They were already squashed, but not since September 2019.
>>>>>>>
>>>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
>>>>>>>
>>>>>>>> Regarding status:
>>>>>>>>
>>>>>>>> - the runner passes 89% of the validates runner tests in batch mode. We hope to pass more with the new Encoders.
>>>>>>>> - Streaming mode is barely started (waiting for the multi-aggregations support in the Spark Structured Streaming framework from the Spark community).
>>>>>>>> - The runner can execute Nexmark.
>>>>>>>> - Some things are not wired up yet:
>>>>>>>>   - Beam schemas are not wired with Spark schemas.
>>>>>>>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>>>>>>>>
>>>>>>>> WDYT, can we merge it to master once the 2 steps are done?
>>>>>>> I think that as long as it sits parallel to the existing runner, and is clearly marked with its status, it makes sense to me. How many changes does it make to the existing codebase (as opposed to adding new code)?
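On the suggestion earlier in the thread to filter the new runner's classes out of the published jar until it is production-ready: since the new runner lives in its own sub-package of the shared module, a build-time exclude would be one way to do it. A rough Gradle sketch, assuming the runner's package is `org.apache.beam.runners.spark.structuredstreaming` (the actual path may differ):

```groovy
// Hypothetical fragment: drop the experimental runner's classes from the
// published jar so users cannot depend on it by accident before it is ready.
jar {
    exclude 'org/apache/beam/runners/spark/structuredstreaming/**'
}
```

Removing the exclude later would make the runner visible again without any module or artifact reshuffling.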
