Re: [DISCUSSION] using NexMark for Beam

Jean-Baptiste Onofré Tue, 19 Sep 2017 09:39:07 -0700

I think adding a new query for window merging makes sense and would be easier 
to follow and maintain.


Regards
JB

On Sep 19, 2017, 18:09, at 18:09, Reuven Lax <re...@google.com.INVALID> wrote:
>On Tue, Sep 19, 2017 at 7:29 AM, Etienne Chauchot <echauc...@gmail.com>
>wrote:
>
>> Hi all,
>>
>> I'm resuming my work on Nexmark a bit, starting to do some
>maintenance on
>> the tickets
>>
>> @Reuven: I have some comments inline below.
>>
>> Le 14/05/2017 à 14:29, Reuven Lax a écrit :
>>
>>> Great to hear! A couple of comments:
>>>
>>> When Query 10 was written, the file-based sinks did not supported
>>> unbounded
>>> input. Now that in Beam FileBasedSink supports windowed output
>files, I
>>> think we should just rip out the custom IO code in Query 10 and
>replace it
>>> with AvroIO  - this is closer to what real Beam users will do, and
>it will
>>> also make it support HDFS.
>>>
>> +1: I updated this ticket
>https://issues.apache.org/jira/browse/BEAM-2856
>>
>>>
>>> Query 10 also tests some subtle semantics around late data - notably
>that
>>> if an element from a source is not late, elements resulting from
>>> processing
>>> that element are not late. Essentially this is a correctness test
>for
>>> watermarks, and should apply to all runners IMO.
>>>
>> Yes I agree, but there is some ValidatesRunner tests around this,
>right?
>> If not, we should create some IMHO.
>>
>>>
>>> WinningBids.java (used in Query6) uses a fairly awkward (and
>>> computationally expensive) custom merging window function - largely
>>> because
>>> Mark was trying to avoid using the state API as much as he could (at
>the
>>> time there was no public state API). IMO we should rewrite
>WinningBids to
>>> use state. This should result in both cleaner code, and more
>efficient
>>> query.
>>>
>> I agree that this query is a bit awkward. But it is the only one in
>the
>> query set that illustrates custom window merging. There is already
>query 3
>> that illustrates the use of state API (I migrated it to use state API
>after
>> Mark released it). Even if there is now a ValidatesRunner test on
>custom
>> window merging ([1]), I believe it could be useful to keep
>WinningBids as
>> it is to serve as benchmark of custom window merging in the runners.
>>
>
>My memory wast that this was an awkward use of merging windows, as the
>merge function was very expensive (building maps, etc.). As such, the
>cost
>of the WinningBid merge function dominated, so it really just served as
>a
>benchmark of how  often windows were merged (i.e. merge is called very
>often in streaming runners an much less often in batch runners).
>
>I wonder if we're best off introducing a new query that more explicitly
>tests merging windows, with a more-reasonable merging window fn.
>
>
>> WDYT?
>>
>> [1] https://github.com/apache/beam/blob/c65aca07faf7b8c4dabe6cae
>> 7b5b52286d2b25b1/sdks/java/core/src/test/java/org/apache/
>> beam/sdk/transforms/windowing/WindowTest.java#L591
>>
>> Best,
>> Etienne
>>
>>
>>> Reuven
>>>
>>> On Sun, May 14, 2017 at 3:09 AM, Ismaël Mejía <ieme...@gmail.com>
>wrote:
>>>
>>> Hello,
>>>>
>>>> Thanks Etienne for opening the Pull Request and starting the
>>>> discussion for the review process. I also want to thank publicly
>all
>>>> the people that somehow contributed to this:
>>>>
>>>> - Mark Shields and the original people at google who worked at
>nexmark
>>>> for contributing this in the first place.
>>>> - Etienne because his work and constant help really improved the
>>>> status of the queries, your work on query 3 was really nice, and
>also
>>>> for the hard work of helping me test all the queries with all the
>>>> runners and ping the runner maintainers for fixes.
>>>> - Aviem/Amit for all the help to solve the issues with the spark
>>>> runner whose support is now almost feature complete (even in
>>>> streaming!).
>>>> - Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for
>>>> quickly adding the support for metrics.
>>>> - Thomas Groh and Kenneth for fixing some needed parts in Direct
>>>> Runner + answering our questions on the State/Timer API.
>>>> - JB and the talend crew for all the feedback and help to run in
>our
>>>> benchmark cluster.
>>>> - And of course the rest of the Beam community :)
>>>>
>>>> Some comments:
>>>>
>>>> - This does not need to have a feature branch since we have been
>>>> working on this in a fork for months now and with the stable API we
>>>> can simply do a traditional PR review. Of course the review is a
>bit
>>>> bigger so we expect it to take some time, but I hope we can get
>some
>>>> quick progress once FSR is out.
>>>>
>>>> - We need a hand from the google guys, for the moment we have
>tested
>>>> all the queries in all the runners, but not in the Dataflow runner
>>>> because we don't have access to it (well we have but not with the
>>>> freedom that you guys have to run the benchmark at will), so if we
>can
>>>> get some access that would be nice or if this is not possible, it
>>>> would be nice if some of you guys help us test/report any given
>issue
>>>> on this runner,
>>>>
>>>> - We also have to decide the future of some features, this is
>probably
>>>> independent of the current PR and part of the evolution of Nexmark
>on
>>>> Beam:
>>>>
>>>> -- There are still some pending things that can be improved even
>after
>>>> the review once in master, e.g. we have for the moment only
>synthetic
>>>> sources but the original version took also data from Pubsub, we
>have
>>>> to define the correct scope for this and given the case also add
>other
>>>> sources, e.g. Kafka, HDFS.
>>>>
>>>> -- Query 10 is really oriented to testing Google Runner/IOs
>specific
>>>> features, so we have to decide what to do with this one, maybe
>>>> mirroring it with Kafka/HDFS to have something equivalent in the
>>>> Apache world.
>>>>
>>>> This is all for now, I am really glad that this is finally
>happening
>>>> and I hope this soon gets merged.
>>>>
>>>> Ismaël
>>>>
>>>> On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik
><lc...@google.com.invalid>
>>>> wrote:
>>>>
>>>>> I think these are valuable enough that we should get them into
>>>>>
>>>> apache/master
>>>>
>>>>> On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré
><j...@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>>
>>>>>> PR or even a feature branch could work. Up to you.
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>>
>>>>>> On 05/12/2017 10:55 AM, Etienne Chauchot wrote:
>>>>>>
>>>>>> Hi guys,
>>>>>>>
>>>>>>> I wanted to let you know that I have just submitted a PR around
>>>>>>>
>>>>>> NexMark.
>>>>
>>>>> This is
>>>>>>> a port of the NexMark queries to Beam, to be used as integration
>>>>>>> tests.
>>>>>>> This can also be used as A-B testing (no-regression or
>performance
>>>>>>> comparison
>>>>>>> between 2 versions of the same engine or of the same runner)
>>>>>>>
>>>>>>> This a continuation of the previous PR (#99) from Mark Shields.
>>>>>>> The code has changed quite a bit: some queries have changed to
>use new
>>>>>>> Beam APIs
>>>>>>> and there where some big refactorings. More important, we can
>now run
>>>>>>>
>>>>>> all
>>>>
>>>>> the
>>>>>>> queries in all the runners.
>>>>>>>
>>>>>>> Nevertheless, there are still some open issues in Nexmark
>>>>>>> (https://github.com/iemejia/beam/issues) and in Beam upstream
>(see
>>>>>>>
>>>>>> issue
>>>>
>>>>> links
>>>>>>> in https://issues.apache.org/jira/browse/BEAM-160)
>>>>>>>
>>>>>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk
>at
>>>>>>> the
>>>>>>> ApacheCon. The PR is not perfect but it is in a good shape to
>share
>>>>>>> it.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Etienne
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Le 22/03/2017 à 04:51, Kenneth Knowles a écrit :
>>>>>>>
>>>>>>> This is great! Having a variety of realistic-ish pipelines
>running on
>>>>>>>>
>>>>>>> all
>>>>
>>>>> runners complements the validation suite and IO IT work.
>>>>>>>>
>>>>>>>> If I recall, some of these involve heavy and esoteric uses of
>state,
>>>>>>>>
>>>>>>> so
>>>>
>>>>> definitely give me a ping if you hit any trouble.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <
>>>>>>>>
>>>>>>> echauc...@gmail.com>
>>>>
>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>>> Ismael and I are working on upgrading the Nexmark
>implementation for
>>>>>>>>> Beam.
>>>>>>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-160. We are
>continuing
>>>>>>>>>
>>>>>>>> the
>>>>
>>>>> work done by Mark Shields. See https://github.com/apache/
>>>>>>>>>
>>>>>>>> beam/pull/366
>>>>
>>>>> for the original PR.
>>>>>>>>>
>>>>>>>>> The PR contains queries that have a wide coverage of the Beam
>model
>>>>>>>>>
>>>>>>>> and
>>>>
>>>>> that represent a realistic end user use case (some come from
>client
>>>>>>>>> experience on Google Cloud Dataflow).
>>>>>>>>>
>>>>>>>>> So far, we have upgraded the implementation to the latest Beam
>>>>>>>>>
>>>>>>>> snapshot.
>>>>
>>>>> And we are able to execute a good subset of the queries in the
>>>>>>>>>
>>>>>>>> different
>>>>
>>>>> runners. We upgraded the nexmark drivers to do so: direct driver
>>>>>>>>> (upgraded
>>>>>>>>> from inProcessDriver) and flink driver and we added a new one
>for
>>>>>>>>>
>>>>>>>> spark.
>>>>
>>>>> There is still a good amount of work to do and we would like to
>know
>>>>>>>>>
>>>>>>>> if
>>>>
>>>>> you think that this contribution can have its place into Beam
>>>>>>>>> eventually.
>>>>>>>>>
>>>>>>>>> The interests of having Nexmark on Beam that we have seen so
>far
>>>>>>>>> are:
>>>>>>>>>
>>>>>>>>> - Rich batch/streaming test
>>>>>>>>>
>>>>>>>>> - A-B testing of runners or runtimes (non-regression,
>performance
>>>>>>>>> comparison between versions ...)
>>>>>>>>>
>>>>>>>>> - Integration testing (sdk/runners, runner/runtime, ...)
>>>>>>>>>
>>>>>>>>> - Validate beam capability matrix
>>>>>>>>>
>>>>>>>>> - It can be used as part of the ongoing PerfKit work (if there
>is
>>>>>>>>> any
>>>>>>>>> interest).
>>>>>>>>>
>>>>>>>>> As a final note, we are tracking the issues in the same repo.
>If
>>>>>>>>>
>>>>>>>> someone
>>>>
>>>>> is interested in contributing, or have more ideas, you are welcome
>:)
>>>>>>>>>
>>>>>>>>> Etienne
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>> Jean-Baptiste Onofré
>>>>>> jbono...@apache.org
>>>>>> http://blog.nanthrax.net
>>>>>> Talend - http://www.talend.com
>>>>>>
>>>>>>
>>

Re: [DISCUSSION] using NexMark for Beam

Reply via email to