Re: [DISCUSSION] using NexMark for Beam

Kenneth Knowles Wed, 20 Sep 2017 20:42:07 -0700

IIRC it is also exciting in that the final window is smaller than the union
of its component windows. I think we have a JIRA open to decide if that is
even allowed.
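
For illustration, here is a minimal sketch of the difference in plain Java. It is
not Beam code and not the actual WinningBids implementation: Interval,
WindowMergeSketch, mergeOverlapping and mergeBidsIntoAuction are hypothetical
names, and the "auction" merge rule is an assumption based on the description
above (bid windows are absorbed into the auction window, which keeps its own
bounds). The first method is the kind of simple merge a new query could use; the
second shows how a merged window can end up smaller than the union of its inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an interval window: half-open [start, end).
final class Interval {
  final long start;
  final long end;

  Interval(long start, long end) {
    this.start = start;
    this.end = end;
  }

  boolean intersects(Interval other) {
    return start < other.end && other.start < end;
  }

  // Smallest interval covering both this and other.
  Interval span(Interval other) {
    return new Interval(Math.min(start, other.start), Math.max(end, other.end));
  }

  @Override
  public String toString() {
    return "[" + start + ", " + end + ")";
  }
}

final class WindowMergeSketch {
  // Simple merge: sort by start, then fold each overlapping run into its span.
  static List<Interval> mergeOverlapping(List<Interval> windows) {
    List<Interval> sorted = new ArrayList<>(windows);
    sorted.sort(Comparator.comparingLong((Interval w) -> w.start));
    List<Interval> merged = new ArrayList<>();
    Interval current = null;
    for (Interval w : sorted) {
      if (current == null) {
        current = w;
      } else if (current.intersects(w)) {
        current = current.span(w);
      } else {
        merged.add(current);
        current = w;
      }
    }
    if (current != null) {
      merged.add(current);
    }
    return merged;
  }

  // Assumed auction-style rule: overlapping bid windows are absorbed into the
  // auction window, and the result keeps the auction's bounds unchanged.
  static Interval mergeBidsIntoAuction(Interval auction, List<Interval> bids) {
    for (Interval bid : bids) {
      if (!auction.intersects(bid)) {
        throw new IllegalArgumentException(bid + " does not overlap " + auction);
      }
    }
    return auction;
  }

  public static void main(String[] args) {
    List<Interval> sessions =
        Arrays.asList(new Interval(0, 10), new Interval(5, 15), new Interval(20, 30));
    System.out.println(mergeOverlapping(sessions)); // [[0, 15), [20, 30)]

    Interval auction = new Interval(0, 100);
    Interval lateBid = new Interval(90, 110);
    System.out.println(auction.span(lateBid)); // [0, 110): the union
    System.out.println(mergeBidsIntoAuction(auction, Arrays.asList(lateBid))); // [0, 100)
  }
}
```

In the second case the merged window [0, 100) does not cover the bid window
[90, 110), which is the situation the JIRA question is about.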


On Wed, Sep 20, 2017 at 12:15 AM, Etienne Chauchot <[email protected]>
wrote:

> Hi
> Indeed, the query builds a lot of maps, it is thus expensive. I totally
> agree with your point, adding a query with a simple merge such as the one
> that is done in the ValidatesRunner below is a good idea. I'll add 2
> tickets, one for the migration of winningBids to state API and one for the
> creation of query 13 that illustrates a simple custom window merge.
>
> Etienne
>
>
> On 19/09/2017 at 18:09, Reuven Lax wrote:
>
>> On Tue, Sep 19, 2017 at 7:29 AM, Etienne Chauchot <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm resuming my work on Nexmark a bit, starting to do some maintenance on
>>> the tickets
>>>
>>> @Reuven: I have some comments inline below.
>>>
>>> On 14/05/2017 at 14:29, Reuven Lax wrote:
>>>
>>>> Great to hear! A couple of comments:
>>>>
>>>> When Query 10 was written, the file-based sinks did not support unbounded
>>>> input. Now that FileBasedSink in Beam supports windowed output files, I
>>>> think we should just rip out the custom IO code in Query 10 and replace it
>>>> with AvroIO - this is closer to what real Beam users will do, and it will
>>>> also make it support HDFS.
>>>>
>>> +1: I updated this ticket
>>> https://issues.apache.org/jira/browse/BEAM-2856
>>>
>>>> Query 10 also tests some subtle semantics around late data - notably that
>>>> if an element from a source is not late, elements resulting from processing
>>>> that element are not late. Essentially this is a correctness test for
>>>> watermarks, and should apply to all runners IMO.
>>>>
>>> Yes I agree, but there are some ValidatesRunner tests around this, right?
>>> If not, we should create some IMHO.
>>>
>>>> WinningBids.java (used in Query6) uses a fairly awkward (and
>>>> computationally expensive) custom merging window function - largely because
>>>> Mark was trying to avoid using the state API as much as he could (at the
>>>> time there was no public state API). IMO we should rewrite WinningBids to
>>>> use state. This should result in both cleaner code and a more efficient
>>>> query.
>>>>
>>> I agree that this query is a bit awkward. But it is the only one in the
>>> query set that illustrates custom window merging. There is already query 3
>>> that illustrates the use of state API (I migrated it to use state API after
>>> Mark released it). Even if there is now a ValidatesRunner test on custom
>>> window merging ([1]), I believe it could be useful to keep WinningBids as
>>> it is to serve as a benchmark of custom window merging in the runners.
>>>
>> My memory was that this was an awkward use of merging windows, as the
>> merge function was very expensive (building maps, etc.). As such, the cost
>> of the WinningBid merge function dominated, so it really just served as a
>> benchmark of how often windows were merged (i.e. merge is called very
>> often in streaming runners and much less often in batch runners).
>>
>> I wonder if we're best off introducing a new query that more explicitly
>> tests merging windows, with a more-reasonable merging window fn.
>>
>>> WDYT?
>>>
>>> [1] https://github.com/apache/beam/blob/c65aca07faf7b8c4dabe6cae7b5b52286d2b25b1/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowTest.java#L591
>>>
>>> Best,
>>> Etienne
>>>
>>>
>>>> Reuven
>>>>
>>>> On Sun, May 14, 2017 at 3:09 AM, Ismaël Mejía <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Thanks Etienne for opening the Pull Request and starting the
>>>>> discussion for the review process. I also want to thank publicly all
>>>>> the people that somehow contributed to this:
>>>>>
>>>>> - Mark Shields and the original people at Google who worked on Nexmark
>>>>> for contributing this in the first place.
>>>>> - Etienne because his work and constant help really improved the
>>>>> status of the queries, your work on query 3 was really nice, and also
>>>>> for the hard work of helping me test all the queries with all the
>>>>> runners and ping the runner maintainers for fixes.
>>>>> - Aviem/Amit for all the help to solve the issues with the spark
>>>>> runner whose support is now almost feature complete (even in
>>>>> streaming!).
>>>>> - Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for
>>>>> quickly adding the support for metrics.
>>>>> - Thomas Groh and Kenneth for fixing some needed parts in Direct
>>>>> Runner + answering our questions on the State/Timer API.
>>>>> - JB and the talend crew for all the feedback and help to run in our
>>>>> benchmark cluster.
>>>>> - And of course the rest of the Beam community :)
>>>>>
>>>>> Some comments:
>>>>>
>>>>> - This does not need to have a feature branch since we have been
>>>>> working on this in a fork for months now and with the stable API we
>>>>> can simply do a traditional PR review. Of course the review is a bit
>>>>> bigger so we expect it to take some time, but I hope we can get some
>>>>> quick progress once FSR is out.
>>>>>
>>>>> - We need a hand from the Google guys. For the moment we have tested
>>>>> all the queries in all the runners, but not in the Dataflow runner
>>>>> because we don't have access to it (well we have, but not with the
>>>>> freedom that you guys have to run the benchmark at will). If we can
>>>>> get some access that would be nice, or if this is not possible, it
>>>>> would be nice if some of you guys could help us test/report any given
>>>>> issue on this runner.
>>>>>
>>>>> - We also have to decide the future of some features, this is probably
>>>>> independent of the current PR and part of the evolution of Nexmark on
>>>>> Beam:
>>>>>
>>>>> -- There are still some pending things that can be improved even after
>>>>> the review once in master, e.g. we have for the moment only synthetic
>>>>> sources but the original version took also data from Pubsub, we have
>>>>> to define the correct scope for this and given the case also add other
>>>>> sources, e.g. Kafka, HDFS.
>>>>>
>>>>> -- Query 10 is really oriented to testing Google Runner/IOs specific
>>>>> features, so we have to decide what to do with this one, maybe
>>>>> mirroring it with Kafka/HDFS to have something equivalent in the
>>>>> Apache world.
>>>>>
>>>>> This is all for now, I am really glad that this is finally happening
>>>>> and I hope this soon gets merged.
>>>>>
>>>>> Ismaël
>>>>>
>>>>> On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I think these are valuable enough that we should get them into
>>>>>> apache/master
>>>>>>
>>>>>> On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> PR or even a feature branch could work. Up to you.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>>
>>>>>>> On 05/12/2017 10:55 AM, Etienne Chauchot wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I wanted to let you know that I have just submitted a PR around NexMark.
>>>>>>>> This is a port of the NexMark queries to Beam, to be used as integration
>>>>>>>> tests. This can also be used for A-B testing (non-regression or
>>>>>>>> performance comparison between 2 versions of the same engine or of the
>>>>>>>> same runner).
>>>>>>>>
>>>>>>>> This is a continuation of the previous PR (#99) from Mark Shields.
>>>>>>>> The code has changed quite a bit: some queries have changed to use new
>>>>>>>> Beam APIs and there were some big refactorings. More importantly, we can
>>>>>>>> now run all the queries in all the runners.
>>>>>>>>
>>>>>>>> Nevertheless, there are still some open issues in Nexmark
>>>>>>>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see issue
>>>>>>>> links in https://issues.apache.org/jira/browse/BEAM-160)
>>>>>>>>
>>>>>>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at
>>>>>>>> ApacheCon. The PR is not perfect but it is in good enough shape to share.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Etienne
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 22/03/2017 at 04:51, Kenneth Knowles wrote:
>>>>>>>>
>>>>>>>>> This is great! Having a variety of realistic-ish pipelines running on
>>>>>>>>> all runners complements the validation suite and IO IT work.
>>>>>>>>>
>>>>>>>>> If I recall, some of these involve heavy and esoteric uses of state, so
>>>>>>>>> definitely give me a ping if you hit any trouble.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> Ismael and I are working on upgrading the Nexmark implementation for
>>>>>>>>>> Beam.
>>>>>>>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
>>>>>>>>>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
>>>>>>>>>> for the original PR.
>>>>>>>>>>
>>>>>>>>>> The PR contains queries that have a wide coverage of the Beam model and
>>>>>>>>>> that represent a realistic end user use case (some come from client
>>>>>>>>>> experience on Google Cloud Dataflow).
>>>>>>>>>>
>>>>>>>>>> So far, we have upgraded the implementation to the latest Beam snapshot.
>>>>>>>>>> And we are able to execute a good subset of the queries in the different
>>>>>>>>>> runners. We upgraded the nexmark drivers to do so: direct driver
>>>>>>>>>> (upgraded from inProcessDriver) and flink driver and we added a new one
>>>>>>>>>> for spark.
>>>>>>>>>>
>>>>>>>>>> There is still a good amount of work to do and we would like to know if
>>>>>>>>>> you think that this contribution can have its place into Beam
>>>>>>>>>> eventually.
>>>>>>>>>>
>>>>>>>>>> The interests of having Nexmark on Beam that we have seen so far are:
>>>>>>>>>>
>>>>>>>>>> - Rich batch/streaming test
>>>>>>>>>>
>>>>>>>>>> - A-B testing of runners or runtimes (non-regression, performance
>>>>>>>>>> comparison between versions ...)
>>>>>>>>>>
>>>>>>>>>> - Integration testing (sdk/runners, runner/runtime, ...)
>>>>>>>>>>
>>>>>>>>>> - Validate beam capability matrix
>>>>>>>>>>
>>>>>>>>>> - It can be used as part of the ongoing PerfKit work (if there is any
>>>>>>>>>> interest).
>>>>>>>>>>
>>>>>>>>>> As a final note, we are tracking the issues in the same repo. If someone
>>>>>>>>>> is interested in contributing, or has more ideas, you are welcome :)
>>>>>>>>>>
>>>>>>>>>> Etienne
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> [email protected]
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com
>>>>>>>
>>>>>>>
>>>>>>>
>
