I think adding a new query for window merging makes sense and would be easier to follow and maintain.
Regards JB On Sep 19, 2017, 18:09, at 18:09, Reuven Lax <re...@google.com.INVALID> wrote: >On Tue, Sep 19, 2017 at 7:29 AM, Etienne Chauchot <echauc...@gmail.com> >wrote: > >> Hi all, >> >> I'm resuming my work on Nexmark a bit, starting to do some >maintenance on >> the tickets >> >> @Reuven: I have some comments inline below. >> >> Le 14/05/2017 à 14:29, Reuven Lax a écrit : >> >>> Great to hear! A couple of comments: >>> >>> When Query 10 was written, the file-based sinks did not supported >>> unbounded >>> input. Now that in Beam FileBasedSink supports windowed output >files, I >>> think we should just rip out the custom IO code in Query 10 and >replace it >>> with AvroIO - this is closer to what real Beam users will do, and >it will >>> also make it support HDFS. >>> >> +1: I updated this ticket >https://issues.apache.org/jira/browse/BEAM-2856 >> >>> >>> Query 10 also tests some subtle semantics around late data - notably >that >>> if an element from a source is not late, elements resulting from >>> processing >>> that element are not late. Essentially this is a correctness test >for >>> watermarks, and should apply to all runners IMO. >>> >> Yes I agree, but there is some ValidatesRunner tests around this, >right? >> If not, we should create some IMHO. >> >>> >>> WinningBids.java (used in Query6) uses a fairly awkward (and >>> computationally expensive) custom merging window function - largely >>> because >>> Mark was trying to avoid using the state API as much as he could (at >the >>> time there was no public state API). IMO we should rewrite >WinningBids to >>> use state. This should result in both cleaner code, and more >efficient >>> query. >>> >> I agree that this query is a bit awkward. But it is the only one in >the >> query set that illustrates custom window merging. There is already >query 3 >> that illustrates the use of state API (I migrated it to use state API >after >> Mark released it). Even if there is now a ValidatesRunner test on >custom >> window merging ([1]), I believe it could be useful to keep >WinningBids as >> it is to serve as benchmark of custom window merging in the runners. >> > >My memory wast that this was an awkward use of merging windows, as the >merge function was very expensive (building maps, etc.). As such, the >cost >of the WinningBid merge function dominated, so it really just served as >a >benchmark of how often windows were merged (i.e. merge is called very >often in streaming runners an much less often in batch runners). > >I wonder if we're best off introducing a new query that more explicitly >tests merging windows, with a more-reasonable merging window fn. > > >> WDYT? >> >> [1] https://github.com/apache/beam/blob/c65aca07faf7b8c4dabe6cae >> 7b5b52286d2b25b1/sdks/java/core/src/test/java/org/apache/ >> beam/sdk/transforms/windowing/WindowTest.java#L591 >> >> Best, >> Etienne >> >> >>> Reuven >>> >>> On Sun, May 14, 2017 at 3:09 AM, Ismaël Mejía <ieme...@gmail.com> >wrote: >>> >>> Hello, >>>> >>>> Thanks Etienne for opening the Pull Request and starting the >>>> discussion for the review process. I also want to thank publicly >all >>>> the people that somehow contributed to this: >>>> >>>> - Mark Shields and the original people at google who worked at >nexmark >>>> for contributing this in the first place. >>>> - Etienne because his work and constant help really improved the >>>> status of the queries, your work on query 3 was really nice, and >also >>>> for the hard work of helping me test all the queries with all the >>>> runners and ping the runner maintainers for fixes. >>>> - Aviem/Amit for all the help to solve the issues with the spark >>>> runner whose support is now almost feature complete (even in >>>> streaming!). >>>> - Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for >>>> quickly adding the support for metrics. >>>> - Thomas Groh and Kenneth for fixing some needed parts in Direct >>>> Runner + answering our questions on the State/Timer API. >>>> - JB and the talend crew for all the feedback and help to run in >our >>>> benchmark cluster. >>>> - And of course the rest of the Beam community :) >>>> >>>> Some comments: >>>> >>>> - This does not need to have a feature branch since we have been >>>> working on this in a fork for months now and with the stable API we >>>> can simply do a traditional PR review. Of course the review is a >bit >>>> bigger so we expect it to take some time, but I hope we can get >some >>>> quick progress once FSR is out. >>>> >>>> - We need a hand from the google guys, for the moment we have >tested >>>> all the queries in all the runners, but not in the Dataflow runner >>>> because we don't have access to it (well we have but not with the >>>> freedom that you guys have to run the benchmark at will), so if we >can >>>> get some access that would be nice or if this is not possible, it >>>> would be nice if some of you guys help us test/report any given >issue >>>> on this runner, >>>> >>>> - We also have to decide the future of some features, this is >probably >>>> independent of the current PR and part of the evolution of Nexmark >on >>>> Beam: >>>> >>>> -- There are still some pending things that can be improved even >after >>>> the review once in master, e.g. we have for the moment only >synthetic >>>> sources but the original version took also data from Pubsub, we >have >>>> to define the correct scope for this and given the case also add >other >>>> sources, e.g. Kafka, HDFS. >>>> >>>> -- Query 10 is really oriented to testing Google Runner/IOs >specific >>>> features, so we have to decide what to do with this one, maybe >>>> mirroring it with Kafka/HDFS to have something equivalent in the >>>> Apache world. >>>> >>>> This is all for now, I am really glad that this is finally >happening >>>> and I hope this soon gets merged. >>>> >>>> Ismaël >>>> >>>> On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik ><lc...@google.com.invalid> >>>> wrote: >>>> >>>>> I think these are valuable enough that we should get them into >>>>> >>>> apache/master >>>> >>>>> On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré ><j...@nanthrax.net> >>>>> wrote: >>>>> >>>>> Hi, >>>>>> >>>>>> PR or even a feature branch could work. Up to you. >>>>>> >>>>>> Regards >>>>>> JB >>>>>> >>>>>> >>>>>> On 05/12/2017 10:55 AM, Etienne Chauchot wrote: >>>>>> >>>>>> Hi guys, >>>>>>> >>>>>>> I wanted to let you know that I have just submitted a PR around >>>>>>> >>>>>> NexMark. >>>> >>>>> This is >>>>>>> a port of the NexMark queries to Beam, to be used as integration >>>>>>> tests. >>>>>>> This can also be used as A-B testing (no-regression or >performance >>>>>>> comparison >>>>>>> between 2 versions of the same engine or of the same runner) >>>>>>> >>>>>>> This a continuation of the previous PR (#99) from Mark Shields. >>>>>>> The code has changed quite a bit: some queries have changed to >use new >>>>>>> Beam APIs >>>>>>> and there where some big refactorings. More important, we can >now run >>>>>>> >>>>>> all >>>> >>>>> the >>>>>>> queries in all the runners. >>>>>>> >>>>>>> Nevertheless, there are still some open issues in Nexmark >>>>>>> (https://github.com/iemejia/beam/issues) and in Beam upstream >(see >>>>>>> >>>>>> issue >>>> >>>>> links >>>>>>> in https://issues.apache.org/jira/browse/BEAM-160) >>>>>>> >>>>>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk >at >>>>>>> the >>>>>>> ApacheCon. The PR is not perfect but it is in a good shape to >share >>>>>>> it. >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Etienne >>>>>>> >>>>>>> >>>>>>> >>>>>>> Le 22/03/2017 à 04:51, Kenneth Knowles a écrit : >>>>>>> >>>>>>> This is great! Having a variety of realistic-ish pipelines >running on >>>>>>>> >>>>>>> all >>>> >>>>> runners complements the validation suite and IO IT work. >>>>>>>> >>>>>>>> If I recall, some of these involve heavy and esoteric uses of >state, >>>>>>>> >>>>>>> so >>>> >>>>> definitely give me a ping if you hit any trouble. >>>>>>>> >>>>>>>> Kenn >>>>>>>> >>>>>>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot < >>>>>>>> >>>>>>> echauc...@gmail.com> >>>> >>>>> wrote: >>>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>>> Ismael and I are working on upgrading the Nexmark >implementation for >>>>>>>>> Beam. >>>>>>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and >>>>>>>>> https://issues.apache.org/jira/browse/BEAM-160. We are >continuing >>>>>>>>> >>>>>>>> the >>>> >>>>> work done by Mark Shields. See https://github.com/apache/ >>>>>>>>> >>>>>>>> beam/pull/366 >>>> >>>>> for the original PR. >>>>>>>>> >>>>>>>>> The PR contains queries that have a wide coverage of the Beam >model >>>>>>>>> >>>>>>>> and >>>> >>>>> that represent a realistic end user use case (some come from >client >>>>>>>>> experience on Google Cloud Dataflow). >>>>>>>>> >>>>>>>>> So far, we have upgraded the implementation to the latest Beam >>>>>>>>> >>>>>>>> snapshot. >>>> >>>>> And we are able to execute a good subset of the queries in the >>>>>>>>> >>>>>>>> different >>>> >>>>> runners. We upgraded the nexmark drivers to do so: direct driver >>>>>>>>> (upgraded >>>>>>>>> from inProcessDriver) and flink driver and we added a new one >for >>>>>>>>> >>>>>>>> spark. >>>> >>>>> There is still a good amount of work to do and we would like to >know >>>>>>>>> >>>>>>>> if >>>> >>>>> you think that this contribution can have its place into Beam >>>>>>>>> eventually. >>>>>>>>> >>>>>>>>> The interests of having Nexmark on Beam that we have seen so >far >>>>>>>>> are: >>>>>>>>> >>>>>>>>> - Rich batch/streaming test >>>>>>>>> >>>>>>>>> - A-B testing of runners or runtimes (non-regression, >performance >>>>>>>>> comparison between versions ...) >>>>>>>>> >>>>>>>>> - Integration testing (sdk/runners, runner/runtime, ...) >>>>>>>>> >>>>>>>>> - Validate beam capability matrix >>>>>>>>> >>>>>>>>> - It can be used as part of the ongoing PerfKit work (if there >is >>>>>>>>> any >>>>>>>>> interest). >>>>>>>>> >>>>>>>>> As a final note, we are tracking the issues in the same repo. >If >>>>>>>>> >>>>>>>> someone >>>> >>>>> is interested in contributing, or have more ideas, you are welcome >:) >>>>>>>>> >>>>>>>>> Etienne >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>> Jean-Baptiste Onofré >>>>>> jbono...@apache.org >>>>>> http://blog.nanthrax.net >>>>>> Talend - http://www.talend.com >>>>>> >>>>>> >>