IIRC it is also exciting in that the final window is smaller than the union of its component windows. I think we have a JIRA open to decide if that is even allowed.
On Wed, Sep 20, 2017 at 12:15 AM, Etienne Chauchot <[email protected]> wrote: > Hi > Indeed, the query builds a lot of maps, it is thus expensive. I totally > agree with your point, adding a query with a simple merge such as the one > that is done in the ValidatesRunner bellow is a good idea. I'll add 2 > tickets, one for the migration of winningBids to state API and one for the > creation of query 13 that illustrates a simple custom window merge. > > Etienne > > > Le 19/09/2017 à 18:09, Reuven Lax a écrit : > >> On Tue, Sep 19, 2017 at 7:29 AM, Etienne Chauchot <[email protected]> >> wrote: >> >> Hi all, >>> >>> I'm resuming my work on Nexmark a bit, starting to do some maintenance on >>> the tickets >>> >>> @Reuven: I have some comments inline below. >>> >>> Le 14/05/2017 à 14:29, Reuven Lax a écrit : >>> >>> Great to hear! A couple of comments: >>>> >>>> When Query 10 was written, the file-based sinks did not supported >>>> unbounded >>>> input. Now that in Beam FileBasedSink supports windowed output files, I >>>> think we should just rip out the custom IO code in Query 10 and replace >>>> it >>>> with AvroIO - this is closer to what real Beam users will do, and it >>>> will >>>> also make it support HDFS. >>>> >>>> +1: I updated this ticket https://issues.apache.org/jira >>> /browse/BEAM-2856 >>> >>> Query 10 also tests some subtle semantics around late data - notably that >>>> if an element from a source is not late, elements resulting from >>>> processing >>>> that element are not late. Essentially this is a correctness test for >>>> watermarks, and should apply to all runners IMO. >>>> >>>> Yes I agree, but there is some ValidatesRunner tests around this, right? >>> If not, we should create some IMHO. >>> >>> WinningBids.java (used in Query6) uses a fairly awkward (and >>>> computationally expensive) custom merging window function - largely >>>> because >>>> Mark was trying to avoid using the state API as much as he could (at the >>>> time there was no public state API). IMO we should rewrite WinningBids >>>> to >>>> use state. This should result in both cleaner code, and more efficient >>>> query. >>>> >>>> I agree that this query is a bit awkward. But it is the only one in the >>> query set that illustrates custom window merging. There is already query >>> 3 >>> that illustrates the use of state API (I migrated it to use state API >>> after >>> Mark released it). Even if there is now a ValidatesRunner test on custom >>> window merging ([1]), I believe it could be useful to keep WinningBids as >>> it is to serve as benchmark of custom window merging in the runners. >>> >>> My memory wast that this was an awkward use of merging windows, as the >> merge function was very expensive (building maps, etc.). As such, the cost >> of the WinningBid merge function dominated, so it really just served as a >> benchmark of how often windows were merged (i.e. merge is called very >> often in streaming runners an much less often in batch runners). >> >> I wonder if we're best off introducing a new query that more explicitly >> tests merging windows, with a more-reasonable merging window fn. >> > > WDYT? >>> >>> [1] https://github.com/apache/beam/blob/c65aca07faf7b8c4dabe6cae >>> 7b5b52286d2b25b1/sdks/java/core/src/test/java/org/apache/ >>> beam/sdk/transforms/windowing/WindowTest.java#L591 >>> >>> Best, >>> Etienne >>> >>> >>> Reuven >>>> >>>> On Sun, May 14, 2017 at 3:09 AM, Ismaël Mejía <[email protected]> >>>> wrote: >>>> >>>> Hello, >>>> >>>>> Thanks Etienne for opening the Pull Request and starting the >>>>> discussion for the review process. I also want to thank publicly all >>>>> the people that somehow contributed to this: >>>>> >>>>> - Mark Shields and the original people at google who worked at nexmark >>>>> for contributing this in the first place. >>>>> - Etienne because his work and constant help really improved the >>>>> status of the queries, your work on query 3 was really nice, and also >>>>> for the hard work of helping me test all the queries with all the >>>>> runners and ping the runner maintainers for fixes. >>>>> - Aviem/Amit for all the help to solve the issues with the spark >>>>> runner whose support is now almost feature complete (even in >>>>> streaming!). >>>>> - Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for >>>>> quickly adding the support for metrics. >>>>> - Thomas Groh and Kenneth for fixing some needed parts in Direct >>>>> Runner + answering our questions on the State/Timer API. >>>>> - JB and the talend crew for all the feedback and help to run in our >>>>> benchmark cluster. >>>>> - And of course the rest of the Beam community :) >>>>> >>>>> Some comments: >>>>> >>>>> - This does not need to have a feature branch since we have been >>>>> working on this in a fork for months now and with the stable API we >>>>> can simply do a traditional PR review. Of course the review is a bit >>>>> bigger so we expect it to take some time, but I hope we can get some >>>>> quick progress once FSR is out. >>>>> >>>>> - We need a hand from the google guys, for the moment we have tested >>>>> all the queries in all the runners, but not in the Dataflow runner >>>>> because we don't have access to it (well we have but not with the >>>>> freedom that you guys have to run the benchmark at will), so if we can >>>>> get some access that would be nice or if this is not possible, it >>>>> would be nice if some of you guys help us test/report any given issue >>>>> on this runner, >>>>> >>>>> - We also have to decide the future of some features, this is probably >>>>> independent of the current PR and part of the evolution of Nexmark on >>>>> Beam: >>>>> >>>>> -- There are still some pending things that can be improved even after >>>>> the review once in master, e.g. we have for the moment only synthetic >>>>> sources but the original version took also data from Pubsub, we have >>>>> to define the correct scope for this and given the case also add other >>>>> sources, e.g. Kafka, HDFS. >>>>> >>>>> -- Query 10 is really oriented to testing Google Runner/IOs specific >>>>> features, so we have to decide what to do with this one, maybe >>>>> mirroring it with Kafka/HDFS to have something equivalent in the >>>>> Apache world. >>>>> >>>>> This is all for now, I am really glad that this is finally happening >>>>> and I hope this soon gets merged. >>>>> >>>>> Ismaël >>>>> >>>>> On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik <[email protected] >>>>> > >>>>> wrote: >>>>> >>>>> I think these are valuable enough that we should get them into >>>>>> >>>>>> apache/master >>>>> >>>>> On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré <[email protected] >>>>>> > >>>>>> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>>> PR or even a feature branch could work. Up to you. >>>>>>> >>>>>>> Regards >>>>>>> JB >>>>>>> >>>>>>> >>>>>>> On 05/12/2017 10:55 AM, Etienne Chauchot wrote: >>>>>>> >>>>>>> Hi guys, >>>>>>> >>>>>>>> I wanted to let you know that I have just submitted a PR around >>>>>>>> >>>>>>>> NexMark. >>>>>>> >>>>>> This is >>>>>> >>>>>>> a port of the NexMark queries to Beam, to be used as integration >>>>>>>> tests. >>>>>>>> This can also be used as A-B testing (no-regression or performance >>>>>>>> comparison >>>>>>>> between 2 versions of the same engine or of the same runner) >>>>>>>> >>>>>>>> This a continuation of the previous PR (#99) from Mark Shields. >>>>>>>> The code has changed quite a bit: some queries have changed to use >>>>>>>> new >>>>>>>> Beam APIs >>>>>>>> and there where some big refactorings. More important, we can now >>>>>>>> run >>>>>>>> >>>>>>>> all >>>>>>> >>>>>> the >>>>>> >>>>>>> queries in all the runners. >>>>>>>> >>>>>>>> Nevertheless, there are still some open issues in Nexmark >>>>>>>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see >>>>>>>> >>>>>>>> issue >>>>>>> >>>>>> links >>>>>> >>>>>>> in https://issues.apache.org/jira/browse/BEAM-160) >>>>>>>> >>>>>>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at >>>>>>>> the >>>>>>>> ApacheCon. The PR is not perfect but it is in a good shape to share >>>>>>>> it. >>>>>>>> >>>>>>>> Best, >>>>>>>> >>>>>>>> Etienne >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Le 22/03/2017 à 04:51, Kenneth Knowles a écrit : >>>>>>>> >>>>>>>> This is great! Having a variety of realistic-ish pipelines running >>>>>>>> on >>>>>>>> all >>>>>>>> >>>>>>> runners complements the validation suite and IO IT work. >>>>>> >>>>>>> If I recall, some of these involve heavy and esoteric uses of state, >>>>>>>>> >>>>>>>>> so >>>>>>>> >>>>>>> definitely give me a ping if you hit any trouble. >>>>>> >>>>>>> Kenn >>>>>>>>> >>>>>>>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot < >>>>>>>>> >>>>>>>>> [email protected]> >>>>>>>> >>>>>>> wrote: >>>>>> >>>>>>> Hi all, >>>>>>>>> >>>>>>>>> Ismael and I are working on upgrading the Nexmark implementation >>>>>>>>>> for >>>>>>>>>> Beam. >>>>>>>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and >>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing >>>>>>>>>> >>>>>>>>>> the >>>>>>>>> >>>>>>>> work done by Mark Shields. See https://github.com/apache/ >>>>>> >>>>>>> beam/pull/366 >>>>>>>>> >>>>>>>> for the original PR. >>>>>> >>>>>>> The PR contains queries that have a wide coverage of the Beam model >>>>>>>>>> >>>>>>>>>> and >>>>>>>>> >>>>>>>> that represent a realistic end user use case (some come from client >>>>>> >>>>>>> experience on Google Cloud Dataflow). >>>>>>>>>> >>>>>>>>>> So far, we have upgraded the implementation to the latest Beam >>>>>>>>>> >>>>>>>>>> snapshot. >>>>>>>>> >>>>>>>> And we are able to execute a good subset of the queries in the >>>>>> >>>>>>> different >>>>>>>>> >>>>>>>> runners. We upgraded the nexmark drivers to do so: direct driver >>>>>> >>>>>>> (upgraded >>>>>>>>>> from inProcessDriver) and flink driver and we added a new one for >>>>>>>>>> >>>>>>>>>> spark. >>>>>>>>> >>>>>>>> There is still a good amount of work to do and we would like to know >>>>>> >>>>>>> if >>>>>>>>> >>>>>>>> you think that this contribution can have its place into Beam >>>>>> >>>>>>> eventually. >>>>>>>>>> >>>>>>>>>> The interests of having Nexmark on Beam that we have seen so far >>>>>>>>>> are: >>>>>>>>>> >>>>>>>>>> - Rich batch/streaming test >>>>>>>>>> >>>>>>>>>> - A-B testing of runners or runtimes (non-regression, performance >>>>>>>>>> comparison between versions ...) >>>>>>>>>> >>>>>>>>>> - Integration testing (sdk/runners, runner/runtime, ...) >>>>>>>>>> >>>>>>>>>> - Validate beam capability matrix >>>>>>>>>> >>>>>>>>>> - It can be used as part of the ongoing PerfKit work (if there is >>>>>>>>>> any >>>>>>>>>> interest). >>>>>>>>>> >>>>>>>>>> As a final note, we are tracking the issues in the same repo. If >>>>>>>>>> >>>>>>>>>> someone >>>>>>>>> >>>>>>>> is interested in contributing, or have more ideas, you are welcome >>>>>> :) >>>>>> >>>>>>> Etienne >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> >>>>>>>>> Jean-Baptiste Onofré >>>>>>> [email protected] >>>>>>> http://blog.nanthrax.net >>>>>>> Talend - http://www.talend.com >>>>>>> >>>>>>> >>>>>>> >
