Great to hear! A couple of comments:

When Query 10 was written, the file-based sinks did not support unbounded input. Now that FileBasedSink in Beam supports windowed output files, I think we should just rip out the custom IO code in Query 10 and replace it with AvroIO - this is closer to what real Beam users will do, and it will also make it support HDFS.
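For illustration, here is a minimal sketch of what that replacement could look like, assuming roughly the Beam 2.x AvroIO API. The LogEntry record type, window size, shard count, and output prefix are placeholders rather than the actual Query 10 types, and depending on the Beam version a custom FilenamePolicy may be needed for windowed writes instead of a plain prefix.

import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class Query10AvroSketch {

  /** Hypothetical output record; the real Query 10 result type lives in the Nexmark code. */
  public static class LogEntry {
    public long timestampMillis;
    public String line;
  }

  /** Writes windowed Avro files; outputPrefix could be a GCS or HDFS path. */
  public static void writeResults(PCollection<LogEntry> results, String outputPrefix) {
    results
        // Window the unbounded output so the file sink can emit one set of files per window pane.
        .apply(Window.<LogEntry>into(FixedWindows.of(Duration.standardMinutes(10))))
        .apply(AvroIO.write(LogEntry.class)
            .to(outputPrefix)
            .withWindowedWrites()   // relies on FileBasedSink's windowed-output support
            .withNumShards(10));    // unbounded/windowed writes need an explicit shard count
  }
}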
Query 10 also tests some subtle semantics around late data - notably, that if an element from a source is not late, then elements resulting from processing that element are not late either. Essentially this is a correctness test for watermarks, and it should apply to all runners IMO.

WinningBids.java (used in Query 6) uses a fairly awkward (and computationally expensive) custom merging window function - largely because Mark was trying to avoid using the state API as much as he could (at the time there was no public state API). IMO we should rewrite WinningBids to use state (a rough sketch of such a stateful DoFn is appended after this thread). This should result in both cleaner code and a more efficient query.

Reuven

On Sun, May 14, 2017 at 3:09 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
> Hello,
>
> Thanks Etienne for opening the Pull Request and starting the discussion for the review process. I also want to thank publicly all the people that somehow contributed to this:
>
> - Mark Shields and the original people at Google who worked on Nexmark, for contributing this in the first place.
> - Etienne, because his work and constant help really improved the status of the queries (your work on Query 3 was really nice), and also for the hard work of helping me test all the queries with all the runners and pinging the runner maintainers for fixes.
> - Aviem/Amit for all the help to solve the issues with the Spark runner, whose support is now almost feature complete (even in streaming!).
> - Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for quickly adding the support for metrics.
> - Thomas Groh and Kenneth for fixing some needed parts in the Direct Runner and answering our questions on the State/Timer API.
> - JB and the Talend crew for all the feedback and help to run in our benchmark cluster.
> - And of course the rest of the Beam community :)
>
> Some comments:
>
> - This does not need to have a feature branch since we have been working on this in a fork for months now, and with the stable API we can simply do a traditional PR review. Of course the review is a bit bigger, so we expect it to take some time, but I hope we can get some quick progress once FSR is out.
>
> - We need a hand from the Google guys. For the moment we have tested all the queries in all the runners, but not in the Dataflow runner, because we don't have access to it (well, we have, but not with the freedom that you guys have to run the benchmark at will). So it would be nice if we could get some access, or, if this is not possible, if some of you guys could help us test/report any given issue on this runner.
>
> - We also have to decide the future of some features. This is probably independent of the current PR and part of the evolution of Nexmark on Beam:
>
> -- There are still some pending things that can be improved even after the review, once in master. For example, we have for the moment only synthetic sources, but the original version also took data from Pubsub; we have to define the correct scope for this and, given the case, also add other sources, e.g. Kafka, HDFS.
>
> -- Query 10 is really oriented to testing Google runner/IO-specific features, so we have to decide what to do with this one, maybe mirroring it with Kafka/HDFS to have something equivalent in the Apache world.
>
> This is all for now. I am really glad that this is finally happening and I hope this soon gets merged.
>
> Ismaël
>
> On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik <lc...@google.com.invalid> wrote:
> > I think these are valuable enough that we should get them into apache/master.
> >
> > On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >
> >> Hi,
> >>
> >> PR or even a feature branch could work. Up to you.
> >>
> >> Regards
> >> JB
> >>
> >> On 05/12/2017 10:55 AM, Etienne Chauchot wrote:
> >>
> >>> Hi guys,
> >>>
> >>> I wanted to let you know that I have just submitted a PR around NexMark. This is a port of the NexMark queries to Beam, to be used as integration tests. This can also be used for A-B testing (non-regression or performance comparison between two versions of the same engine or of the same runner).
> >>>
> >>> This is a continuation of the previous PR (#99) from Mark Shields. The code has changed quite a bit: some queries have changed to use new Beam APIs, and there were some big refactorings. More importantly, we can now run all the queries in all the runners.
> >>>
> >>> Nevertheless, there are still some open issues in Nexmark (https://github.com/iemejia/beam/issues) and in Beam upstream (see the issue links in https://issues.apache.org/jira/browse/BEAM-160).
> >>>
> >>> I wanted to submit the PR before our (Ismaël's and my) NexMark talk at ApacheCon. The PR is not perfect, but it is in good shape to share.
> >>>
> >>> Best,
> >>>
> >>> Etienne
> >>>
> >>> On 22/03/2017 at 04:51, Kenneth Knowles wrote:
> >>>
> >>>> This is great! Having a variety of realistic-ish pipelines running on all runners complements the validation suite and IO IT work.
> >>>>
> >>>> If I recall, some of these involve heavy and esoteric uses of state, so definitely give me a ping if you hit any trouble.
> >>>>
> >>>> Kenn
> >>>>
> >>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <echauc...@gmail.com> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Ismaël and I are working on upgrading the Nexmark implementation for Beam. See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and https://issues.apache.org/jira/browse/BEAM-160. We are continuing the work done by Mark Shields; see https://github.com/apache/beam/pull/366 for the original PR.
> >>>>>
> >>>>> The PR contains queries that have a wide coverage of the Beam model and that represent realistic end-user use cases (some come from client experience on Google Cloud Dataflow).
> >>>>>
> >>>>> So far, we have upgraded the implementation to the latest Beam snapshot, and we are able to execute a good subset of the queries in the different runners. We upgraded the Nexmark drivers to do so: the direct driver (upgraded from inProcessDriver) and the Flink driver, and we added a new one for Spark.
> >>>>>
> >>>>> There is still a good amount of work to do, and we would like to know if you think that this contribution can eventually have its place in Beam.
> >>>>>
> >>>>> The interests of having Nexmark on Beam that we have seen so far are:
> >>>>>
> >>>>> - Rich batch/streaming tests
> >>>>>
> >>>>> - A-B testing of runners or runtimes (non-regression, performance comparison between versions, ...)
> >>>>>
> >>>>> - Integration testing (SDK/runners, runner/runtime, ...)
> >>>>>
> >>>>> - Validation of the Beam capability matrix
> >>>>>
> >>>>> - It can be used as part of the ongoing PerfKit work (if there is any interest).
> >>>>>
> >>>>> As a final note, we are tracking the issues in the same repo. If someone is interested in contributing, or has more ideas, you are welcome :)
> >>>>>
> >>>>> Etienne
> >>>>>
> >>>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
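As a purely illustrative sketch of the state-based WinningBids rewrite Reuven suggests above (and referenced there), assuming roughly the Beam 2.x state/timer API: a stateful DoFn keyed by auction id that buffers bid prices in state and emits the highest one when an event-time timer fires, instead of grouping with a custom merging WindowFn. The Long price values, the fixed expiry timeout, and the class and state names are placeholders; the real WinningBids also joins bids against the Auction stream and uses each auction's actual expiry.

import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

/** Illustration only: emit the highest bid price per auction id using state and timers. */
public class HighestBidPerAuction extends DoFn<KV<Long, Long>, KV<Long, Long>> {

  // Placeholder: the real query would use each auction's own expiry timestamp.
  private static final Duration AUCTION_TIMEOUT = Duration.standardMinutes(10);

  @StateId("auctionId")
  private final StateSpec<ValueState<Long>> auctionIdSpec = StateSpecs.value(VarLongCoder.of());

  @StateId("bidPrices")
  private final StateSpec<BagState<Long>> bidPricesSpec = StateSpecs.bag(VarLongCoder.of());

  @TimerId("expiry")
  private final TimerSpec expirySpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void processElement(
      ProcessContext c,
      @StateId("auctionId") ValueState<Long> auctionId,
      @StateId("bidPrices") BagState<Long> bidPrices,
      @TimerId("expiry") Timer expiry) {
    auctionId.write(c.element().getKey());   // remember the key for the timer callback
    bidPrices.add(c.element().getValue());   // buffer this bid's price
    expiry.set(c.timestamp().plus(AUCTION_TIMEOUT));
  }

  @OnTimer("expiry")
  public void onExpiry(
      OnTimerContext c,
      @StateId("auctionId") ValueState<Long> auctionId,
      @StateId("bidPrices") BagState<Long> bidPrices) {
    Long best = null;
    for (Long price : bidPrices.read()) {
      best = (best == null || price > best) ? price : best;
    }
    if (best != null) {
      c.output(KV.of(auctionId.read(), best));   // emit the winning (highest) bid price
    }
    bidPrices.clear();
  }
}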