Re: [DISCUSSION] using NexMark for Beam

Etienne Chauchot Tue, 19 Sep 2017 07:29:53 -0700

Hi all,

I'm resuming my work on Nexmark a bit, starting to do some maintenanceon the tickets


@Reuven: I have some comments inline below.

Le 14/05/2017 à 14:29, Reuven Lax a écrit :

Great to hear! A couple of comments:

When Query 10 was written, the file-based sinks did not supported unbounded
input. Now that in Beam FileBasedSink supports windowed output files, I
think we should just rip out the custom IO code in Query 10 and replace it
with AvroIO  - this is closer to what real Beam users will do, and it will
also make it support HDFS.

+1: I updated this ticket https://issues.apache.org/jira/browse/BEAM-2856


Query 10 also tests some subtle semantics around late data - notably that
if an element from a source is not late, elements resulting from processing
that element are not late. Essentially this is a correctness test for
watermarks, and should apply to all runners IMO.

Yes I agree, but there is some ValidatesRunner tests around this, right?If not, we should create some IMHO.


WinningBids.java (used in Query6) uses a fairly awkward (and
computationally expensive) custom merging window function - largely because
Mark was trying to avoid using the state API as much as he could (at the
time there was no public state API). IMO we should rewrite WinningBids to
use state. This should result in both cleaner code, and more efficient
query.

I agree that this query is a bit awkward. But it is the only one in thequery set that illustrates custom window merging. There is already query3 that illustrates the use of state API (I migrated it to use state APIafter Mark released it). Even if there is now a ValidatesRunner test oncustom window merging ([1]), I believe it could be useful to keepWinningBids as it is to serve as benchmark of custom window merging inthe runners.


WDYT?

[1]https://github.com/apache/beam/blob/c65aca07faf7b8c4dabe6cae7b5b52286d2b25b1/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowTest.java#L591


Best,
Etienne


Reuven

On Sun, May 14, 2017 at 3:09 AM, Ismaël Mejía <ieme...@gmail.com> wrote:

Hello,

Thanks Etienne for opening the Pull Request and starting the
discussion for the review process. I also want to thank publicly all
the people that somehow contributed to this:

- Mark Shields and the original people at google who worked at nexmark
for contributing this in the first place.
- Etienne because his work and constant help really improved the
status of the queries, your work on query 3 was really nice, and also
for the hard work of helping me test all the queries with all the
runners and ping the runner maintainers for fixes.
- Aviem/Amit for all the help to solve the issues with the spark
runner whose support is now almost feature complete (even in
streaming!).
- Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for
quickly adding the support for metrics.
- Thomas Groh and Kenneth for fixing some needed parts in Direct
Runner + answering our questions on the State/Timer API.
- JB and the talend crew for all the feedback and help to run in our
benchmark cluster.
- And of course the rest of the Beam community :)

Some comments:

- This does not need to have a feature branch since we have been
working on this in a fork for months now and with the stable API we
can simply do a traditional PR review. Of course the review is a bit
bigger so we expect it to take some time, but I hope we can get some
quick progress once FSR is out.

- We need a hand from the google guys, for the moment we have tested
all the queries in all the runners, but not in the Dataflow runner
because we don't have access to it (well we have but not with the
freedom that you guys have to run the benchmark at will), so if we can
get some access that would be nice or if this is not possible, it
would be nice if some of you guys help us test/report any given issue
on this runner,

- We also have to decide the future of some features, this is probably
independent of the current PR and part of the evolution of Nexmark on
Beam:

-- There are still some pending things that can be improved even after
the review once in master, e.g. we have for the moment only synthetic
sources but the original version took also data from Pubsub, we have
to define the correct scope for this and given the case also add other
sources, e.g. Kafka, HDFS.

-- Query 10 is really oriented to testing Google Runner/IOs specific
features, so we have to decide what to do with this one, maybe
mirroring it with Kafka/HDFS to have something equivalent in the
Apache world.

This is all for now, I am really glad that this is finally happening
and I hope this soon gets merged.

Ismaël

On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik <lc...@google.com.invalid>
wrote:

I think these are valuable enough that we should get them into

apache/master

On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Hi,

PR or even a feature branch could work. Up to you.

Regards
JB


On 05/12/2017 10:55 AM, Etienne Chauchot wrote:

Hi guys,

I wanted to let you know that I have just submitted a PR around

NexMark.

This is
a port of the NexMark queries to Beam, to be used as integration tests.
This can also be used as A-B testing (no-regression or performance
comparison
between 2 versions of the same engine or of the same runner)

This a continuation of the previous PR (#99) from Mark Shields.
The code has changed quite a bit: some queries have changed to use new
Beam APIs
and there where some big refactorings. More important, we can now run

all

the
queries in all the runners.

Nevertheless, there are still some open issues in Nexmark
(https://github.com/iemejia/beam/issues) and in Beam upstream (see

issue

links
in https://issues.apache.org/jira/browse/BEAM-160)

I wanted to submit the PR before our (Ismaël and I) NexMark talk at the
ApacheCon. The PR is not perfect but it is in a good shape to share it.

Best,

Etienne



Le 22/03/2017 à 04:51, Kenneth Knowles a écrit :

This is great! Having a variety of realistic-ish pipelines running on

all

runners complements the validation suite and IO IT work.

If I recall, some of these involve heavy and esoteric uses of state,

so

definitely give me a ping if you hit any trouble.

Kenn

On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <

echauc...@gmail.com>

wrote:

Hi all,

Ismael and I are working on upgrading the Nexmark implementation for
Beam.
See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
https://issues.apache.org/jira/browse/BEAM-160. We are continuing

the

work done by Mark Shields. See https://github.com/apache/

beam/pull/366

for the original PR.

The PR contains queries that have a wide coverage of the Beam model

and

that represent a realistic end user use case (some come from client
experience on Google Cloud Dataflow).

So far, we have upgraded the implementation to the latest Beam

snapshot.

And we are able to execute a good subset of the queries in the

different

runners. We upgraded the nexmark drivers to do so: direct driver
(upgraded
from inProcessDriver) and flink driver and we added a new one for

spark.

There is still a good amount of work to do and we would like to know

if

you think that this contribution can have its place into Beam
eventually.

The interests of having Nexmark on Beam that we have seen so far are:

- Rich batch/streaming test

- A-B testing of runners or runtimes (non-regression, performance
comparison between versions ...)

- Integration testing (sdk/runners, runner/runtime, ...)

- Validate beam capability matrix

- It can be used as part of the ongoing PerfKit work (if there is any
interest).

As a final note, we are tracking the issues in the same repo. If

someone

is interested in contributing, or have more ideas, you are welcome :)

Etienne

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSSION] using NexMark for Beam

Reply via email to