Hi
Indeed, the query builds a lot of maps, so it is expensive. I totally agree with your point: adding a query with a simple merge, such as the one done in the ValidatesRunner test below, is a good idea. I'll open two tickets, one for the migration of WinningBids to the state API and one for the creation of a query 13 that illustrates a simple custom window merge.
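To make the idea concrete, here is a minimal sketch of what the "simple merge" WindowFn of such a query 13 could look like (the class name and the 10s window size are placeholders, not a committed design; the merge reuses the same overlapping-interval helper that Sessions uses):

import java.util.Collection;
import java.util.Collections;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.MergeOverlappingIntervalWindows;
import org.apache.beam.sdk.transforms.windowing.WindowFn;
import org.apache.beam.sdk.transforms.windowing.WindowMappingFn;
import org.joda.time.Duration;

/** Sketch: assign each element a small interval window, merge overlaps. */
public class SimpleMergingWindowFn extends WindowFn<Object, IntervalWindow> {
  private static final Duration WINDOW_SIZE = Duration.standardSeconds(10);

  @Override
  public Collection<IntervalWindow> assignWindows(AssignContext c) {
    // One small window per element, anchored at the element timestamp.
    return Collections.singletonList(new IntervalWindow(c.timestamp(), WINDOW_SIZE));
  }

  @Override
  public void mergeWindows(MergeContext c) throws Exception {
    // Cheap merge: collapse any overlapping IntervalWindows into one,
    // exactly like Sessions does.
    MergeOverlappingIntervalWindows.mergeWindows(c);
  }

  @Override
  public boolean isCompatible(WindowFn<?, ?> other) {
    return other instanceof SimpleMergingWindowFn;
  }

  @Override
  public Coder<IntervalWindow> windowCoder() {
    return IntervalWindow.getCoder();
  }

  @Override
  public WindowMappingFn<IntervalWindow> getDefaultWindowMappingFn() {
    throw new UnsupportedOperationException("Not for use as a side input.");
  }
}

Since the merge itself stays trivial, such a query would mostly measure how often the runner merges windows rather than the cost of the merge function.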

Etienne

Le 19/09/2017 à 18:09, Reuven Lax a écrit :
On Tue, Sep 19, 2017 at 7:29 AM, Etienne Chauchot <[email protected]> wrote:

Hi all,

I'm resuming my work on Nexmark a bit, starting with some maintenance on the tickets.

@Reuven: I have some comments inline below.

Le 14/05/2017 à 14:29, Reuven Lax a écrit :

Great to hear! A couple of comments:

When Query 10 was written, the file-based sinks did not support unbounded input. Now that Beam's FileBasedSink supports windowed output files, I think we should just rip out the custom IO code in Query 10 and replace it with AvroIO - this is closer to what real Beam users will do, and it will also make it support HDFS.

+1: I updated this ticket https://issues.apache.org/jira/browse/BEAM-2856
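For illustration, the replacement could be as simple as something like this sketch (the results PCollection, the QueryResult type, the output path and the shard count are all placeholders):

// Sketch only: write Query 10's windowed, unbounded output with AvroIO
// instead of the custom sink.
results.apply(AvroIO.write(QueryResult.class)
    .to("hdfs://namenode/nexmark/query10/results")
    .withWindowedWrites() // possible now that FileBasedSink supports windowed files
    .withNumShards(10));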

Query 10 also tests some subtle semantics around late data - notably that if an element from a source is not late, elements resulting from processing that element are not late. Essentially this is a correctness test for watermarks, and should apply to all runners IMO.

Yes I agree, but there are some ValidatesRunner tests around this, right? If not, we should create some IMHO.

WinningBids.java (used in Query6) uses a fairly awkward (and computationally expensive) custom merging window function - largely because Mark was trying to avoid using the state API as much as he could (at the time there was no public state API). IMO we should rewrite WinningBids to use state. This should result in both cleaner code and a more efficient query.
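Roughly, the winning-bid selection could become a stateful ParDo along these lines (a sketch only: the names and Bid's public price field are assumptions, and a real rewrite would also need timers to emit the winner when the auction closes):

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/** Sketch: track, per auction id, the highest bid seen so far using state. */
public class SelectMaxBidFn extends DoFn<KV<Long, Bid>, KV<Long, Bid>> {

  @StateId("maxBid")
  private final StateSpec<ValueState<Bid>> maxBidSpec = StateSpecs.value();

  @ProcessElement
  public void processElement(
      ProcessContext c, @StateId("maxBid") ValueState<Bid> maxBid) {
    Bid incoming = c.element().getValue();
    Bid best = maxBid.read();
    if (best == null || incoming.price > best.price) {
      // New best bid for this auction: remember it and emit it downstream.
      maxBid.write(incoming);
      c.output(KV.of(c.element().getKey(), incoming));
    }
  }
}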

I agree that this query is a bit awkward. But it is the only one in the query set that illustrates custom window merging. There is already query 3 that illustrates the use of the state API (I migrated it to use the state API after Mark released it). Even if there is now a ValidatesRunner test on custom window merging ([1]), I believe it could be useful to keep WinningBids as it is, to serve as a benchmark of custom window merging in the runners.

My memory was that this was an awkward use of merging windows, as the merge function was very expensive (building maps, etc.). As such, the cost of the WinningBids merge function dominated, so it really just served as a benchmark of how often windows were merged (i.e. merge is called very often in streaming runners and much less often in batch runners).

I wonder if we're best off introducing a new query that more explicitly tests merging windows, with a more reasonable merging WindowFn.

WDYT?

[1] https://github.com/apache/beam/blob/c65aca07faf7b8c4dabe6cae7b5b52286d2b25b1/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowTest.java#L591

Best,
Etienne


Reuven

On Sun, May 14, 2017 at 3:09 AM, Ismaël Mejía <[email protected]> wrote:

Hello,
Thanks Etienne for opening the Pull Request and starting the
discussion for the review process. I also want to thank publicly all
the people that somehow contributed to this:

- Mark Shields and the original people at Google who worked on Nexmark, for contributing this in the first place.
- Etienne, because his work and constant help really improved the status of the queries; your work on query 3 was really nice, and thanks also for the hard work of helping me test all the queries with all the runners and pinging the runner maintainers for fixes.
- Aviem/Amit for all the help solving the issues with the Spark runner, whose support is now almost feature complete (even in streaming!).
- Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for quickly adding support for metrics.
- Thomas Groh and Kenneth for fixing some needed parts in the Direct Runner + answering our questions on the State/Timer API.
- JB and the Talend crew for all the feedback and help running this in our benchmark cluster.
- And of course the rest of the Beam community :)

Some comments:

- This does not need to have a feature branch, since we have been working on this in a fork for months now, and with the stable API we can simply do a traditional PR review. Of course the review is a bit bigger, so we expect it to take some time, but I hope we can get some quick progress once FSR is out.

- We need a hand from the Google guys. For the moment we have tested all the queries in all the runners, but not in the Dataflow runner, because we don't have access to it (well, we have, but not with the freedom that you guys have to run the benchmark at will). So it would be nice if we could get some access, or, if this is not possible, if some of you guys could help us test/report any given issue on this runner.

- We also have to decide the future of some features; this is probably independent of the current PR and part of the evolution of Nexmark on Beam:

-- There are still some pending things that can be improved even after the review, once in master. E.g. for the moment we only have synthetic sources, but the original version also took data from Pubsub; we have to define the correct scope for this and, if appropriate, also add other sources, e.g. Kafka, HDFS.

-- Query 10 is really oriented to testing Google runner/IO-specific features, so we have to decide what to do with this one, maybe mirroring it with Kafka/HDFS to have something equivalent in the Apache world.

This is all for now. I am really glad that this is finally happening, and I hope this gets merged soon.

Ismaël

On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik <[email protected]> wrote:

I think these are valuable enough that we should get them into apache/master.

On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré <[email protected]> wrote:

Hi,
A PR or even a feature branch could work. Up to you.

Regards
JB


On 05/12/2017 10:55 AM, Etienne Chauchot wrote:

Hi guys,
I wanted to let you know that I have just submitted a PR around NexMark. This is a port of the NexMark queries to Beam, to be used as integration tests. This can also be used for A-B testing (no-regression or performance comparison between 2 versions of the same engine or of the same runner).

This is a continuation of the previous PR (#99) from Mark Shields. The code has changed quite a bit: some queries have changed to use new Beam APIs and there were some big refactorings. More importantly, we can now run all the queries in all the runners.

Nevertheless, there are still some open issues in Nexmark (https://github.com/iemejia/beam/issues) and in Beam upstream (see issue links in https://issues.apache.org/jira/browse/BEAM-160).

I wanted to submit the PR before our (Ismaël's and my) NexMark talk at ApacheCon. The PR is not perfect, but it is in good shape to share.

Best,

Etienne



Le 22/03/2017 à 04:51, Kenneth Knowles a écrit :

This is great! Having a variety of realistic-ish pipelines running on all runners complements the validation suite and IO IT work.
If I recall, some of these involve heavy and esoteric uses of state, so definitely give me a ping if you hit any trouble.
Kenn

On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <[email protected]> wrote:
Hi all,

Ismael and I are working on upgrading the Nexmark implementation for Beam. See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and https://issues.apache.org/jira/browse/BEAM-160. We are continuing the work done by Mark Shields. See https://github.com/apache/beam/pull/366 for the original PR.
The PR contains queries that have a wide coverage of the Beam model and that represent realistic end-user use cases (some come from client experience on Google Cloud Dataflow).

So far, we have upgraded the implementation to the latest Beam snapshot, and we are able to execute a good subset of the queries in the different runners. We upgraded the Nexmark drivers to do so: the direct driver (upgraded from inProcessDriver) and the Flink driver, and we added a new one for Spark. There is still a good amount of work to do, and we would like to know if you think that this contribution can eventually have its place in Beam.

The benefits of having Nexmark on Beam that we have seen so far are:

- Rich batch/streaming tests

- A-B testing of runners or runtimes (non-regression, performance comparison between versions ...)

- Integration testing (sdk/runners, runner/runtime, ...)

- Validation of the Beam capability matrix

- It can be used as part of the ongoing PerfKit work (if there is any interest).

As a final note, we are tracking the issues in the same repo. If someone is interested in contributing or has more ideas, you are welcome :)
Etienne



--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com


