> I'm not quite following what these sizes are needed for--aren't the
> benchmarks already tuned to be specific, known sizes?

Maybe I wasn't clear enough. Such a metric is useful mostly in IO tests -
different IOs generate records of different sizes. It would be ideal for us
to have a universal way to get the total size, so that we could provide some
throughput measurement (we can easily get the time). In load tests we indeed
have known sizes, but as I said in point 2 above, maybe it's worth looking
at the other size as well (for comparison)?
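
Just to illustrate what I mean by a "universal way": a per-element monitor
based on the element's coder could look roughly like the sketch below. This
is only a sketch under the assumption that the encoded ("net") size is what
we want - the class name and counter names are made up, and it still pays a
per-element encoding cost (though presumably far less than reflective size
calculation):

import java.io.ByteArrayOutputStream;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

/** Hypothetical monitor: counts the encoded size of each element via its coder. */
class EncodedByteMonitor<T> extends DoFn<T, T> {
  private final Counter totalBytes = Metrics.counter("benchmark", "totalBytes");
  private final Coder<T> coder;

  EncodedByteMonitor(Coder<T> coder) {
    this.coder = coder;
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    coder.encode(c.element(), out);  // "net" size: exactly what the coder writes
    totalBytes.inc(out.size());
    c.output(c.element());
  }
}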

> especially for benchmarking purposes a 5x
> overhead means you're benchmarking the sizing code, not the pipeline
> itself.

Exactly. We don't want to do this.

> Beam computes estimates for PCollection sizes by using coder and
> sampling and publishes these as counters. It'd be best IMHO to reuse
> this. Are these counters not sufficient?

I didn't know that, and this should do the trick! Is such a counter available
for all SDKs (or at least Python and Java)? Is it supported by all runners
(or at least Flink and Dataflow)? Where can I find it to see if it fits?
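
For reference, assuming those estimates are exposed through the standard
metrics API (I haven't verified the namespace or names - "beam" /
"elementByteCount" below are placeholders), reading them after the run could
look something like this:

import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.MetricsFilter;

class SizeCounterReader {
  /** Sums all counters matching the given namespace/name once the pipeline is done. */
  static long totalCounterValue(PipelineResult result, String namespace, String name) {
    MetricQueryResults metrics =
        result.metrics().queryMetrics(
            MetricsFilter.builder()
                .addNameFilter(MetricNameFilter.named(namespace, name))
                .build());
    long total = 0;
    for (MetricResult<Long> counter : metrics.getCounters()) {
      // getAttempted() is supported more widely across runners than getCommitted().
      total += counter.getAttempted();
    }
    return total;
  }
}

// e.g. totalCounterValue(result, "beam", "elementByteCount")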

Thanks!


Tue, 28 May 2019 at 16:46 Robert Bradshaw <[email protected]> wrote:

> I'm not quite following what these sizes are needed for--aren't the
> benchmarks already tuned to be specific, known sizes? I agree that
> this can be expensive; especially for benchmarking purposes a 5x
> overhead means you're benchmarking the sizing code, not the pipeline
> itself.
>
> Beam computes estimates for PCollection sizes by using coder and
> sampling, and publishes these as counters. It'd be best IMHO to reuse
> this. Are these counters not sufficient?
>
> On Tue, May 28, 2019 at 12:55 PM Łukasz Gajowy <[email protected]> wrote:
> >
> > Hi all,
> >
> > Part of our work while creating benchmarks for Beam is to collect the
> total data size (bytes) that was put through the testing pipeline. We need
> that in load tests of core Beam operations (to see how big the load really
> was) and IO tests (to calculate throughput). The "not so good" way we're
> doing it right now is adding a DoFn step called "ByteMonitor" to the
> pipeline to get the size of every element using a utility called
> "ObjectSizeCalculator" [1].
> >
> > Problems with this approach:
> > 1. It's computationally expensive. After introducing this change, tests
> are 5x slower than before, because the size of each record is now
> calculated separately.
> > 2. Naturally, the size of a particular record measured this way is
> greater than the size of the generated key + value itself. E.g., if a
> synthetic source generates a key + value of 10 bytes total, the collected
> total bytes metric is 8x greater (due to wrapping the value in richer
> objects, allocating more memory than needed, etc.).
> >
> > The main question here is: which size of particular records is more
> interesting in benchmarks? The, let's call it, "net" size (key + value
> size and nothing else), or the "gross" size (including all memory
> allocated for a particular element in a PCollection and all the overhead
> of wrapping it in richer objects)? Maybe both sizes are worth measuring?
> >
> > For the "net" size we probably could (should?) do something similar to
> what the Nexmark suites have: pre-define a size for each element type and
> read it once the element is spotted in the pipeline [3].
> >
> > What do you think? Is there any other (efficient + reliable) way of
> measuring the total load size that I missed?
> >
> > Thanks for opinions!
> >
> > Best,
> > Łukasz
> >
> > [1]
> https://github.com/apache/beam/blob/a16a5b71cf8d399070a72b0f062693180d56b5ed/sdks/java/testing/test-utils/src/main/java/org/apache/beam/sdk/testutils/metrics/ByteMonitor.java
> > [2] https://issues.apache.org/jira/browse/BEAM-7431
> > [3]
> https://github.com/apache/beam/blob/eb3b57554d9dc4057ad79bdd56c4239bd4204656/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/model/KnownSize.java
>
