The Go SDK doesn't yet have these counters implemented or published
(sampling elements & counting between DoFns, etc.).

On Tue, May 28, 2019, 9:08 AM Alexey Romanenko <[email protected]>
wrote:

> On 28 May 2019, at 17:31, Łukasz Gajowy <[email protected]> wrote:
>
>
> I'm not quite following what these sizes are needed for--aren't the
> benchmarks already tuned to be specific, known sizes?
>
> Maybe I wasn't clear enough. Such a metric is useful mostly in IO tests -
> different IOs generate records of different sizes. It would be ideal for us
> to have a universal way of getting the total size so that we could provide
> some throughput measurement (we can easily get the time). In Load tests we
> indeed have known sizes, but as I said above in point 2 - maybe it's worth
> looking at the other size as well (to compare)?
>
>
> Łukasz, I’m sorry but it’s still not clear to me - what is the point of
> comparing these sizes? What I mean is that if we already have the size of
> the generated load (i.e. the expected data size) and the processing time
> after the test run, then we can calculate the throughput. In addition, we
> compute a hash of all processed data and compare it with the expected hash
> to make sure that there is no data loss or corruption. Am I missing something?
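>
> To illustrate, a minimal sketch of that calculation (all values and names
> below are made-up placeholders, not Beam APIs):
>
> public class ThroughputCheck {
>   public static void main(String[] args) {
>     long expectedDataSizeBytes = 10_000_000L; // known size of the generated load
>     long runTimeMillis = 25_000L;             // measured processing time
>     // Throughput = known data size / measured processing time.
>     double throughputMBps =
>         (expectedDataSizeBytes / (1024.0 * 1024.0)) / (runTimeMillis / 1000.0);
>     System.out.printf("Throughput: %.2f MB/s%n", throughputMBps);
>
>     // Integrity check: the hash of all processed data must match the hash
>     // expected for the generated load (hash values here are made up).
>     String expectedHash = "6a5889bb0190d0c05c8f2f5b5f1e3f70";
>     String actualHash = "6a5889bb0190d0c05c8f2f5b5f1e3f70";
>     System.out.println("No data loss/corruption: " + expectedHash.equals(actualHash));
>   }
> }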
>
>
>
> especially for benchmarking purposes a 5x
> overhead means you're benchmarking the sizing code, not the pipeline
> itself.
>
> Exactly. We don't want to do this.
>
> Beam computes estimates for PCollection sizes by using coders and
> sampling, and publishes these as counters. It'd be best IMHO to reuse
> this. Are these counters not sufficient?
>
> I didn't know that and this should do the trick! Is such a counter available
> for all SDKs (or at least Python and Java)? Is it supported by all runners
> (or at least Flink and Dataflow)? Where can I find it to see if it fits?
>
> Thanks!
>
>
> On Tue, May 28, 2019 at 16:46 Robert Bradshaw <[email protected]> wrote:
>
>> I'm not quite following what these sizes are needed for--aren't the
>> benchmarks already tuned to be specific, known sizes? I agree that
>> this can be expensive; especially for benchmarking purposes a 5x
>> overhead means you're benchmarking the sizing code, not the pipeline
>> itself.
>>
>> Beam computes estimates for PCollection sizes by using coders and
>> sampling, and publishes these as counters. It'd be best IMHO to reuse
>> this. Are these counters not sufficient?
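>>
>> A rough sketch of the idea (this is not the runners' actual
>> implementation; the sample rate and class name are assumptions):
>>
>> import java.io.ByteArrayOutputStream;
>> import java.io.IOException;
>> import java.util.concurrent.ThreadLocalRandom;
>> import org.apache.beam.sdk.coders.Coder;
>>
>> public class SampledSizeEstimator<T> {
>>   private static final double SAMPLE_RATE = 0.01; // estimate ~1% of elements
>>   private final Coder<T> coder;
>>
>>   public SampledSizeEstimator(Coder<T> coder) {
>>     this.coder = coder;
>>   }
>>
>>   /** Returns the encoded size of a sampled element, or -1 if skipped. */
>>   public long maybeEstimate(T element) throws IOException {
>>     if (ThreadLocalRandom.current().nextDouble() >= SAMPLE_RATE) {
>>       return -1; // skip most elements so we don't benchmark the sizing code
>>     }
>>     ByteArrayOutputStream bytes = new ByteArrayOutputStream();
>>     coder.encode(element, bytes); // size = length of the coder's encoding
>>     return bytes.size();
>>   }
>> }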
>>
>> On Tue, May 28, 2019 at 12:55 PM Łukasz Gajowy <[email protected]>
>> wrote:
>> >
>> > Hi all,
>> >
>> > part of our work while creating benchmarks for Beam is to collect the
>> > total data size (in bytes) that was put through the testing pipeline. We
>> > need that in load tests of core Beam operations (to see how big the load
>> > really was) and in IO tests (to calculate throughput). The "not so good"
>> > way we're doing it right now is that we add a DoFn step called
>> > "ByteMonitor" to the pipeline to get the size of every element using a
>> > utility called "ObjectSizeCalculator" [1].
>> >
>> > Problems with this approach:
>> > 1. It's computationally expensive. After introducing this change, tests
>> > are 5x slower than before, because the size of each record is calculated
>> > separately.
>> > 2. Naturally, the size of a particular record measured this way is
>> > greater than the size of the generated key + value itself. E.g. if a
>> > synthetic source generates a key + value that is 10 bytes in total, the
>> > collected total bytes metric is 8x greater (due to wrapping the value in
>> > richer objects, allocating more memory than needed, etc.).
>> >
>> > The main question here is: which size of particular records is more
>> > interesting in benchmarks? The, let's call it, "net" size (key + value
>> > size and nothing else), or the "gross" size (including all the memory
>> > allocated for a particular element in the PCollection and all the
>> > overhead of wrapping it in richer objects)? Maybe both sizes are worth
>> > measuring?
>> >
>> > For the "net" size we probably could (should?) do something similar to
>> > what the Nexmark suites have: pre-define a size for each element type and
>> > read it once the element is spotted in the pipeline [3].
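>> >
>> > For reference, the Nexmark approach in [3] boils down to an interface
>> > like this, with each model class reporting a pre-computed size (the
>> > Person fields below are a simplified sketch, not the actual Nexmark
>> > model):
>> >
>> > public interface KnownSize {
>> >   long sizeInBytes();
>> > }
>> >
>> > class Person implements KnownSize {
>> >   public long id;
>> >   public String name;
>> >
>> >   @Override
>> >   public long sizeInBytes() {
>> >     // 8 bytes for the long id plus one byte per character of the name
>> >     // (a simplifying assumption about the encoding).
>> >     return 8L + name.length();
>> >   }
>> > }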
>> >
>> > What do you think? Is there any other (efficient + reliable) way of
>> > measuring the total load size that I missed?
>> >
>> > Thanks for opinions!
>> >
>> > Best,
>> > Łukasz
>> >
>> > [1] https://github.com/apache/beam/blob/a16a5b71cf8d399070a72b0f062693180d56b5ed/sdks/java/testing/test-utils/src/main/java/org/apache/beam/sdk/testutils/metrics/ByteMonitor.java
>> > [2] https://issues.apache.org/jira/browse/BEAM-7431
>> > [3] https://github.com/apache/beam/blob/eb3b57554d9dc4057ad79bdd56c4239bd4204656/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/model/KnownSize.java
>>
>
>
