Hi there,
I want to add some load tests for core Beam operations (GBK, CoGBK,
Combine, ParDo, SideInput) on portable Flink in Java. For some reason some
of my test scenarios for Combine seem to crash the whole Flink cluster.
The PR introducing the tests is here:
https://github.com/apache/beam/pull/10386
As you can see in the logs of failing Jenkins jobs, the scenario combining
2GB of records on 5 workers works well, whereas 2GB fanned out on 16
workers fails with a cryptic message of either
"org.apache.beam.vendor.grpc.v1p26p0.io.grpc.StatusRuntimeException:
CANCELLED: cancelled before receiving half close"
or
"java.util.concurrent.TimeoutException: The heartbeat of TaskManager with
id container_e01_1580289509522_0001_01_000002 timed out."

link to logs with first error:
https://builds.apache.org/job/beam_LoadTests_Java_Combine_Portable_Flink_Batch_PR/9/console

link to logs with second error:
https://builds.apache.org/job/beam_LoadTests_Java_Combine_Portable_Flink_Batch_PR/11/console

I wonder what the issue might be here. When I ran the same load test on a
Flink cluster created in an identical way on a separate GCP project,
everything went well, which makes me think there may be something wrong
with the Dataproc setup.

Another important note - Combine load tests work and pass with the Python
portable runner.

I will be grateful for any help you can provide - I'm not a Flink cluster
expert and I have no idea how can I change configurations so that it works.

Thanks and have a good day!
Michal

-- 

Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.wale...@polidea.com

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>

Reply via email to