Beam 2.15.0 SparkRunner issues

Tim Robertson Thu, 03 Oct 2019 00:26:11 -0700

Hi all,

We haven't dug enough into this to know where to log issues, but I'll start
by sharing here.


After upgrading from Beam 2.10.0 to 2.15.0 we see issues on SparkRunner -
we suspect all of this related.

1. spark.default.parallelism is not respected

2. File writing (Avro) with dynamic destinations (grouped into folders by a
field name) consistently fail with
org.apache.beam.sdk.util.UserCodeException:
java.nio.file.FileAlreadyExistsException: Unable to rename resource
hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0
to
hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-00000-of-00001.avro
as destination already exists and couldn't be deleted.

3. GBK operations that run over 500M small records consistently fail with
OOM. We tried different configs with 48GB, 60GB, 80GB executor memory

Our pipelines run are batch, simple transformations with either an
HBaseSnapshot to Avro files or a merge of records in Avro (the GBK issue)
pushed to ElasticSearch (it fails upstream of the ElasticsearchIO in the
GBK stage).

We notice operations that were mapToPair  in 2.10.0 become repartition
operations ( (mapToPair at GroupCombineFunctions.java:68 becomes
repartition at GroupCombineFunctions.java:202)) which might be related to
this and looks surprising.

I'll report more as we learn. If anyone has any immediate ideas based on
their commits or reviews or if you wish an tests run on other Beam versions
please say.

Thanks,
Tim

Beam 2.15.0 SparkRunner issues

Reply via email to