Thanks for reporting back Tim and great you found that it was a GC issue. In the meantime I filled BEAM-8384  for the `spark.default.parallelism` issue. The Spark runner (both classic and structured streaming) only in exceptional cases tries to improve the default configuration values, but if the user overwrites a default value the runner should respect this value otherwise users will be confused and won’t be able to use well known Spark tunings like in your case.
 https://issues.apache.org/jira/browse/BEAM-8384 On Tue, Oct 8, 2019 at 1:19 PM Tim Robertson <timrobertson...@gmail.com> wrote: > > I'm sorry for not replying. We are super busy trying to prepare data to > release. > > An update: > - We were using G1GC and through slack were advised against that. This fixed > the OOM error we saw and all our 2.15.0 jobs did complete > > When we have time (after 3 weeks) I'll try and isolate a test case with the > reshuffle example and parallelism. > > Thanks, > Tim > > > On Thu, Oct 3, 2019 at 1:21 PM Jan Lukavský <je...@seznam.cz> wrote: >> >> Hi Tim, >> >> can you please elaborate more about some parts? >> >> 1) What happens actually in your case? What is the specific settings you >> use? >> >> 3) Can you share stacktrace? Is it always the same, or does it change? >> >> The mentioned GroupCombineFunctions.java:202 comes from a Reshuffle, >> which seems to make a little sense to me regarding the logic you >> described. Do you use Reshuffle transform or does it expand from some >> other transform? >> >> Jan >> >> On 10/3/19 9:24 AM, Tim Robertson wrote: >> > Hi all, >> > >> > We haven't dug enough into this to know where to log issues, but I'll >> > start by sharing here. >> > >> > After upgrading from Beam 2.10.0 to 2.15.0 we see issues on >> > SparkRunner - we suspect all of this related. >> > >> > 1. spark.default.parallelism is not respected >> > >> > 2. File writing (Avro) with dynamic destinations (grouped into folders >> > by a field name) consistently fail with >> > org.apache.beam.sdk.util.UserCodeException: >> > java.nio.file.FileAlreadyExistsException: Unable to rename resource >> > hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0 >> > to >> > hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-00000-of-00001.avro >> > as destination already exists and couldn't be deleted. >> > >> > 3. GBK operations that run over 500M small records consistently fail >> > with OOM. We tried different configs with 48GB, 60GB, 80GB executor >> > memory >> > >> > Our pipelines run are batch, simple transformations with either an >> > HBaseSnapshot to Avro files or a merge of records in Avro (the GBK >> > issue) pushed to ElasticSearch (it fails upstream of the >> > ElasticsearchIO in the GBK stage). >> > >> > We notice operations that were mapToPair in 2.10.0 become repartition >> > operations ( (mapToPair at GroupCombineFunctions.java:68 becomes >> > repartition at GroupCombineFunctions.java:202)) which might be related >> > to this and looks surprising. >> > >> > I'll report more as we learn. If anyone has any immediate ideas based >> > on their commits or reviews or if you wish an tests run on other Beam >> > versions please say. >> > >> > Thanks, >> > Tim >> > >> > >> >