I'm sorry for not replying. We are super busy trying to prepare data to release.
An update: - We were using G1GC and through slack were advised against that. This fixed the OOM error we saw and all our 2.15.0 jobs did complete When we have time (after 3 weeks) I'll try and isolate a test case with the reshuffle example and parallelism. Thanks, Tim On Thu, Oct 3, 2019 at 1:21 PM Jan Lukavský <[email protected]> wrote: > Hi Tim, > > can you please elaborate more about some parts? > > 1) What happens actually in your case? What is the specific settings you > use? > > 3) Can you share stacktrace? Is it always the same, or does it change? > > The mentioned GroupCombineFunctions.java:202 comes from a Reshuffle, > which seems to make a little sense to me regarding the logic you > described. Do you use Reshuffle transform or does it expand from some > other transform? > > Jan > > On 10/3/19 9:24 AM, Tim Robertson wrote: > > Hi all, > > > > We haven't dug enough into this to know where to log issues, but I'll > > start by sharing here. > > > > After upgrading from Beam 2.10.0 to 2.15.0 we see issues on > > SparkRunner - we suspect all of this related. > > > > 1. spark.default.parallelism is not respected > > > > 2. File writing (Avro) with dynamic destinations (grouped into folders > > by a field name) consistently fail with > > org.apache.beam.sdk.util.UserCodeException: > > java.nio.file.FileAlreadyExistsException: Unable to rename resource > > > hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0 > > > to > > > hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-00000-of-00001.avro > > > as destination already exists and couldn't be deleted. > > > > 3. GBK operations that run over 500M small records consistently fail > > with OOM. We tried different configs with 48GB, 60GB, 80GB executor > > memory > > > > Our pipelines run are batch, simple transformations with either an > > HBaseSnapshot to Avro files or a merge of records in Avro (the GBK > > issue) pushed to ElasticSearch (it fails upstream of the > > ElasticsearchIO in the GBK stage). > > > > We notice operations that were mapToPair in 2.10.0 become repartition > > operations ( (mapToPair at GroupCombineFunctions.java:68 becomes > > repartition at GroupCombineFunctions.java:202)) which might be related > > to this and looks surprising. > > > > I'll report more as we learn. If anyone has any immediate ideas based > > on their commits or reviews or if you wish an tests run on other Beam > > versions please say. > > > > Thanks, > > Tim > > > > > > >
