Thanks for reporting back, Tim, and great that you found it was a GC issue.

In the meantime I filed BEAM-8384 [1] for the
`spark.default.parallelism` issue. The Spark runner (both classic and
structured streaming) tries to improve on the default configuration
values only in exceptional cases, but if the user overrides a default
value the runner should respect it; otherwise users will be confused
and won't be able to apply well-known Spark tunings like the one in
your case.
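
For context, the tuning in question is applied on the Spark side, and
the runner is expected to pick it up as-is. A minimal sketch of how
such a pipeline is typically launched (the parallelism value here is
illustrative, not a fix):

    // spark.default.parallelism is a Spark property, so it is passed
    // to Spark itself, e.g.:
    //   spark-submit --conf spark.default.parallelism=1200 ...
    // On the Beam side the pipeline is built as usual:
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    Pipeline pipeline = Pipeline.create(options);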

[1] https://issues.apache.org/jira/browse/BEAM-8384

On Tue, Oct 8, 2019 at 1:19 PM Tim Robertson <timrobertson...@gmail.com> wrote:
>
> I'm sorry for not replying. We are super busy trying to prepare data for
> release.
>
> An update:
> - We were using G1GC and were advised on Slack against that. Switching
> away from it fixed the OOM error we saw and all our 2.15.0 jobs
> completed.
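>
> For reference, the change was only in the executor JVM options; a
> minimal sketch (which collector to use instead of G1 depends on the
> setup, ParallelGC here is just an assumption):
>
>     // Set when building the Spark conf (equivalently via --conf on
>     // spark-submit); this moves the executors off G1GC:
>     SparkConf conf = new SparkConf()
>         .set("spark.executor.extraJavaOptions", "-XX:+UseParallelGC");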
>
> When we have time (in about 3 weeks) I'll try to isolate a test case
> for the reshuffle example and the parallelism issue.
>
> Thanks,
> Tim
>
>
> On Thu, Oct 3, 2019 at 1:21 PM Jan Lukavský <je...@seznam.cz> wrote:
>>
>> Hi Tim,
>>
>> could you please elaborate on a few points?
>>
>> 1) What actually happens in your case? What are the specific settings
>> you use?
>>
>> 3) Can you share a stacktrace? Is it always the same, or does it change?
>>
>> The mentioned GroupCombineFunctions.java:202 comes from a Reshuffle,
>> which seems to make little sense to me given the logic you described.
>> Do you use the Reshuffle transform directly, or does it expand from
>> some other transform?
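>>
>> For clarity, an explicit use would look like the snippet below (a
>> sketch; `Record` is a placeholder element type and `keyed`/`records`
>> stand for the input PCollections):
>>
>>     // Explicit reshuffle of a keyed collection:
>>     PCollection<KV<String, Record>> shuffled =
>>         keyed.apply(Reshuffle.of());
>>     // or, for an un-keyed collection:
>>     PCollection<Record> redistributed =
>>         records.apply(Reshuffle.viaRandomKey());
>>
>> Several IO and write transforms also expand to a Reshuffle internally,
>> which is why it can appear in the DAG without being applied directly.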
>>
>> Jan
>>
>> On 10/3/19 9:24 AM, Tim Robertson wrote:
>> > Hi all,
>> >
>> > We haven't dug enough into this to know where to log issues, but I'll
>> > start by sharing here.
>> >
>> > After upgrading from Beam 2.10.0 to 2.15.0 we see issues on the
>> > SparkRunner - we suspect all of these are related.
>> >
>> > 1. spark.default.parallelism is not respected
>> >
>> > 2. File writing (Avro) with dynamic destinations (grouped into folders
>> > by a field name) consistently fail with
>> > org.apache.beam.sdk.util.UserCodeException:
>> > java.nio.file.FileAlreadyExistsException: Unable to rename resource
>> > hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0
>> > to
>> > hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-00000-of-00001.avro
>> > as destination already exists and couldn't be deleted.
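>> >
>> > For reference, the write is shaped roughly like this (a simplified
>> > sketch of our job; `ExportRecord`, `records` and the grouping field
>> > are placeholders):
>> >
>> >     // Dynamic destinations: one output folder per value of a field.
>> >     records.apply(FileIO.<String, ExportRecord>writeDynamic()
>> >         .by(r -> r.getDatasetKey())
>> >         .withDestinationCoder(StringUtf8Coder.of())
>> >         .via(AvroIO.sink(ExportRecord.class))
>> >         .to("hdfs://ha-nn/pipelines/export")
>> >         .withNaming(key -> FileIO.Write.defaultNaming(
>> >             key + "/verbatimHBaseExport", ".avro")));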
>> >
>> > 3. GBK operations that run over 500M small records consistently fail
>> > with OOM. We tried different configs with 48GB, 60GB and 80GB of
>> > executor memory.
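>> >
>> > The failing stage is essentially a plain group-by-key; a minimal
>> > sketch of its shape (types are placeholders):
>> >
>> >     // `keyed` is the PCollection<KV<String, Record>> feeding the
>> >     // stage; after the shuffle all values for a key are materialized
>> >     // together, which is where the memory pressure shows up:
>> >     PCollection<KV<String, Iterable<Record>>> grouped =
>> >         keyed.apply(GroupByKey.create());
>> >
>> > (Where the per-key logic is associative, Combine.perKey pre-aggregates
>> > before the shuffle, which usually eases this.)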
>> >
>> > Our pipelines are batch, with simple transformations: either an HBase
>> > snapshot to Avro files, or a merge of records in Avro (the GBK issue)
>> > pushed to Elasticsearch (it fails upstream of the ElasticsearchIO in
>> > the GBK stage).
>> >
>> > We notice that operations that were mapToPair in 2.10.0 have become
>> > repartition operations (mapToPair at GroupCombineFunctions.java:68
>> > becomes repartition at GroupCombineFunctions.java:202), which might be
>> > related to this and looks surprising.
>> >
>> > I'll report more as we learn. If anyone has any immediate ideas based
>> > on their commits or reviews, or if you would like tests run on other
>> > Beam versions, please say.
>> >
>> > Thanks,
>> > Tim
>> >
>> >
>> >
