v1 on GCS isn't safe either, as promotion from task attempt to successful
task is a directory rename: fast and atomic on HDFS, but O(files) and
non-atomic on GCS.
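For anyone who wants to experiment anyway, both knobs discussed in this
thread can be passed straight to spark-submit. A minimal sketch, untested;
the property names are the ones from Chris's mail below and
mapred-default.xml, and the application jar is a placeholder:

    # sketch, untested: pin the commit algorithm explicitly and disable
    # speculative execution. Note the caveat above: v1 is still not
    # atomic on GCS.
    spark-submit \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 \
      --conf spark.speculation=false \
      your-job.jar   # placeholder application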
If I can get that Hadoop 3.3.5 RC out soon, the manifest committer will be
there to test (a config sketch follows at the foot of this mail):
https://issues.apache.org/jira/browse/MAPREDUCE-7341

Until then, as Chris says, turn off speculative execution.

On Fri, 21 Oct 2022 at 23:39, Chris Nauroth <cnaur...@apache.org> wrote:

> Some users have observed issues like what you're describing related to
> the job commit algorithm, which is controlled by the configuration
> property spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.
> Hadoop's default value for this setting is 2. You can find a description
> of the algorithms in Hadoop's configuration documentation:
>
> https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
>
> Algorithm version 2 is faster, because the final task output file
> renames can be issued in parallel by individual tasks. Unfortunately,
> there have been reports of it causing side effects like what you
> described, especially if there are a lot of task attempt retries or
> speculative execution (configuration property spark.speculation set to
> true instead of the default false). You could try switching to algorithm
> version 1. The drawback is that it's slower, because the final output
> renames are executed single-threaded at the end of the job. The
> performance impact is more noticeable for jobs with many tasks, and the
> effect is amplified when using cloud storage as opposed to HDFS running
> in the same network.
>
> If you are using speculative execution, then you could also potentially
> try turning that off.
>
> Chris Nauroth
>
>
> On Wed, Oct 19, 2022 at 8:18 AM Martin Andersson <
> martin.anders...@kambi.com> wrote:
>
>> Is your Spark job batch or streaming?
>> ------------------------------
>> *From:* Sandeep Vinayak <vnayak...@gmail.com>
>> *Sent:* Tuesday, October 18, 2022 19:48
>> *To:* dev@spark.apache.org <dev@spark.apache.org>
>> *Subject:* Missing data in spark output
>>
>> Hello Everyone,
>>
>> We have recently been observing intermittent data loss in Spark jobs
>> that write output to GCS (Google Cloud Storage). When there are missing
>> rows, they are accompanied by duplicate rows. A re-run of the job
>> doesn't have any duplicate or missing rows. Since it's hard to debug, we
>> are first trying to understand the potential theoretical root cause of
>> this issue: could this be a GCS-specific issue, where GCS might not be
>> handling consistency well? Any tips will be super helpful.
>>
>> Thanks,
>>
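PS: the manifest committer config promised above. A sketch only, from my
reading of the MAPREDUCE-7341 docs and untested against the 3.3.5 RC; it
assumes the spark-hadoop-cloud module is on the classpath, and the jar
name is again a placeholder:

    # sketch, untested: bind the gs:// scheme to the manifest committer
    # (Hadoop 3.3.5+) and route Spark SQL commits through it via the
    # spark-hadoop-cloud binding classes.
    spark-submit \
      --conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs=org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory \
      --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
      --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
      your-job.jar   # placeholder application

Double-check the property and class names against the release docs once
the RC is out.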