v1 on GCS isn't safe either, as promotion from task attempt to successful
task is a directory rename: fast and atomic on HDFS, but O(files) and
non-atomic on GCS.
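For anyone who wants to experiment anyway, both knobs discussed in this
thread can be passed straight to spark-submit. A minimal sketch, untested;
the property names are the ones from Chris's mail below and
mapred-default.xml, and the application jar is a placeholder:

    # sketch, untested: pin the commit algorithm explicitly and disable
    # speculative execution. Note the caveat above: v1 is still not
    # atomic on GCS.
    spark-submit \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 \
      --conf spark.speculation=false \
      your-job.jar   # placeholder application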
If I can get that Hadoop 3.3.5 RC out soon, the manifest committer will be
there to test (a config sketch follows at the foot of this mail):
https://issues.apache.org/jira/browse/MAPREDUCE-7341

Until then, as Chris says, turn off speculative execution.

On Fri, 21 Oct 2022 at 23:39, Chris Nauroth <cnaur...@apache.org> wrote:

> Some users have observed issues like what you're describing related to
> the job commit algorithm, which is controlled by the configuration
> property spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.
> Hadoop's default value for this setting is 2. You can find a description
> of the algorithms in Hadoop's configuration documentation:
>
> https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
>
> Algorithm version 2 is faster, because the final task output file
> renames can be issued in parallel by individual tasks. Unfortunately,
> there have been reports of it causing side effects like what you
> described, especially if there are a lot of task attempt retries or
> speculative execution (configuration property spark.speculation set to
> true instead of the default false). You could try switching to algorithm
> version 1. The drawback is that it's slower, because the final output
> renames are executed single-threaded at the end of the job. The
> performance impact is more noticeable for jobs with many tasks, and the
> effect is amplified when using cloud storage as opposed to HDFS running
> in the same network.
>
> If you are using speculative execution, then you could also potentially
> try turning that off.
>
> Chris Nauroth
>
>
> On Wed, Oct 19, 2022 at 8:18 AM Martin Andersson <
> martin.anders...@kambi.com> wrote:
>
>> Is your Spark job batch or streaming?
>> ------------------------------
>> *From:* Sandeep Vinayak <vnayak...@gmail.com>
>> *Sent:* Tuesday, October 18, 2022 19:48
>> *To:* dev@spark.apache.org <dev@spark.apache.org>
>> *Subject:* Missing data in spark output
>>
>> Hello Everyone,
>>
>> We have recently been observing intermittent data loss in Spark jobs
>> that write output to GCS (Google Cloud Storage). When there are missing
>> rows, they are accompanied by duplicate rows. A re-run of the job
>> doesn't have any duplicate or missing rows. Since it's hard to debug, we
>> are first trying to understand the potential theoretical root cause of
>> this issue: could this be a GCS-specific issue, where GCS might not be
>> handling consistency well? Any tips will be super helpful.
>>
>> Thanks,
>>
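PS: the manifest committer config promised above. A sketch only, from my
reading of the MAPREDUCE-7341 docs and untested against the 3.3.5 RC; it
assumes the spark-hadoop-cloud module is on the classpath, and the jar
name is again a placeholder:

    # sketch, untested: bind the gs:// scheme to the manifest committer
    # (Hadoop 3.3.5+) and route Spark SQL commits through it via the
    # spark-hadoop-cloud binding classes.
    spark-submit \
      --conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs=org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory \
      --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
      --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
      your-job.jar   # placeholder application

Double-check the property and class names against the release docs once
the RC is out.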