On Tue, Feb 21, 2017 at 6:15 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> On 21 Feb 2017, at 01:00, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> > You'd have to encode the task ID in the output file name to identify
> > files to roll back in the event you need to revert a task, but if you
> > have partitioned output, you have to do a lot of directory listing to
> > find all the files that need to be removed. That, or you could risk
> > duplicate data by not rolling back tasks.
>
> Bear in mind that recursive directory listing isn't so expensive once
> you have the O(1)-ish listFiles(files, recursive) operation of
> HADOOP-13208.
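[For readers following along: the rollback scheme being discussed can be sketched roughly as below. This is a minimal illustration, not code from either committer; the class and method names, and the exact file-name pattern, are invented for the example. The only real API referenced is Hadoop's recursive `FileSystem.listFiles(path, recursive)` from HADOOP-13208, which is what would supply the listing in practice.]

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: embed the task attempt ID in each output file name so that a
// single recursive listing of the destination is enough to find every
// file a failed attempt wrote. Names here are illustrative only.
public class TaskIdNaming {

    // e.g. outputFileName(3, "attempt_0001_m_000003_0", "parquet")
    //   -> "part-00003-attempt_0001_m_000003_0.parquet"
    public static String outputFileName(int partition, String taskAttemptId,
                                        String extension) {
        return String.format("part-%05d-%s.%s", partition, taskAttemptId, extension);
    }

    // Given a recursive listing of the destination (in real code, the
    // names returned by FileSystem.listFiles(dest, true)), select the
    // files that reverting the given attempt must delete.
    public static List<String> filesToRollBack(List<String> listing,
                                               String taskAttemptId) {
        return listing.stream()
                .filter(name -> name.contains("-" + taskAttemptId + "."))
                .collect(Collectors.toList());
    }
}
```

The trade-off Ryan describes is visible here: without the recursive listing being cheap, scanning a deeply partitioned destination for matching names is the expensive part, which is why HADOOP-13208 changes the calculus.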
This doesn't work so well for us because we have an additional layer to
provide atomic commits by swapping locations in the metastore. That causes
areas in the prefix listing where we have old data mixed with new. You make
a good point that it won't be quite so expensive to list an entire
repository in the future, but I'd still rather avoid it.

> . . . I'm choreographing the task and committer via data serialized to
> S3 itself. On a task failure that will allow us to roll back all
> completely written files, without the need for any task-job
> communications. I'm still thinking about having an optional+async scan
> for pending commits to the dest path, to identify problems and keep
> bills down.

What do you mean by avoiding task-job communications? Don't you still need
to communicate to the job commit what files the tasks produced? It sounds
like you're using S3 for that instead of HDFS with the FileOutputCommitter
like this does. Long-term, I'd probably agree that we shouldn't rely on
HDFS, but it seems like a good way to get something working right now.

> > The flaw in this approach is that you can still get partial writes if
> > the driver fails while running the job committer, but it covers the
> > other cases.
>
> There's a bit of that in both the FileOutputFormat and indeed, in
> HadoopMapReduceCommitProtocol. It's just a small window, especially if
> you do those final PUTs in parallel

Yeah, it is unavoidable right now with the current table format.

> > We're working on getting users moved over to the new committers, so
> > now seems like a good time to get a copy out to the community. Please
> > let me know what you think.
>
> I'll have a look at your code, see how it compares to mine.
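[Aside for readers: the "task-job communication" both approaches need amounts to each task persisting a small manifest of its pending output somewhere the job committer can read it back — Steve's approach stores it in S3 itself, the patch under discussion uses HDFS. A hedged sketch of such a manifest, with entirely invented names (real implementations would use JSON or Avro rather than this ad-hoc encoding):]

```java
import java.util.Arrays;
import java.util.List;

// Sketch: a per-task manifest describing uploads that are written but not
// yet committed. The task serializes one of these to a well-known path
// (S3 or HDFS); the job committer deserializes them all and completes or
// aborts the uploads. Illustrative only — not either implementation's API.
public class PendingCommit {
    final String taskAttemptId;
    final List<String> destinationKeys;

    PendingCommit(String taskAttemptId, List<String> destinationKeys) {
        this.taskAttemptId = taskAttemptId;
        this.destinationKeys = destinationKeys;
    }

    // One line per field; a real manifest format would be self-describing.
    String serialize() {
        return taskAttemptId + "\n" + String.join("\n", destinationKeys);
    }

    static PendingCommit deserialize(String data) {
        String[] lines = data.split("\n");
        return new PendingCommit(lines[0],
                Arrays.asList(lines).subList(1, lines.length));
    }
}
```

The point of persisting the manifest, rather than sending it over an RPC channel, is that a scan of the manifest location can also find orphaned pending commits after a failure — the "optional+async scan" Steve mentions.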
> I'm able to take advantage of the fact that we can tune the S3A FS, for
> example, by modifying the block output stream to *not* commit its work
> in the final close()
>
> https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ABlockOutputStream.java
>
> This means that provided the output writer doesn't attempt to read the
> file it's just written, we can do a write straight to the final
> destination

We're also considering an implementation that does exactly what we use the
local FS for today, but keeps data in memory. When you close the stream,
you'd get the same PendingUpload object we build from the file system. This
is something we put off to get something out more quickly.

Thanks for taking a look!
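[Aside for readers: the "don't commit on close()" idea both messages circle around can be sketched as below. In S3A the pending work would be an initiated-but-uncompleted multipart upload; this toy version just buffers bytes in memory, as in the in-memory variant Ryan mentions. `DeferredCommitStream` and this `PendingUpload` shape are invented for illustration and are not the real classes.]

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Sketch: an output stream whose close() does NOT make the data visible
// at the destination. Instead it hands back a pending-upload handle that
// the job committer later completes (commit) or discards (abort).
public class DeferredCommitStream extends ByteArrayOutputStream {
    private final String destKey;
    private PendingUpload pending;

    public DeferredCommitStream(String destKey) {
        this.destKey = destKey;
    }

    @Override
    public void close() throws IOException {
        super.close();
        // In S3A this step would upload the final part and record the
        // multipart upload ID and part ETags, without completing it.
        pending = new PendingUpload(destKey, toByteArray());
    }

    public PendingUpload getPendingUpload() {
        return pending;
    }

    public static final class PendingUpload {
        public final String destKey;
        public final byte[] data;

        PendingUpload(String destKey, byte[] data) {
            this.destKey = destKey;
            this.data = data;
        }
    }
}
```

The caveat Steve notes falls out of this design: until the job committer completes the pending upload, nothing exists at the destination key, so a writer that tries to read back the "file" it just closed will see nothing.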