On Tue, Feb 21, 2017 at 6:15 AM, Steve Loughran <ste...@hortonworks.com> wrote:
> On 21 Feb 2017, at 01:00, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > You'd have to encode the task ID in the output file name to identify files 
> > to roll back in the event you need to revert a task, but if you have 
> > partitioned output, you have to do a lot of directory listing to find all 
> > the files that need to be removed. That, or you could risk duplicate data 
> > by not rolling back tasks.
>
> Bear in mind that recursive directory listing isn't so expensive once you 
> have the O(1)-ish listFiles(files, recursive) operation of HADOOP-13208.

This doesn't work so well for us because we have an additional layer
that provides atomic commits by swapping locations in the metastore.
That leaves areas in the prefix listing where old data is mixed with
new, so a recursive listing alone can't tell us which files to remove.
You make a good point that listing an entire repository won't be quite
so expensive in the future, but I'd still rather avoid it.
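
For anyone following along, a rough sketch of that rollback-by-listing
approach against the Hadoop FileSystem API looks something like the
following. This is illustrative only, not code from either committer;
the attempt-ID-in-filename convention is assumed:

    // Illustrative sketch only, not actual rollback code. Assumes the task
    // attempt ID is embedded in each file name, and uses the
    // FileSystem.listFiles(path, recursive) call that HADOOP-13208 speeds up
    // on S3A by turning the recursion into a flat prefix listing.
    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    class TaskRollback {
      /** Delete every file under outputRoot written by the given task attempt. */
      static void revertTask(FileSystem fs, Path outputRoot, String attemptId)
          throws IOException {
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(outputRoot, true);
        while (it.hasNext()) {
          Path file = it.next().getPath();
          if (file.getName().contains(attemptId)) {
            fs.delete(file, false);
          }
        }
      }
    }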

> . . . I'm choreographing the task and committer via data serialized to S3 
> itself. On a task failure that will allow us to roll back all completely 
> written files, without the need for any task-job communications. I'm still 
> thinking about having an optional+async scan for pending commits to the dest 
> path, to identify problems and keep bills down.

What do you mean by avoiding task-job communications? Don't you still
need to communicate to the job committer which files the tasks
produced? It sounds like you're using S3 for that, instead of HDFS via
the FileOutputCommitter as this implementation does. Long-term, I'd
probably agree that we shouldn't rely on HDFS, but it seems like a good
way to get something working right now.
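
The shape of the idea, as I understand it, is something like the sketch
below: each task serializes a small record of its in-flight multipart
uploads to a well-known "pending" prefix, and the job committer reads
that prefix at commit time instead of receiving anything from the tasks
directly. All the names and the layout here are hypothetical:

    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.List;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical record of one uncommitted multipart upload.
    class PendingCommit implements Serializable {
      String bucket;
      String key;
      String uploadId;     // multipart upload ID to complete or abort later
      List<String> etags;  // one ETag per uploaded part, in part order
    }

    class PendingCommitWriter {
      /** Persist everything the job committer needs to finish this task's uploads. */
      static void writePending(FileSystem fs, Path pendingDir, String taskAttemptId,
                               List<PendingCommit> commits) throws IOException {
        Path file = new Path(pendingDir, taskAttemptId + ".pending");
        try (ObjectOutputStream out = new ObjectOutputStream(fs.create(file, true))) {
          out.writeObject(commits);
        }
      }
    }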

> > The flaw in this approach is that you can still get partial writes if the 
> > driver fails while running the job committer, but it covers the other cases.
>
> There's a bit of that in both the FileOutputFormat and indeed, in 
> HadoopMapReduceCommitProtocol. It's just a small window, especially if you do 
> those final PUTs in parallel

Yeah, it is unavoidable right now with the current table format.
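
On the "final PUTs in parallel" point, a rough sketch of what completing
the pending uploads concurrently in the job commit could look like with
the AWS SDK (v1) is below; PendingCommit is the hypothetical record from
the earlier sketch, and none of this is taken from either implementation:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
    import com.amazonaws.services.s3.model.PartETag;

    class ParallelJobCommit {
      /** Complete all outstanding multipart uploads concurrently, so the window
       *  in which only part of the output is visible stays as short as possible. */
      static void commitAll(AmazonS3 s3, List<PendingCommit> pending, int threads)
          throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (PendingCommit p : pending) {
          pool.submit(() -> {
            List<PartETag> parts = new ArrayList<>();
            for (int i = 0; i < p.etags.size(); i++) {
              parts.add(new PartETag(i + 1, p.etags.get(i)));
            }
            // Completing the upload is the moment the object becomes visible.
            s3.completeMultipartUpload(
                new CompleteMultipartUploadRequest(p.bucket, p.key, p.uploadId, parts));
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
      }
    }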

> > We're working on getting users moved over to the new committers, so now 
> > seems like a good time to get a copy out to the community. Please let me 
> > know what you think.
>
> I'll have a look at your code, see how it compares to mine. I'm able to take 
> advantage of the fact that we can tune the S3A FS, for example, by modifying 
> the block output stream to *not* commit its work in the final close()
>
> https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ABlockOutputStream.java
>
> This means that provided the output writer doesn't attempt to read the file 
> it's just written, we can do a write straight to the final destination

We're also considering an implementation that does exactly what we use
the local FS for today, but keeps the data in memory. When you close
the stream, you'd get the same PendingUpload object we currently build
from the file system. This is something we put off in order to get
something out more quickly.
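
To make that concrete, a minimal sketch of the buffering half is below;
the PendingUpload construction itself isn't shown, since that's the part
we already have:

    import java.io.ByteArrayOutputStream;

    // Buffers a task's output entirely in memory; close() never touches the
    // local FS or S3. The caller builds the same PendingUpload it builds today,
    // just from finishedBytes() instead of from a temp file on local disk.
    class InMemoryPartBuffer extends ByteArrayOutputStream {
      private boolean closed = false;

      @Override
      public void close() {
        // ByteArrayOutputStream.close() is a no-op; we only record completion
        // so the committer knows the writer is done with this part.
        closed = true;
      }

      byte[] finishedBytes() {
        if (!closed) {
          throw new IllegalStateException("stream has not been closed yet");
        }
        return toByteArray();
      }
    }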

Thanks for taking a look!
