Re: Output Committers for S3

Steve Loughran Tue, 20 Jun 2017 14:18:42 -0700

> On 20 Jun 2017, at 07:49, sririshindra <sririshin...@gmail.com> wrote:
> 
> Is there anything similar to s3 connector for Google cloud storage?
> Since Google cloud Storage is also an object store rather than a file
> system, I imagine the same problem that the s3 connector is trying to solve
> arises with google cloud storage as well.
> 
> Thanks,
> rishi
>

That's Google's problem for now.

S3 has some specific issues:

1. There's no rename. The FileSystem.rename() call is mocked by LIST, COPY &
DELETE, so it takes O(data). This is the slow bit people complain about.
2. It has list inconsistency, so that LIST may actually miss the data. This is
the bit people should be worrying about.
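
To make both points concrete, here is a minimal sketch using a toy in-memory
store. ToyObjectStore, emulated_rename and the visible_in_list flag are all
made up for illustration; this is not the S3A code, just the LIST, COPY &
DELETE pattern under the assumption that LIST can lag behind a recent PUT.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store (hypothetical, not an S3 client)."""

    def __init__(self):
        self.objects = {}              # key -> bytes
        self.hidden_from_list = set()  # keys that LIST does not return yet
        self.bytes_copied = 0          # total bytes moved by COPY

    def put(self, key, data, visible_in_list=True):
        self.objects[key] = data
        if not visible_in_list:
            self.hidden_from_list.add(key)

    def list(self, prefix):
        # An inconsistent LIST: objects exist but may be missing from the result.
        return sorted(k for k in self.objects
                      if k.startswith(prefix) and k not in self.hidden_from_list)

    def copy(self, src, dst):
        data = self.objects[src]
        self.objects[dst] = data
        self.bytes_copied += len(data)  # cost grows with the data, not the file count

    def delete(self, key):
        del self.objects[key]
        self.hidden_from_list.discard(key)


def emulated_rename(store, src_prefix, dst_prefix):
    """Directory 'rename' built from LIST, then COPY + DELETE per object."""
    for key in store.list(src_prefix):
        store.copy(key, dst_prefix + key[len(src_prefix):])
        store.delete(key)


store = ToyObjectStore()
store.put("tmp/job/part-0000", b"a" * 1000)
store.put("tmp/job/part-0001", b"b" * 2000, visible_in_list=False)  # LIST lags
emulated_rename(store, "tmp/job/", "out/")

print(store.bytes_copied)   # bytes copied for the one object LIST returned
print(store.list("out/"))   # part-0001 never arrived, and nothing failed
```

The rename reports no error, yet part-0001 is still sitting under tmp/job/ and
is absent from out/.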

The now-deleted DirectOutputCommitter didn't use rename, so it avoided both
issues; it just didn't handle failures or speculation. In the absence of
failures, all the data did end up in the right place, whereas list
inconsistency in rename() means that you may have unobserved data loss. That's
the big problem.
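
A rough illustration of the failure case for direct output, as a toy model
rather than the real committer: each task attempt writes its files straight
into the destination (a dict here). The attempt-specific file names and the
direct_task_attempt helper are assumptions for the sketch.

```python
def direct_task_attempt(dest, task, attempt, files, fail_after=None):
    """Write files straight into dest; there is no staging area to discard."""
    for n, (name, data) in enumerate(files):
        if fail_after is not None and n == fail_after:
            raise RuntimeError(f"task {task} attempt {attempt} failed")
        # Assumed naming scheme: one object per (file, task, attempt).
        dest[f"{name}-t{task}-a{attempt}"] = data


dest = {}
files = [("part-x", b"1"), ("part-y", b"2")]

try:
    # First attempt dies after writing one of its two files.
    direct_task_attempt(dest, task=0, attempt=0, files=files, fail_after=1)
except RuntimeError:
    pass  # nothing cleans up what the failed attempt already wrote

# The retry succeeds and writes both files.
direct_task_attempt(dest, task=0, attempt=1, files=files)

print(sorted(dest))  # part-x now appears twice: once per attempt
```

With no failure the destination holds exactly the expected files; with one
failed attempt, its leftover output stays visible next to the retry's.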

Azure WASB has fast atomic rename, so doesn't have the specific S3 problem.

The work I'm doing in Hadoop for committers (HADOOP-13786) is designed to make
it possible to put in different committers under the FileOutputFormat data
writer, without the layers above having to worry; if the other blobstores need
a new commit algorithm, it will be more straightforward. And the tests I'm
doing will be mostly written to be retargetable from the outset.
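
The general shape of "different committers under one writer" could be sketched
like this. It is not the HADOOP-13786 design and not the Hadoop
OutputCommitter API; Committer, RenameCommitter, COMMITTERS, committer_for and
the _temporary/attempt_0/ layout are hypothetical names, and the store is just
a dict.

```python
from abc import ABC, abstractmethod


class Committer(ABC):
    """Hypothetical plug-in point chosen per filesystem scheme."""

    @abstractmethod
    def commit_task(self, store, attempt_prefix, dest_prefix):
        """Make one task attempt's files visible under dest_prefix."""


class RenameCommitter(Committer):
    """Suits a store where rename is fast and atomic."""

    def commit_task(self, store, attempt_prefix, dest_prefix):
        for key in [k for k in store if k.startswith(attempt_prefix)]:
            store[dest_prefix + key[len(attempt_prefix):]] = store.pop(key)


# Registry keyed by scheme; another blobstore would register its own class.
COMMITTERS = {"wasb": RenameCommitter}


def committer_for(scheme):
    try:
        return COMMITTERS[scheme]()
    except KeyError:
        raise ValueError(f"no committer registered for {scheme!r}") from None


def write_and_commit(store, scheme, task_files, dest_prefix):
    """The writer only talks to the Committer interface."""
    attempt_prefix = dest_prefix + "_temporary/attempt_0/"
    for name, data in task_files.items():
        store[attempt_prefix + name] = data
    committer_for(scheme).commit_task(store, attempt_prefix, dest_prefix)


store = {}
write_and_commit(store, "wasb", {"part-0000": b"x"}, "out/")
print(store)
```

The point of the split is that write_and_commit never changes when a new
commit algorithm is registered for another scheme.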

Finally, a quick talk from last week about filesystems, POSIX, object stores
and NVM, and how the commit problems get pulled both ways. If your RAM
persists, you'd better be doing atomic record updates and making sure that
when you write back something you want persisted, it isn't left cached by the
CPU.

https://www.youtube.com/watch?v=UOE2m_XUr3U&feature=youtu.be&list=PLq-odUc2x7i-9Nijx-WfoRMoAfHC9XzTt

-Steve

