> On 20 Jun 2017, at 07:49, sririshindra <sririshin...@gmail.com> wrote:
> 
> Is there anything similar to s3 connector for Google cloud storage?
> Since Google cloud Storage is also an object store rather than a file
> system, I imagine the same problem that the s3 connector is trying to solve
> arises with google cloud storage as well.
> 
> Thanks,
> rishi
> 

That's Google's problem for now.

S3 has some specific issues:

1. There's no rename. The FileSystem.rename() command is mocked by LIST, COPY &
DELETE, so it takes O(data). This is the slow bit people complain about.
2. It has list inconsistency, so LIST may actually miss data. This is the
bit people should be worrying about.
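
To make the two points concrete, here is a minimal toy sketch in Python. It is not the S3A code: ToyObjectStore, its hidden_from_list set and emulated_rename() are all hypothetical names invented for illustration. It only assumes a flat key/value store where a LIST call can lag behind a recent PUT.

```python
class ToyObjectStore:
    """In-memory stand-in for a blobstore: flat keys, no directories, no rename."""

    def __init__(self):
        self.objects = {}              # key -> bytes
        self.hidden_from_list = set()  # keys that LIST does not return yet
        self.bytes_copied = 0          # total data moved by COPY calls

    def put(self, key, data, visible=True):
        self.objects[key] = data
        if not visible:
            # Assumption: models list inconsistency, the object exists
            # but a LIST issued now will not include it.
            self.hidden_from_list.add(key)

    def list(self, prefix):
        return sorted(k for k in self.objects
                      if k.startswith(prefix) and k not in self.hidden_from_list)

    def copy(self, src, dst):
        data = self.objects[src]
        self.objects[dst] = data
        self.bytes_copied += len(data)

    def delete(self, key):
        del self.objects[key]


def emulated_rename(store, src_prefix, dst_prefix):
    """Mimic a directory rename: LIST the source, then COPY + DELETE each object."""
    moved = []
    for key in store.list(src_prefix):
        new_key = dst_prefix + key[len(src_prefix):]
        store.copy(key, new_key)
        store.delete(key)
        moved.append(new_key)
    return moved


store = ToyObjectStore()
store.put("tmp/task1/part-0000", b"a" * 1000)
store.put("tmp/task1/part-0001", b"b" * 2000)
store.put("tmp/task1/part-0002", b"c" * 3000, visible=False)  # not listed yet

moved = emulated_rename(store, "tmp/task1/", "out/")
print(moved)               # only the objects that LIST returned
print(store.bytes_copied)  # grows with the amount of data "renamed"
```

In this toy run the cost of the rename is proportional to the bytes copied, and the third file is silently left behind under tmp/ because LIST never reported it. Nothing fails, which is why the loss goes unobserved.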

The now-deleted DirectOutputCommitter didn't use rename, so it avoided both
issues; it just didn't handle failures or speculation. In the absence of
failures, all the data did end up in the right place, whereas list
inconsistency in rename() means that you may have unobserved data loss. That's
the big problem.
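
A rough Python sketch of why writing straight to the destination breaks down under failures. This is not the DirectOutputCommitter code: direct_write_task(), the attempt IDs and the key naming are all made up for illustration, and a plain dict stands in for the destination store.

```python
def direct_write_task(dest, attempt_id, records, fail_after=None):
    """Write each record straight to its final key; there is no commit step."""
    for i, rec in enumerate(records):
        if fail_after is not None and i == fail_after:
            raise RuntimeError(f"{attempt_id} failed")
        # Hypothetical naming scheme: one object per record, per attempt.
        dest[f"out/{attempt_id}-part-{i:04d}"] = rec


dest = {}
records = ["r0", "r1", "r2"]

# First attempt dies partway through; whatever it wrote is already visible.
try:
    direct_write_task(dest, "attempt_0", records, fail_after=2)
except RuntimeError:
    pass

# The retry (or a speculative duplicate) writes the same records again.
direct_write_task(dest, "attempt_1", records)

for key in sorted(dest):
    print(key, dest[key])
```

With no failures only one attempt's files would exist and the output would be correct. With a failed or speculative attempt, its partial output sits next to the successful attempt's output, and nothing in the write path cleans it up.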

Azure WASB has fast atomic rename, so it doesn't have the specific S3 problem.

The work I'm doing in Hadoop for committers (HADOOP-13786) is designed to make
it possible to put in different committers under the FileOutputFormat data
writer, without the layers above having to worry; if the other blobstores need
a new commit algorithm, adding it will be more straightforward. And the tests
I'm writing are mostly designed to be retargetable from the outset.

Finally, a quick talk from last week about filesystems, POSIX, object stores
and NVM, and how the commit problems get pulled both ways. If your RAM
persists, you'd better be doing atomic record updates and making sure that when
you write back something you want persisted, it isn't left cached by the CPU.

https://www.youtube.com/watch?v=UOE2m_XUr3U&feature=youtu.be&list=PLq-odUc2x7i-9Nijx-WfoRMoAfHC9XzTt

-Steve

