> On 20 Jun 2017, at 07:49, sririshindra <sririshin...@gmail.com> wrote:
>
> Is there anything similar to s3 connector for Google cloud storage?
> Since Google cloud Storage is also an object store rather than a file
> system, I imagine the same problem that the s3 connector is trying to solve
> arises with google cloud storage as well.
>
> Thanks,
> rishi

That's Google's problem for now. S3 has some specific issues:

1. There's no rename. The FileSystem.rename() command is mocked by LIST, COPY & DELETE, so it takes O(data). This is the slow bit people complain about.
2. It has list inconsistency, so LIST may actually miss the data. This is the bit people should be worrying about.

The now-deleted DirectOutputCommitter didn't use rename, so it avoided both issues; it just didn't handle failures or speculation. In the absence of failures, all the data did end up in the right place, whereas list inconsistency in rename() means that you may have unobserved data loss. That's the big problem.

Azure WASB has fast atomic rename, so it doesn't have the specific S3 problem.

The work I'm doing in Hadoop for committers (HADOOP-13786) is designed to make it possible to put in different committers under the FileOutputFormat data writer, without the code above it having to worry about them; if the other blobstores need a new commit algorithm, it will be more straightforward. And the tests I'm writing are mostly designed to be retargetable from the outset.

Finally, a quick talk from last week about filesystems, POSIX, object stores and NVM, and how the commit problems get pulled both ways. If your RAM persists, you'd better be doing atomic record updates and making sure that when you write back something you want persisted, it isn't still cached by the CPU:

https://www.youtube.com/watch?v=UOE2m_XUr3U&feature=youtu.be&list=PLq-odUc2x7i-9Nijx-WfoRMoAfHC9XzTt

-Steve
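
To make the two S3 issues above concrete, here is a minimal sketch using a toy in-memory store. Everything in it is an assumption for illustration: `ToyObjectStore`, `mock_rename` and the `visible_in_list` flag are hypothetical names, not the S3A connector's code or any real S3 API. It only shows the shape of a rename emulated as LIST, COPY & DELETE, and what happens when LIST omits an object that exists.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in for an S3-like store; not a real client API."""

    def __init__(self):
        self.objects = {}      # key -> bytes
        self.lagging = set()   # keys that exist but are not yet visible to LIST
        self.bytes_copied = 0  # total data moved by COPY calls

    def put(self, key, data, visible_in_list=True):
        self.objects[key] = data
        if not visible_in_list:
            self.lagging.add(key)

    def list(self, prefix):
        # An inconsistent LIST can omit objects that do exist.
        return sorted(k for k in self.objects
                      if k.startswith(prefix) and k not in self.lagging)

    def copy(self, src, dst):
        data = self.objects[src]
        self.objects[dst] = data
        self.bytes_copied += len(data)

    def delete(self, key):
        self.objects.pop(key, None)
        self.lagging.discard(key)


def mock_rename(store, src_prefix, dst_prefix):
    """Emulate a directory rename as LIST, then COPY and DELETE for each key."""
    moved = []
    for key in store.list(src_prefix):
        new_key = dst_prefix + key[len(src_prefix):]
        store.copy(key, new_key)
        store.delete(key)
        moved.append(new_key)
    return moved


store = ToyObjectStore()
store.put("tmp/task1/part-0000", b"a" * 1000)
# This object exists, but the listing has not caught up with it yet.
store.put("tmp/task1/part-0001", b"b" * 1000, visible_in_list=False)

moved = mock_rename(store, "tmp/task1/", "out/")
print(moved)                             # only the keys that LIST returned
print(store.bytes_copied)                # grows with the amount of data moved
print("out/part-0001" in store.objects)  # False: the unlisted file was never moved
```

In this sketch the rename completes without any error, yet `part-0001` stays behind under `tmp/task1/` and nothing flags it; that is the unobserved data loss described above. The `bytes_copied` counter is the O(data) cost of the COPY step.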