Hey Josh,

Are there other platforms on AWS (or another cloud provider) that Accumulo/HDFS run well on? I thought I remembered you and others running the agitation tests on Amazon instances during release testing. If there are alternatives, what advantages would S3 have over the current method?

On Mon, Apr 25, 2016 at 8:09 AM, Josh Elser <[email protected]> wrote:

> I'm not sure on the guarantees of s3 (much less the s3 or s3a Hadoop
> FileSystem implementations), but, historically, the common issue is
> lacking/incorrect implementations of sync(). For durability (read-as: not
> losing your data), Accumulo *must* know that when it calls sync() on a
> file, the data is persisted.
>
> I don't know definitively what S3 guarantees (or asserts to guarantee),
> but I would be very afraid until I ran some testing (we have one good test
> in Accumulo that can run for days and verify data integrity called
> continuous ingest).
>
> You might have luck reaching out to the Hadoop community to get some
> understanding from them about what can reasonably be expected with the
> current S3 FileSystem implementations, and then run your own tests to make
> sure that data is not lost.
>
> vdelmeglio wrote:
>
>> Hi everyone,
>>
>> I recently got this answer on stackoverflow (link:
>> http://stackoverflow.com/questions/36602719/accumulo-cluster-in-aws-with-s3-not-really-stable/36772874#36772874
>> ):
>>
>>> Yes, I would expect that running Accumulo with S3 would result in
>>> problems. Even though S3 has a FileSystem implementation, it does not
>>> behave like a normal file system. Some examples of the differences are
>>> that operations we would expect to be atomic are not atomic in S3,
>>> exceptions may mean different things than we expect, and we assume our
>>> view of files and their metadata is consistent rather than the eventual
>>> consistency S3 provides.
>>>
>>> It's possible these issues could be mitigated if we made some
>>> modifications to the Accumulo code, but as far as I know no one has tried
>>> running Accumulo on S3 to figure out the problems and whether those could
>>> be fixed or not.
>>
>> Since we're currently running an accumulo cluster on aws with s3 for
>> evaluation purpose, this answer make me wonder, should someone explain me
>> why running accumulo on s3 is not a good idea? in the specific, which
>> operations are expected to be atomic on accumulo?
>>
>> Is there eventually a roadmap for s3 compatibility?
>>
>> Thanks!
>> Valerio
>>
>> --
>> View this message in context:
>> http://apache-accumulo.1065345.n5.nabble.com/Accumulo-on-s3-tp16737.html
>> Sent from the Developers mailing list archive at Nabble.com.
