Yeah, EC2's EBS and ephemeral storage are fine AFAIK. I just don't know
much of anything about S3 (which might be why I'm inherently so
pessimistic about it working :P).
Dylan Hutchison wrote:
Hey Josh,
Are there other platforms on AWS (or another cloud provider) that
Accumulo/HDFS are friendly to run on? I thought I remembered you and
others running the agitation tests on Amazon instances during
release-testing time. If there are alternatives, what advantages would S3
have over the current method?
On Mon, Apr 25, 2016 at 8:09 AM, Josh Elser wrote:
I'm not sure about the guarantees of S3 (much less the s3 or s3a Hadoop
FileSystem implementations), but, historically, the common issue is
missing or incorrect implementations of sync(). For durability (read: not
losing your data), Accumulo *must* know that when it calls sync() on a
file, the data is persisted.
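To make the sync() requirement concrete, here is a minimal sketch of the
"write, sync, then acknowledge" pattern on a local file system. It is only
an analogy: it uses Python's os.fsync() rather than Hadoop's stream calls
(hflush()/hsync()), and durable_append is a hypothetical helper name, not
anything from Accumulo.

```python
import os


def durable_append(path, record):
    """Append `record` to `path`; return only after the OS reports it persisted.

    Hypothetical helper for illustration. A write-ahead log relies on this
    ordering: the caller is told "done" only after the sync has returned.
    """
    with open(path, "ab") as f:
        f.write(record)
        f.flush()             # push Python's user-space buffer down to the OS
        os.fsync(f.fileno())  # ask the OS to persist the data to the device
    # Size of the file after the append, handy for a quick sanity check.
    return os.path.getsize(path)
```

If the storage layer treats the sync step as a no-op, or only persists data
when the file is closed, a crash between "acknowledged" and "closed" can
lose data the caller believed was safe.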
I don't know definitively what S3 guarantees (or claims to guarantee),
but I would be very afraid until I ran some testing. Accumulo has one good
test for this, continuous ingest, which can run for days and verify data
integrity.
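As a rough sketch of the idea behind that kind of test (not the actual
Accumulo test code): write entries that each reference an earlier entry,
then check that every referenced entry still exists. The in-memory dict
standing in for a table, and the names ingest and verify, are assumptions
made for illustration.

```python
import random


def ingest(table, count, rng):
    """Write `count` entries; each one points at an entry written earlier."""
    keys = list(table)
    for _ in range(count):
        key = rng.getrandbits(63)
        prev = rng.choice(keys) if keys else None  # first entry has no parent
        table[key] = prev
        keys.append(key)


def verify(table):
    """Return the referenced keys that are missing, i.e. evidence of lost data."""
    return sorted(
        {prev for prev in table.values() if prev is not None and prev not in table}
    )
```

An empty result from verify() after a long ingest run (ideally with
processes being killed and restarted along the way) is the evidence you
want before trusting a storage layer.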
You might have luck reaching out to the Hadoop community to get some
understanding from them about what can reasonably be expected with the
current S3 FileSystem implementations, and then run your own tests to make
sure that data is not lost.
vdelmeglio wrote:
Hi everyone,
I recently got this answer on stackoverflow (link:
http://stackoverflow.com/questions/36602719/accumulo-cluster-in-aws-with-s3-not-really-stable/36772874#36772874
):
Yes, I would expect that running Accumulo with S3 would result in
problems. Even though S3 has a FileSystem implementation, it does not
behave like a normal file system. Some examples of the differences are
that operations we would expect to be atomic are not atomic in S3,
exceptions may mean different things than we expect, and we assume our
view of files and their metadata is consistent rather than the eventual
consistency S3 provides.
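Rename is a commonly cited example of such an operation (an assumption
here, since the answer does not name one). The sketch below contrasts an
atomic publish-by-rename on a local POSIX file system with a "rename"
emulated as copy-then-delete over a flat key/value store, which is roughly
how object-store file system layers have to do it. Both function names and
the dict-based store are hypothetical.

```python
import os
import tempfile


def publish_atomically(directory, name, data):
    """Write to a temp file, then rename it into place in a single step."""
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    final = os.path.join(directory, name)
    # On POSIX, within one file system, readers see the old state or the
    # new one, never a half-written file under the final name.
    os.replace(tmp, final)
    return final


def copy_then_delete_rename(store, src_prefix, dst_prefix, observe=None):
    """Emulate a directory rename on a flat key/value store (a plain dict).

    `observe`, if given, is called with a snapshot after each copy so the
    intermediate states a concurrent reader could see become visible.
    """
    for key in sorted(k for k in store if k.startswith(src_prefix)):
        store[dst_prefix + key[len(src_prefix):]] = store[key]
        if observe is not None:
            observe(dict(store))
    for key in [k for k in store if k.startswith(src_prefix)]:
        del store[key]
```

Passing a list's append method as observe records each intermediate
snapshot; the first one contains both source keys plus only one destination
key, a state that an atomic rename would never expose.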
It's possible these issues could be mitigated if we made some
modifications to the Accumulo code, but as far as I know no one has tried
running Accumulo on S3 to figure out the problems and whether those could
be fixed or not.
Since we're currently running an Accumulo cluster on AWS with S3 for
evaluation purposes, this answer makes me wonder: could someone explain to
me why running Accumulo on S3 is not a good idea? Specifically, which
operations does Accumulo expect to be atomic?
Also, is there a roadmap for S3 compatibility?
Thanks!
Valerio