Hello HDFS users,

We are considering using Hadoop purely as a clustered storage solution.
Has anyone used it this way, and do you have experiences or wisdom to
share?

We need to distribute a large number of big files across 30+ machines,
and HDFS seems to have all the right features: replication, automatic
handling of downed nodes, and so on.  On paper it looks like a good
fit, but I'd like to know whether that is backed up by real-world
experience.
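
For concreteness, the whole workflow we have in mind is just pushing
files in and out through the FileSystem API, roughly like this (the
namenode address and paths are placeholders, not our real layout):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder address; older releases use fs.default.name.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);
            // HDFS splits the file into blocks, places replicas across
            // datanodes, and re-replicates if a node goes down.
            fs.copyFromLocalFile(new Path("/local/bigfile"),
                                 new Path("/data/bigfile"));
        }
    }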

My first concern: some of our initial throughput tests show that
transferring files into and out of HDFS is noticeably slower than a
straight copy to a single machine would be.  I was hoping the
throughput would be the same, or better in cases where the Hadoop
client can talk to many datanodes at once.  Is lower copy throughput
expected, or is there something I've failed to tune?
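
For reference, my test is essentially just a timed streaming write
through the client, along these lines (the size, host, and path below
are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteProbe {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same placeholder namenode address as above.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            byte[] buf = new byte[64 * 1024];      // 64 KB writes
            long total = 1024L * 1024L * 1024L;    // 1 GB overall
            long start = System.nanoTime();
            FSDataOutputStream out = fs.create(new Path("/tmp/probe"));
            try {
                for (long done = 0; done < total; done += buf.length) {
                    out.write(buf);
                }
            } finally {
                out.close();  // flushes the last block to the pipeline
            }
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("wrote %d MB in %.1f s (%.1f MB/s)%n",
                              total >> 20, secs, (total >> 20) / secs);
        }
    }

So far I haven't touched settings like io.file.buffer.size or
dfs.block.size; if those are where I should be looking, pointers would
be welcome.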

My other concern is what would happen if we set the default replication
to 2.  I know 3 is the customary setting, but we really need to keep
costs down.  Does anyone have real-world experience maintaining a
medium-sized farm with replication set to 2?  Anything to watch out
for?
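
For what it's worth, the change we'd make is just the dfs.replication
default in hdfs-site.xml, with the option of bumping individual files
back up through the API, e.g. (the path is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BumpReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Cluster default would be dfs.replication=2; raise it to 3
            // for files we can't afford to re-generate.
            fs.setReplication(new Path("/data/critical-file"), (short) 3);
        }
    }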

Thanks for any feedback.  You can write me directly and I'll be happy to
summarize findings back to the list if there is interest.

gregc
