Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Brian Bockelman Wed, 04 Feb 2009 10:06:35 -0800

Hey TCK,

We use HDFS+FUSE solely as a storage solution for an application which doesn't understand MapReduce. We've scaled this solution to around 80Gbps. For 300 processes reading from the same file, we get about 20Gbps.
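For a sense of scale, the single-file figure can be turned into a rough per-process rate. This is only back-of-the-envelope arithmetic on the numbers quoted above. It assumes the 20Gbps is shared evenly across the 300 readers and that decimal units apply (1 Gbps = 1000 Mbps); neither assumption is stated in the original message.

```python
# Figures quoted in the message above.
aggregate_gbps = 20   # total throughput for readers of one file
readers = 300         # concurrent reading processes

# Assumption: bandwidth is split evenly and units are decimal.
per_reader_mbps = aggregate_gbps * 1000 / readers   # megabits per second
per_reader_mbytes = per_reader_mbps / 8             # megabytes per second

print(f"{per_reader_mbps:.1f} Mbps per reader (~{per_reader_mbytes:.1f} MB/s)")
```

Under those assumptions each reader sees roughly 67 Mbps, or about 8 MB/s.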

Do consider your data retention policies -- I would say that Hadoop as a storage system is thus far about 99% reliable for storage and is not a backup solution. If you're scared of getting more than 1% of your logs lost, have a good backup solution. I would also add that when you are learning your operational staff's abilities, expect even more data loss. As you gain experience, data loss goes down.

I don't believe we've lost a single block in the last month, but it took us 2-3 months of 1%-level losses to get here.


Brian

On Feb 4, 2009, at 11:51 AM, TCK wrote:

Hey guys,
We have been using Hadoop to do batch processing of logs. The logs get written and stored on a NAS. Our Hadoop cluster periodically copies a batch of new logs from the NAS, via NFS, into Hadoop's HDFS, processes them, and copies the output back to the NAS. The HDFS is cleaned up at the end of each batch (i.e., everything in it is deleted).
The problem is that reads off the NAS via NFS don't scale, even if we try to scale the copying process by adding more threads to read in parallel.
If we instead stored the log files on an HDFS cluster (instead of a NAS), it seems like the reads would scale, since the data can be read from multiple data nodes at the same time without any contention (except network IO, which shouldn't be a problem).
I would appreciate it if anyone could share any similar experience they have had with doing parallel reads from a storage HDFS.
Also, is it a good idea to have a separate HDFS for storage vs. for doing the batch processing?
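As a rough illustration of the parallel-read idea above (not code from this thread), the sketch below reads one file in fixed-size byte ranges from several threads. It assumes the HDFS namespace is exposed as an ordinary filesystem path, for example through a FUSE mount; the function names `read_range` and `parallel_read`, the chunk size, and the worker count are all hypothetical choices.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_range(path, offset, length):
    """Read up to `length` bytes starting at `offset`, using a private file handle."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

def parallel_read(path, chunk_size=64 * 1024 * 1024, workers=8):
    """Read a whole file as concurrent byte ranges and return its contents in order."""
    size = os.path.getsize(path)
    offsets = range(0, size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of `offsets`, so the join is ordered.
        chunks = pool.map(lambda off: read_range(path, off, chunk_size), offsets)
        return b"".join(chunks)
```

A call might look like `parallel_read("/mnt/hdfs/logs/batch.log")`, where the path is a placeholder. The sketch holds the entire file in memory, so a real copy job would stream each range to its destination instead. Whether the ranges are actually served by different data nodes depends on block size and replica placement, which this code does not control.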
Best Regards,
TCK
