Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Brian Bockelman Wed, 04 Feb 2009 10:51:00 -0800

Sounds overly complicated.  Complicated usually leads to mistakes :)

What about just having a single cluster and only running the tasktrackers on the fast CPUs? No messy cross-cluster transferring.


Brian

On Feb 4, 2009, at 12:46 PM, TCK wrote:

Thanks, Brian. This sounds encouraging for us.
What are the advantages/disadvantages of keeping a persistent storage (HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster?
The advantage I can think of is that a permanent storage cluster has different requirements from a map-reduce processing cluster -- the permanent storage cluster would need faster, bigger hard disks, and would need to grow as the total volume of all collected logs grows, whereas the processing cluster would need fast CPUs and would only need to grow with the rate of incoming data. So it seems to make sense to me to copy a piece of data from the permanent storage cluster to the processing cluster only when it needs to be processed.
Is my line of thinking reasonable? How would this compare to running the map-reduce processing on same cluster as the data is stored in? Which approach is used by most people?
Best Regards,
TCK



--- On Wed, 2/4/09, Brian Bockelman <[email protected]> wrote:
From: Brian Bockelman <[email protected]>
Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
To: [email protected]
Date: Wednesday, February 4, 2009, 1:06 PM

Hey TCK,

We use HDFS+FUSE solely as a storage solution for an application which
doesn't understand MapReduce.  We've scaled this solution to around
80Gbps. For 300 processes reading from the same file, we get about 20Gbps.

Do consider your data retention policies -- I would say that Hadoop as a
storage system is thus far about 99% reliable for storage and is not a backup
solution. If you're scared of getting more than 1% of your logs lost, have
a good backup solution. I would also add that when you are learning your
operational staff's abilities, expect even more data loss. As you gain
experience, data loss goes down.

I don't believe we've lost a single block in the last month, but it
took us 2-3 months of 1%-level losses to get here.

Brian
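
As a rough back-of-the-envelope reading of the figures above, here is a small Python sketch. It is an illustration, not a measurement: it assumes the 20Gbps aggregate is split evenly across the 300 readers and treats the 1% figure as a plain fraction of stored data. The function names and the 10 TB example volume are made up for the example.

```python
def per_reader_mbps(aggregate_gbps, readers):
    """Average throughput per reader in Mbps, assuming an even split
    of the aggregate rate (an assumption, not something measured)."""
    return aggregate_gbps * 1000.0 / readers


def expected_loss_gb(stored_gb, loss_fraction):
    """Volume of data expected to be lost, treating the loss rate
    as a simple fraction of everything stored."""
    return stored_gb * loss_fraction


# 20 Gbps shared by 300 readers of the same file.
same_file_rate = per_reader_mbps(20, 300)

# Hypothetical 10 TB (10,000 GB) of logs at a 1% loss level.
lost_gb = expected_loss_gb(10000, 0.01)
```

Under those assumptions each reader averages roughly 67Mbps, and a 1% loss level on 10 TB of logs would be on the order of 100 GB, which is why a separate backup matters if that is not acceptable.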

On Feb 4, 2009, at 11:51 AM, TCK wrote:
Hey guys,
We have been using Hadoop to do batch processing of logs. The logs get
written and stored on a NAS. Our Hadoop cluster periodically copies a batch of
new logs from the NAS, via NFS into Hadoop's HDFS, processes them, and
copies the output back to the NAS. The HDFS is cleaned up at the end of each
batch (ie, everything in it is deleted).
The problem is that reads off the NAS via NFS don't scale even if we
try to scale the copying process by adding more threads to read in parallel.
If we instead stored the log files on an HDFS cluster (instead of NAS), it
seems like the reads would scale since the data can be read from multiple
datanodes at the same time without any contention (except network IO, which
shouldn't be a problem).
I would appreciate if anyone could share any similar experience they have
had with doing parallel reads from a storage HDFS.
Also is it a good idea to have a separate HDFS for storage vs for doing
the batch processing ?
Best Regards,
TCK
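
For readers unfamiliar with the pattern, the threaded copy step described above might look roughly like the following Python sketch. This is not the actual copy job from this thread: `copy_batch`, its parameters, and the use of plain local directories in place of the NFS mount and HDFS are all assumptions made for illustration.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_batch(src_dir, dst_dir, workers=8):
    """Copy every regular file in src_dir into dst_dir using a thread pool.

    src_dir stands in for the NFS-mounted NAS directory and dst_dir for
    the staging area; both are hypothetical paths.
    Returns the destination paths as strings, in sorted source order.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src.iterdir() if p.is_file())

    def copy_one(path):
        # shutil.copy2 copies file contents plus metadata and returns
        # the destination path.
        return str(shutil.copy2(path, dst / path.name))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in input order and re-raises the first
        # exception hit by any copy.
        return list(pool.map(copy_one, files))
```

Raising `workers` only helps while the source can serve more concurrent reads. With a single NAS behind NFS, all threads still pull from the same server, which matches the lack of scaling described above.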
