On Feb 6, 2009, at 11:00 AM, TCK wrote:


How well does read throughput from HDFS scale with the number of data nodes? For example, if I had a large file (say 10GB) on a 10-data-node cluster, would the time taken to read the whole file in parallel (i.e., with multiple reader client processes requesting different parts of the file in parallel) be halved if I had the same file on a 20-data-node cluster?

Possibly. (I can't give a firm answer because it depends on the number of blocks and the number of replicas.)

If there are enough replicas and enough separate reading processes with enough network bandwidth, then yes, your read bandwidth could double.

Is this not possible because HDFS doesn't support random seeks?

It does for reads.  It does not for writes.

Trust me, our physicists have what can best be described as "the most god-awful random read patterns you've seen in your life" and they do fine on HDFS.
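
For illustration, here's a minimal sketch of what a random read looks like through the Hadoop Java FileSystem API. The namenode URI, file path, and offsets are invented for the example, not taken from this thread:

    // Sketch only: random reads against HDFS via the Hadoop Java API.
    // Namenode URI, path, and offsets are made-up values.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RandomReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
            FSDataInputStream in = fs.open(new Path("/logs/big.log"));

            byte[] buf = new byte[64 * 1024];

            // Plain seek-then-read: supported for reads, not for writes.
            in.seek(5L * 1024 * 1024 * 1024);   // jump 5GB into the file
            int n = in.read(buf, 0, buf.length);

            // Positioned ("pread") variant: each reader hands in its own
            // offset, which is how several clients can pull different parts
            // of the same file in parallel.
            int m = in.read(1L * 1024 * 1024 * 1024, buf, 0, buf.length);

            System.out.println("read " + n + " and " + m + " bytes");
            in.close();
            fs.close();
        }
    }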

What if the file were split up into multiple smaller files before being placed in HDFS?

Then things would be less efficient and you'd be less likely to scale.

Brian


Thanks for your input.
-TCK




--- On Wed, 2/4/09, Brian Bockelman <bbock...@cse.unl.edu> wrote:
From: Brian Bockelman <bbock...@cse.unl.edu>
Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
To: core-user@hadoop.apache.org
Date: Wednesday, February 4, 2009, 1:50 PM

Sounds overly complicated.  Complicated usually leads to mistakes :)

What about just having a single cluster and only running the tasktrackers on the fast CPUs?  No messy cross-cluster transferring.

Brian

On Feb 4, 2009, at 12:46 PM, TCK wrote:



Thanks, Brian. This sounds encouraging for us.

What are the advantages/disadvantages of keeping a persistent storage (HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster?

The advantage I can think of is that a permanent storage cluster has different requirements from a map-reduce processing cluster -- the permanent storage cluster would need bigger, faster hard disks and would need to grow as the total volume of collected logs grows, whereas the processing cluster would need fast CPUs and would only need to grow with the rate of incoming data. So it seems to make sense to me to copy a piece of data from the permanent storage cluster to the processing cluster only when it needs to be processed.

Is my line of thinking reasonable? How would this compare to running the map-reduce processing on the same cluster the data is stored in? Which approach do most people use?

Best Regards,
TCK



--- On Wed, 2/4/09, Brian Bockelman <bbock...@cse.unl.edu> wrote:
From: Brian Bockelman <bbock...@cse.unl.edu>
Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
To: core-user@hadoop.apache.org
Date: Wednesday, February 4, 2009, 1:06 PM

Hey TCK,

We use HDFS+FUSE solely as a storage solution for an application that doesn't understand MapReduce.  We've scaled this solution to around 80Gbps.  For 300 processes reading from the same file, we get about 20Gbps.
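
As a rough sketch of what that looks like from the application's side: once HDFS is exposed through a FUSE mount, the application just does ordinary file I/O against the mount point. The /mnt/hdfs mount point and file path below are hypothetical:

    // Sketch only: reading a file from an HDFS-backed FUSE mount with plain
    // java.io -- no Hadoop client libraries involved.  Mount point and path
    // are hypothetical.
    import java.io.FileInputStream;
    import java.io.IOException;

    public class FuseReadSketch {
        public static void main(String[] args) throws IOException {
            byte[] buf = new byte[64 * 1024];
            FileInputStream in = new FileInputStream("/mnt/hdfs/logs/big.log");
            try {
                long total = 0;
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n;
                }
                System.out.println("read " + total + " bytes through the FUSE mount");
            } finally {
                in.close();
            }
        }
    }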

Do consider your data retention policies -- I would say that Hadoop as a storage system is thus far about 99% reliable and is not a backup solution. If losing more than 1% of your logs would scare you, have a good backup solution. I would also add that while you are still learning your operational staff's abilities, expect even more data loss.  As you gain experience, data loss goes down.

I don't believe we've lost a single block in the last month, but it took us 2-3 months of 1%-level losses to get here.

Brian

On Feb 4, 2009, at 11:51 AM, TCK wrote:


Hey guys,

We have been using Hadoop to do batch processing of logs. The logs get written and stored on a NAS. Our Hadoop cluster periodically copies a batch of new logs from the NAS, via NFS, into Hadoop's HDFS, processes them, and copies the output back to the NAS. The HDFS is cleaned up at the end of each batch (i.e., everything in it is deleted).
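
To make the cycle concrete, here is a rough sketch of it using the Hadoop FileSystem API; the namenode URI and all paths are invented for illustration, and the MapReduce job itself is left as a placeholder:

    // Sketch only: stage a batch from the NFS-mounted NAS into HDFS, run the
    // job (placeholder), copy results back, then clean up HDFS.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BatchCycleSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

            Path nasBatch   = new Path("file:///mnt/nas/logs/batch-0001");
            Path hdfsInput  = new Path("/batch/input");
            Path hdfsOutput = new Path("/batch/output");
            Path nasResults = new Path("file:///mnt/nas/results/batch-0001");

            // 1. Copy the new batch of logs from the NAS (over NFS) into HDFS.
            hdfs.copyFromLocalFile(nasBatch, hdfsInput);

            // 2. ... run the MapReduce job over hdfsInput, writing to hdfsOutput ...

            // 3. Copy the job output back to the NAS.
            hdfs.copyToLocalFile(hdfsOutput, nasResults);

            // 4. Clean up HDFS at the end of the batch.
            hdfs.delete(hdfsInput, true);
            hdfs.delete(hdfsOutput, true);

            hdfs.close();
        }
    }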

The problem is that reads off the NAS via NFS don't scale, even if we try to scale the copying process by adding more threads to read in parallel.

If we instead stored the log files on an HDFS cluster (instead of the NAS), it seems like the reads would scale, since the data can be read from multiple data nodes at the same time without any contention (except network IO, which shouldn't be a problem).

I would appreciate it if anyone could share any similar experience they have had with doing parallel reads from a storage HDFS.

Also, is it a good idea to have separate HDFS clusters for storage and for the batch processing?

Best Regards,
TCK
