Good to know... The problem is that I'm in an academic environment that needs a lot of convincing regarding new computational technologies. I need to show proven benefit before getting the funds to actually implement anything. These servers were the best I could come up with for this proof-of-concept.
I changed some settings on the nodes and have been experimenting, and I'm seeing about 3.4 MB/sec with TestDFSIO, which is pretty consistent with your observations below. Given that, would increasing the block size help my performance? It should result in fewer map tasks and keep the computation local for longer...? I just need to show that the numbers are better than a single machine, even if the current setup sacrifices redundancy (or other factors). (I've also put a sketch of the map-only streaming run Todd suggested at the bottom of this message.)
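Concretely, here's the block-size change I was planning to try -- a sketch only: I'm assuming the 0.20-era property name dfs.block.size, and 128 MB is just a guess at a better value. Since existing files keep the block size they were written with, I'd re-upload the input afterwards.

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>  <!-- 128 MB; the default is 67108864 (64 MB) -->
    </property>

Alternatively, I believe the block size can be overridden per file at upload time without touching the cluster config (samples.txt is a placeholder for my real input):

    hadoop fs -D dfs.block.size=134217728 -put samples.txt /user/andrew/samples.txt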
@alex: Thanks for the links; they give me another bit of evidence to convince those controlling the money flow...

--Andrew

On Tue, 13 Apr 2010 10:29:06 -0700, Todd Lipcon <t...@cloudera.com> wrote:

> On Mon, Apr 12, 2010 at 1:45 PM, Andrew Nguyen
> <andrew-lists-had...@ucsfcti.org> wrote:
>
>> I don't think you can :-). Sorry, they are 100Mbps NICs... I get
>> 95Mbit/sec from one node to another with iperf.
>>
>> Should I still be expecting such dismal performance with just 100Mbps?
>
> Yes - in my experience on gigabit, when lots of transfers are going
> between the nodes, TCP performance actually drops to around half the
> network capacity. In the case of 100Mbps, this is probably going to be
> around 5MB/sec.
>
> So when you're writing output at 3x replication, it's going to be very
> very slow on this network.
>
> -Todd
>
>> On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote:
>>
>>> On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen
>>> <andrew-lists-had...@ucsfcti.org> wrote:
>>>
>>>> 5 identically spec'ed nodes, each has:
>>>>
>>>> 2 GB RAM
>>>> Pentium 4 3.0GHz with HT
>>>> 250GB HDD on PATA
>>>> 10Mbps NIC
>>>
>>> This is probably your issue - a 10Mbps NIC? I didn't know you could
>>> even get those anymore!
>>>
>>> Hadoop runs on commodity hardware, but you're not likely to get
>>> reasonable performance with hardware like that.
>>>
>>> -Todd
>>>
>>>> On Apr 12, 2010, at 11:58 AM, alex kamil wrote:
>>>>
>>>>> Andrew,
>>>>>
>>>>> I would also suggest running the DFSIO benchmark to isolate
>>>>> IO-related issues:
>>>>>
>>>>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
>>>>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
>>>>>
>>>>> There are additional tests specific to MapReduce - run "hadoop jar
>>>>> hadoop-0.20.2-test.jar" for the complete list.
>>>>>
>>>>> 45 min for mapping 6GB on 5 nodes is way too high, assuming your
>>>>> gain/offset conversion is a simple algebraic manipulation.
>>>>>
>>>>> It takes less than 5 min to run a simple mapper (using streaming) on
>>>>> a 4-node cluster on something like 10GB; the mapper I used was an awk
>>>>> command extracting a <key:value> pair from a log (no reducer).
>>>>>
>>>>> Thanks
>>>>> Alex
>>>>>
>>>>> On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>> Hi Andrew,
>>>>>
>>>>> Do you need the sorting behavior that having an identity reducer
>>>>> gives you? If not, set the number of reduce tasks to 0 and you'll
>>>>> end up with a map-only job, which should be significantly faster.
>>>>>
>>>>> -Todd
>>>>>
>>>>> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen
>>>>> <andrew-lists-had...@ucsfcti.org> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I recently set up a 5-node cluster (1 master, 4 slaves) and am
>>>>>> looking to use it to process high volumes of patient physiologic
>>>>>> data. As an initial exercise to gain a better understanding, I have
>>>>>> attempted to run the following problem (which isn't the type of
>>>>>> problem that Hadoop was really designed for, as is my understanding).
>>>>>>
>>>>>> I have a 6G data file that contains key/value pairs of <sample
>>>>>> number, sample value>. I'd like to convert the values based on a
>>>>>> gain/offset to their physical units. I've set up a MapReduce job
>>>>>> using streaming where the mapper does the conversion, and the
>>>>>> reducer is just an identity reducer. Based on other threads on the
>>>>>> mailing list, my initial results are consistent in that it takes
>>>>>> considerably more time to process this in Hadoop than on my MacBook
>>>>>> Pro (45 minutes vs. 13 minutes). The input is a single 6G file and
>>>>>> it looks like the file is being split into 101 map tasks. This is
>>>>>> consistent with the 64M block size.
>>>>>>
>>>>>> So my questions are:
>>>>>>
>>>>>> * Would it help to increase the block size to 128M? Or decrease the
>>>>>> block size? What are some key factors to think about with this
>>>>>> question?
>>>>>> * Are there any other optimizations that I could employ? I have
>>>>>> looked into LzoCompression, but I'd like to still work without
>>>>>> compression since the single-threaded job that I'm comparing to
>>>>>> doesn't use any sort of compression. I know I'm comparing apples to
>>>>>> pears a little here, so please feel free to correct this assumption.
>>>>>> * Is Hadoop really only good for jobs where the data doesn't fit on
>>>>>> a single node? At some level, I assume that it can still speed up
>>>>>> jobs that do fit on one node, if only because you are performing
>>>>>> tasks in parallel.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --Andrew
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
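P.S. For completeness, here's a sketch of the map-only streaming run I plan to try, per Todd's suggestion above. The HDFS paths and the convert.awk script name are placeholders, and the jar path assumes the stock 0.20.2 tarball layout:

    hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
        -D mapred.reduce.tasks=0 \
        -input /user/andrew/samples.txt \
        -output /user/andrew/converted \
        -mapper convert.awk \
        -file convert.awk

where convert.awk is marked executable and looks like:

    #!/usr/bin/awk -f
    # Map-only gain/offset conversion: physical = raw * GAIN + OFFSET.
    # Input lines are tab-separated <sample number, sample value> pairs.
    # GAIN and OFFSET below are placeholders for the real calibration values.
    BEGIN { FS = OFS = "\t"; GAIN = 0.5; OFFSET = -2048.0 }
    { print $1, $2 * GAIN + OFFSET }

With mapred.reduce.tasks=0 the mappers write straight to HDFS and the sort/shuffle is skipped entirely, so this should also take the identity-reducer overhead out of my 45-minute number.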