On Feb 4, 2011, at 07:46 , Keith Wiley wrote:

> On Feb 3, 2011, at 6:41 PM, Allen Wittenauer wrote:
> 
>>      If all the task slots are in use, why would you care if they are 
>> queueing up?  Also keep in mind that if a node fails, that work will need to 
>> get re-done anyway.
> 
> 
> Because all slots are not in use.  It's a very large cluster and it's 
> excruciating that Hadoop partially serializes a job by piling multiple map 
> tasks onto a single map in a queue even when the cluster is massively 
> underutilized.  This occurs when the input records are significantly smaller 
> than the block size (6MB vs 64MB in my case, giving me about a 32x 
> serialization cost!!!).  To put it differently, if I let Hadoop do it its own 
> stupid way, the job takes 32 times longer than it should take if it evenly 
> distributed the map tasks across the nodes.  Packing the input files into 
> larger sequence files does not help with this problem.  The input splits are 
> calculated from the individual files and thus, I still get this undesirable 
> packing effect.


Having reread my last paragraph, I am now reconsidering its tone.  I apologize.
I am entirely open to the possibility that there are smarter ways to achieve 
my desired goal of minimum job-turnaround time (maximum parallelism), perhaps 
via various configuration parameters which I have not learned how to use 
properly...and furthermore I am willing to admit that the seemingly frustrating 
and seemingly illogical partial serialization that I witnessed in my jobs using 
Hadoop's default configuration was not necessarily Hadoop's fault but rather 
originated from some ineptitude on my part w.r.t. configuring, programming, and 
using Hadoop properly.

In other words, I am perfectly willing to admit I might just not be using 
Hadoop correctly and that this problem is therefore basically my fault.

Sorry.

________________________________________________________________________________
Keith Wiley               [email protected]               www.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
  -- Keith Wiley
________________________________________________________________________________


