We effectively have this situation on a significant fraction of our
workload as well.  Much of our data is summarized hourly and is encrypted
and compressed, which makes it unsplittable.  This means that the map
processes are often not local to the data, since each file's blocks are
typically spread across only two or three datanodes.  The result is that
locality drops from the typical 80-90% to about 20-30% on a small cluster.
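
For reference, the splittability decision lives in the InputFormat.  A
minimal sketch against the stock mapred API (EncryptedTextInputFormat is
a hypothetical name, just for illustration):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical input format for encrypted+compressed hourly files.
    // Returning false from isSplitable means one map task per file, so a
    // map can only be data-local on the two or three nodes holding the
    // file's replicas -- which is where the locality drop comes from.
    public class EncryptedTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        // The payload can only be decrypted/decompressed from the
        // beginning, so it cannot be split at arbitrary byte offsets.
        return false;
      }
    }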

The result is measurably lower performance, but the difference is dominated
by the limited parallelism and the cost of decryption rather than by the
loss of locality itself.
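
To put rough numbers on the parallelism limit (assuming one summary file
per hour per stream): a day of input is only 24 unsplittable files, so at
most 24 map tasks can work on it no matter how many map slots the cluster
has free.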

Also, when people talk about "not decent" disks, they are often referring
primarily to seek time and rotational latency under random-access job mixes.
Even pretty poor disks are useful with map-reduce because so much of the
code reads large chunks of the disk in highly sequential order.  Moreover,
even a poor disk will add to the overall aggregate read and write bandwidth
of the cluster ... one map task or another might take a bit longer to finish
a disk-bound job, but the overall result should still be better than not
having the disk and node at all.
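
As a rough illustration with made-up numbers: ten nodes whose disks each
sustain only 30 MB/s sequentially still contribute about 300 MB/s of
aggregate scan bandwidth, and leaving three of them out of the cluster
because their disks are "not decent" would cut that to 210 MB/s.  A slow
disk beats no disk.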


On 1/20/08 8:29 AM, "Allen Wittenauer" <[EMAIL PROTECTED]> wrote:

> 
> On 1/18/08 3:29 PM, "Jason Venner" <[EMAIL PROTECTED]> wrote:
>> We were thinking of doing this with some machines that do not have
>> decent disks but have plenty of network bandwidth.
> 
> 
>     We were doing it for a while, in particular for our data loaders* ...
> but that was months and months ago.
> 
>     I don't remember any specific complaints about speed, but you obviously
> lose some of the data locality capabilities.  Depending upon your workload
> and how your network is configured (bandwidth -and- latency), you might be
> ok.
> 
> 
> * - this is to force the data to get spread out amongst the data nodes vs.
> filling up the node that the data is being loaded from.
> 
