On Sun, Aug 7, 2011 at 12:49 PM, Ido Hadanny <ido.hada...@gmail.com> wrote:

> When you join tables that are distributed on the same key and use those
> key columns in the join condition, each SPU (machine) in Netezza works
> 100% independently of the others (see nz-interview
> <http://www.folkstalk.com/2011/06/netezza-interview-questions-part-2.html>).
>
> In Hive there's the bucketed map join
> <https://issues.apache.org/jira/browse/HIVE-917>, but the distribution of
> the files representing the tables to datanodes is the responsibility of
> HDFS; it's not done according to the Hive CLUSTERED BY key!
>
> So suppose I have two tables, CLUSTERED BY the same key, and I join on that
> key - can Hive get a guarantee from HDFS that matching buckets will sit on
> the same node? Or will it always have to move the matching bucket of the
> small table to the datanode containing the big table's bucket?
>
> Thanks, ido
> (Resend, since my last mail didn't reach the list)
>

To answer your question quickly: no. Hadoop shards blocks round robin across
the cluster; you cannot control what is stored where, so there is no
guarantee that matching buckets land on the same node.
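
To make the bucket-map-join side concrete, here is a minimal HiveQL sketch
(table names, columns, and bucket counts are made up for illustration). Even
with this setup, Hive only narrows the join to matching buckets; the small
table's bucket is still read from wherever HDFS happened to place its blocks:

-- Make inserts actually write one file per bucket (hypothetical tables).
SET hive.enforce.bucketing = true;

CREATE TABLE big_t   (user_id BIGINT, payload STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

CREATE TABLE small_t (user_id BIGINT, attr STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Enable the bucketed map join from HIVE-917.
SET hive.optimize.bucketmapjoin = true;

-- Each mapper reads one bucket of big_t and loads only the matching bucket
-- of small_t into memory, but that bucket's blocks are fetched from wherever
-- HDFS placed them, not necessarily the local datanode.
SELECT /*+ MAPJOIN(s) */ b.user_id, b.payload, s.attr
FROM big_t b
JOIN small_t s ON (b.user_id = s.user_id);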

On the flip side of this coin...

"Disk-Locality in Datacenter Computing Considered Irrelevant":
http://www.cs.berkeley.edu/~ganesha/disk-irrelevant_hotos2011.pdf

Having lived in the "real world" for a while now, I have many fundamental
disagreements with that paper. My biggest problem is its assumption that
rack-level switches can drive every port at full gigabit speed at the same
time; not all switches have the backplane to handle maximum capacity on all
ports at once, and the ones that do cost serious ducats.

There are two philosophies in sharp disagreement here: Netezza, which
physically co-locates matching data on the same hardware, versus HDFS, which
shards at the block level across the entire cluster.

Whose Kool-Aid is best?
