Sharding and joins
------------------

                 Key: PIG-241
                 URL: https://issues.apache.org/jira/browse/PIG-241
             Project: Pig
          Issue Type: New Feature
          Components: data
            Reporter: John DeTreville


Many large distributed systems for storing and computing over tables divide
these tables into smaller _shards_, such that all rows with the same (primary)
key appear in the same shard. If two tables are consistently sharded (that is,
sharded on the same key with the same key-to-shard mapping), then they can be
joined shard-by-shard. If corresponding shards are also stored on the same
hosts (or racks), then joins can be performed locally on those hosts without
copying the rows of the tables over the network; avoiding that copying can
produce significant speedups.
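
The shard-by-shard join described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not Pig or
Hadoop code: the names shard_of, shard_table, join_shard and sharded_join are
hypothetical, the shard count of 4 is arbitrary, and the join key is assumed
to be the first field of each row.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count, chosen only for illustration


def shard_of(key, num_shards=NUM_SHARDS):
    """Map a key to a shard index using a deterministic hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_shards


def shard_table(rows, num_shards=NUM_SHARDS):
    """Split rows into shards; rows with equal keys share a shard."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_of(row[0], num_shards)].append(row)
    return shards


def join_shard(left, right):
    """Inner hash join of two corresponding shards on the first field."""
    index = defaultdict(list)
    for key, *rest in right:
        index[key].append(rest)
    joined = []
    for key, *rest in left:
        for other in index.get(key, []):
            joined.append((key, *rest, *other))
    return joined


def sharded_join(left_rows, right_rows, num_shards=NUM_SHARDS):
    """Join two tables by joining shard i of one with shard i of the other."""
    left_shards = shard_table(left_rows, num_shards)
    right_shards = shard_table(right_rows, num_shards)
    result = []
    for left, right in zip(left_shards, right_shards):
        # Each iteration touches only one pair of shards, so it could run
        # on the host that stores both of them.
        result.extend(join_shard(left, right))
    return result
```

Because both tables use the same shard_of mapping, a key can only find its
match in the shard with the same index, so no pair of shards with different
indexes ever needs to be compared.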

Pig does not currently provide application-controlled sharding, nor the
associated control over shard placement and computation placement. The
performance of joins therefore suffers in many scenarios, because rows are
passed over the network multiple times when performing a join. If Pig (and
Hadoop) allowed the application to shard tables consistently, according to an
application-controlled policy, joins could be completely local operations and
could in many cases perform much better.
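
To make the cost of inconsistent placement concrete, here is a toy model in
Python. It is a sketch under assumptions, not a description of how Pig or
Hadoop move data: place_round_robin and rows_moved are hypothetical names,
and the model assumes that when two corresponding shards live on different
hosts, the smaller shard is copied once to the other host.

```python
def place_round_robin(num_shards, hosts, offset=0):
    """Hypothetical placement policy: shard i goes to host (i + offset)."""
    return [hosts[(i + offset) % len(hosts)] for i in range(num_shards)]


def rows_moved(left_shards, right_shards, left_hosts, right_hosts):
    """Count rows that cross the network to join shard i with shard i.

    Toy cost model (an assumption): if the two shards are on different
    hosts, the smaller one is copied; if they share a host, nothing moves.
    """
    moved = 0
    for i, (left, right) in enumerate(zip(left_shards, right_shards)):
        if left_hosts[i] != right_hosts[i]:
            moved += min(len(left), len(right))
    return moved


# Two tables already split into three corresponding shards each.
left_shards = [[(1, "a"), (5, "b")], [(2, "c")], [(3, "d"), (7, "e"), (11, "f")]]
right_shards = [[(1, "x")], [(2, "y"), (6, "z")], [(3, "w")]]
hosts = ["h0", "h1", "h2"]

# Same policy for both tables: corresponding shards share a host.
colocated = place_round_robin(3, hosts)
# A different policy for the second table: every shard is one host off.
shifted = place_round_robin(3, hosts, offset=1)

local_cost = rows_moved(left_shards, right_shards, colocated, colocated)
remote_cost = rows_moved(left_shards, right_shards, colocated, shifted)
```

With the same placement policy applied to both tables, the model moves no
rows at all; with mismatched policies, every shard pair pays a copy. Letting
the application choose the policy is what would make the first case
achievable by design rather than by accident.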



