[ 
https://issues.apache.org/jira/browse/PIG-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601122#action_12601122
 ] 

Pi Song commented on PIG-241:
-----------------------------

I think we would have to work below the abstraction provided by Hadoop in order 
to achieve such optimization.  This would mean Hadoop has to support direct 
control over physical file placement in its APIs.

My suggestion:-

One possible optimization from distributed database textbooks is fragment-aware 
relational algebra. Files on HDFS are small chunks which are already natural 
fragments. If we could :-
 - Cluster same or close keys in the same set of chunks.
 - Map sets of chunks to sets of key ranges using Metadata.
Then we should be able to save a fair amount of unnecessary processing.

> Sharding and joins
> ------------------
>
>                 Key: PIG-241
>                 URL: https://issues.apache.org/jira/browse/PIG-241
>             Project: Pig
>          Issue Type: New Feature
>          Components: data
>            Reporter: John DeTreville
>
> Many large distributed systems for storage and computing over tables divide 
> these tables into smaller _shards,_ such that all rows with the same 
> (primary) key will appear in the same shard. If two tables are consistently 
> sharded, then they can be joined shard-by-shard. If corresponding shards are 
> stored on the same hosts (or racks), then joins can be performed locally on 
> those hosts without copying the rows of the tables over the network; this can 
> produce significant speedups.
> Pig does not currently provide application-controlled sharding and the 
> associated shard placement and computation placement. The performance of 
> joins therefore suffers in many scenarios; rows are passed over the network 
> multiple times when performing a join. If Pig (and Hadoop) could provide the 
> ability for the application to shard tables consistently, according to an 
> application-controlled policy, joins could be completely local operations and 
> could in many cases perform much better.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to