[jira] [Commented] (PIG-2293) Pig should support a more efficient merge join against data sources that natively support point lookups or where the join is against large, sparse tables.

Aaron Klish (JIRA) Mon, 19 Sep 2011 13:51:38 -0700

    [ https://issues.apache.org/jira/browse/PIG-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108132#comment-13108132 ]


Aaron Klish commented on PIG-2293:
----------------------------------

I guess that would depend on the manner in which it was used.

The primary use case I am concerned about is a large, sparse table stored in 
HDFS.
In this scenario, the right side is still sorted, and access is still 
sequential.

However, that access consists of sequential point lookups rather than 
sequential block reads, and point lookups will perform better.





> Pig should support a more efficient merge join against data sources that 
> natively support point lookups or where the join is against large, sparse 
> tables.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2293
>                 URL: https://issues.apache.org/jira/browse/PIG-2293
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.9.0
>            Reporter: Aaron Klish
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The existing Pig merge join has the following limitations:
>    1. It assumes the right side of the join must be accessed sequentially, 
> record by record.
>    2. It does not perform well against large, sparse tables.
> The current implementation of the merge join introduced the interface 
> IndexableLoadFunc.  This 'LoadFunc'
> supports the ability to 'seekNear' a given key (before reading the next 
> record).  
> The merge join physical operator only calls 'seekNear' for the first key in 
> each split (effectively eliminating splits
> where the first and subsequent keys will not be found).  Subsequent matches 
> are found by reading sequentially through
> the records of the right table, looking for keys that match the left table.
> While this method works well for dense join tables, it performs poorly 
> against large, sparse tables or data sources that support 
> point lookups natively (HBase, for example).
> The proposed enhancement is to add a new join type, 'merge-sparse', to Pig 
> Latin.  When specified in the Pig script, this join type
> will cause the merge join operator to call seekNear on each and every key 
> (rather than just the first in each split).
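
To make the difference between the two strategies concrete, here is a minimal
Python sketch. It is a toy simulation, not Pig code: SortedRightTable,
seek_near, and merge_join are hypothetical names, the right side is an
in-memory sorted list standing in for a sorted split, and right-side keys are
assumed to be unique. It only counts how many right-side records each strategy
touches.

```python
from bisect import bisect_left


class SortedRightTable:
    """Hypothetical stand-in for a sorted right-side source (not Pig's IndexableLoadFunc)."""

    def __init__(self, rows):
        self.rows = sorted(rows)              # (key, value) pairs, unique keys assumed
        self.keys = [k for k, _ in self.rows]
        self.pos = 0                          # current read position
        self.seeks = 0                        # number of seek_near calls
        self.records_read = 0                 # number of distinct records fetched
        self._counted = -1                    # last position already counted as read

    def seek_near(self, key):
        """Jump forward to the first record whose key is >= the given key."""
        self.seeks += 1
        self.pos = bisect_left(self.keys, key, self.pos)

    def current(self):
        """Return the record at the current position, or None at end of data."""
        if self.pos >= len(self.rows):
            return None
        if self.pos != self._counted:
            self.records_read += 1
            self._counted = self.pos
        return self.rows[self.pos]

    def advance(self):
        self.pos += 1


def merge_join(left, right, seek_every_key=False):
    """Join sorted left rows against the right table.

    seek_every_key=False: seek once for the first key, then scan sequentially.
    seek_every_key=True:  seek before every key (the proposed sparse behaviour).
    """
    out = []
    for i, (lkey, lval) in enumerate(left):
        if i == 0 or seek_every_key:
            right.seek_near(lkey)
        rec = right.current()
        # Skip right-side records that sort before the left key.
        while rec is not None and rec[0] < lkey:
            right.advance()
            rec = right.current()
        if rec is not None and rec[0] == lkey:
            out.append((lkey, lval, rec[1]))
    return out


# A large right side joined against a few widely spaced left keys.
right_rows = [(k, f"r{k}") for k in range(1000)]
left_rows = [(10, "a"), (500, "b"), (990, "c")]

scan_table = SortedRightTable(right_rows)
seek_table = SortedRightTable(right_rows)
scan_result = merge_join(left_rows, scan_table)
seek_result = merge_join(left_rows, seek_table, seek_every_key=True)

print("seek once :", scan_table.seeks, "seeks,", scan_table.records_read, "records read")
print("seek each :", seek_table.seeks, "seeks,", seek_table.records_read, "records read")
```

Both runs produce the same join output. The seek-once run reads every
right-side record between the matching keys, while the seek-per-key run reads
only the records it lands on. The sketch ignores the cost of a seek itself,
which is what determines whether the trade is worthwhile for a given data
source.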

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
