Alan Gates commented on PIG-845:

Dmitry wrote> Would it make sense to expose this to the users via a 'CREATE 
INDEX' (or similar) command?
That way the index could be persisted, and the user could tell you to use an 
existing index instead of rescanning the data.

Ashutosh wrote> If we allow that then we also need to deal with managing and 
persisting the index. Once Owl is integrated, we could make use of that to do 
all this for Pig. Till then, we can continue creating index every time and as I 
said overhead of index creation is negligible as compared to run times of 
actual joins.

My thinking was that at some future point, Pig would automatically cache this 
sample the first time it creates it, so that subsequent joins on the same data 
set could reuse it without re-running the sampling step.  I'm hoping we can use 
Owl for that, as Ashutosh indicated.


Dmitry wrote> I am not sure about the approach of pushing sampling above 
filters. Have you guys benchmarked this? Seems like you'd wind up reading the 
whole file in the sample job if the filter is selective enough (and high filter 
selectivity would also make materialize->sample go much faster).

You want to build your index on the pre-filtered data because your index is 
telling you what block to look for the data in.  The fact that the filter may 
have removed that record doesn't matter.  It will either be in the block 
indicated in the index or not present.  Also, you want to avoid filtering and 
then building the index because it adds another write and read of the data (you 
have to filter, write the data to HDFS, then read it to build the index, then 
read it again to do the join).
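
To make the argument above concrete, here is a minimal sketch in Python (not 
Pig's actual implementation). It assumes the data is sorted on the join key and 
split into non-empty blocks; the names build_index, lookup, and keep are 
hypothetical and exist only for illustration. The index is built on the 
unfiltered data, and the filter is applied only when a block is read during 
the join.

```python
from bisect import bisect_left

def build_index(blocks):
    """Sparse index: the first join key of each (non-empty, sorted) block."""
    return [block[0][0] for block in blocks]

def lookup(blocks, index, key, keep):
    """Return rows matching `key` that also pass the filter `keep`.

    The index was built before filtering, so it only says which block(s)
    could hold the key. A filtered-out row is simply not returned.
    """
    # Start one block early in case the key's run began in the previous block.
    start = max(bisect_left(index, key) - 1, 0)
    rows = []
    for pos in range(start, len(blocks)):
        if index[pos] > key:
            break  # every later block starts past the key
        rows.extend(r for r in blocks[pos] if r[0] == key and keep(r))
    return rows

# Hypothetical data: three sorted blocks of (join_key, value) rows.
blocks = [
    [(1, "a"), (2, "b"), (3, "c")],
    [(4, "d"), (5, "e"), (6, "f")],
    [(7, "g"), (8, "h"), (9, "i")],
]
index = build_index(blocks)       # built on pre-filtered data

def keep(row):
    """Hypothetical filter: keep only rows with odd keys."""
    return row[0] % 2 == 1

found = lookup(blocks, index, 5, keep)    # key survives the filter
missing = lookup(blocks, index, 4, keep)  # key removed by the filter
```

Looking up key 4 still lands on the right block; the row is just absent after 
filtering, which is the "in the block indicated in the index or not present" 
case. No filtered copy of the data is ever written out and re-read.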

> -----------------------
>                 Key: PIG-845
>                 URL: https://issues.apache.org/jira/browse/PIG-845
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Ashutosh Chauhan
>         Attachments: merge-join-1.patch, merge-join-for-review.patch
> This join would work if the data for both tables is sorted on the join key.

