Alan Gates commented on PIG-845:
Dmitry wrote> Would it make sense to expose this to the users via a 'CREATE
INDEX' (or similar) command?
That way the index could be persisted, and the user could tell you to use an
existing index instead of rescanning the data.
Ashutosh wrote> If we allow that then we also need to deal with managing and
persisting the index. Once Owl is integrated, we could make use of that to do
all this for Pig. Till then, we can continue creating the index every time and,
as I said, the overhead of index creation is negligible compared to run times of
My thinking was that at some future point, Pig would automatically cache this
sample the first time it creates it, so that subsequent joins on the same data
set could make use of it without rerunning the sampling step. I'm hoping we can
use Owl for that, as Ashutosh indicated.
Dmitry wrote> I am not sure about the approach of pushing sampling above
filters. Have you guys benchmarked this? Seems like you'd wind up reading the
whole file in the sample job if the filter is selective enough (and high filter
selectivity would also make materialize->sample go much faster).
You want to build your index on the pre-filtered data because your index is
telling you which block to look in for the data. Whether the filter later
removes a given record doesn't matter: the record will either be in the block
indicated by the index or not present at all. Also, you want to avoid filtering and
then building the index because it adds another write and read of the data (you
have to filter, write the data to HDFS, then read it to build the index, then
read it again to do the join).
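
To illustrate the point about indexing the pre-filtered data, here is a minimal
Python sketch. It is not the code in the attached patches; the names
(build_index, lookup, keep) and the in-memory "blocks" are hypothetical
stand-ins for sorted HDFS blocks. The index records the first key of each
block, and the filter is applied only when a block is read during the join.

```python
from bisect import bisect_left

# Hypothetical sorted input, split into "blocks" of (key, value) records.
BLOCKS = [
    [(1, "a"), (3, "b")],
    [(5, "c"), (7, "d")],
    [(9, "e"), (11, "f")],
]


def build_index(blocks):
    """Sparse index over the UNFILTERED data: (first_key, block_number)."""
    return [(block[0][0], i) for i, block in enumerate(blocks) if block]


def lookup(blocks, index, key, keep):
    """Return records with this key that also pass the filter `keep`.

    Starts at the last block whose first key is smaller than `key` and
    scans forward, so runs of duplicate keys spanning blocks are found.
    """
    first_keys = [k for k, _ in index]
    start = max(bisect_left(first_keys, key) - 1, 0)
    matches = []
    for _, block_no in index[start:]:
        for rec in blocks[block_no]:
            if rec[0] > key:
                return matches  # sorted input: no later record can match
            if rec[0] == key and keep(rec):
                matches.append(rec)  # filter applied at read time
    return matches


def keep(rec):
    """Example filter (an assumption): drop the record whose value is 'd'."""
    return rec[1] != "d"


INDEX = build_index(BLOCKS)
```

With this setup, lookup(BLOCKS, INDEX, 7, keep) is sent to the block that would
hold key 7 and simply finds nothing there, because the filter removed that
record. No separate filtered copy of the data is written or indexed.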
> PERFORMANCE: Merge Join
> Key: PIG-845
> URL: https://issues.apache.org/jira/browse/PIG-845
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Ashutosh Chauhan
> Attachments: merge-join-1.patch, merge-join-for-review.patch
> This join would work if the data for both tables is sorted on the join key.