[ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742245#action_12742245 ]
Alan Gates commented on PIG-845:
--------------------------------

Dmitry wrote> Would it make sense to expose this to the users via a 'CREATE INDEX' (or similar) command? That way the index could be persisted, and the user could tell you to use an existing index instead of rescanning the data.

Ashutosh wrote> If we allow that then we also need to deal with managing and persisting the index. Once Owl is integrated, we could make use of that to do all this for Pig. Till then, we can continue creating index every time and as I said overhead of index creation is negligible as compared to run times of actual joins.

My thinking was that at some future point, Pig would automatically cache this sample the first time it creates it, so that subsequent joins on the same data set could make use of it without sampling again. I'm hoping we can use Owl for that, as Ashutosh indicated.

-----

Dmitry wrote> I am not sure about the approach of pushing sampling above filters. Have you guys benchmarked this? Seems like you'd wind up reading the whole file in the sample job if the filter is selective enough (and high filter selectivity would also make materialize->sample go much faster).

You want to build your index on the pre-filtered data because your index is telling you which block to look in for the data. The fact that the filter may have removed a record doesn't matter: the record will either be in the block indicated by the index or not present at all.

Also, you want to avoid filtering and then building the index because that adds another write and read of the data: you have to filter, write the data to HDFS, read it to build the index, and then read it again to do the join.
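To make the index-before-filter argument concrete, here is a rough sketch in Python. It is not the code in the attached patches. The block layout, the function names `build_index` and `lookup`, and the `keep` filter argument are all assumptions made for illustration, and it assumes each join key appears in only one block. The index stores the first key of each sorted block, and the filter is applied only when a block is read.

```python
from bisect import bisect_right

def build_index(blocks):
    """Sparse index over sorted, unfiltered data: (first key, block number)."""
    return [(block[0][0], i) for i, block in enumerate(blocks) if block]

def lookup(index, blocks, key, keep=lambda rec: True):
    """Return records matching `key`, applying the filter `keep` at read time.

    Hypothetical helper: assumes a key never spans two blocks.
    """
    first_keys = [k for k, _ in index]
    # Last block whose first key is <= key is the only block that can hold it.
    pos = bisect_right(first_keys, key) - 1
    if pos < 0:
        return []  # key sorts before all indexed data
    block_no = index[pos][1]
    return [rec for rec in blocks[block_no] if rec[0] == key and keep(rec)]

# Three "blocks" of (join key, value) records, sorted on the join key.
blocks = [[(1, "a"), (3, "b")], [(5, "c"), (7, "d")], [(9, "e")]]
index = build_index(blocks)

unfiltered = lookup(index, blocks, 7)
filtered = lookup(index, blocks, 7, keep=lambda rec: rec[1] != "d")
print(unfiltered, filtered)
```

With the filter in place the lookup for key 7 still goes to the same block and simply finds nothing there, so the index built on pre-filtered data stays valid and no intermediate filtered copy has to be written out.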
> PERFORMANCE: Merge Join
> -----------------------
>
> Key: PIG-845
> URL: https://issues.apache.org/jira/browse/PIG-845
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Ashutosh Chauhan
> Attachments: merge-join-1.patch, merge-join-for-review.patch
>
>
> This join would work if the data for both tables is sorted on the join key.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.