[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760710#action_12760710 ]
Pradeep Kamath commented on PIG-953:
------------------------------------

Dmitriy, I looked at the ResourceSchema proposed in http://wiki.apache.org/pig/LoadStoreRedesignProposal and also spoke with Alan to better understand the intent. The eventual goal is for the setSchema() call in StoreFunc to hand the ResourceSchema to the store implementation. The ResourceSchema will contain both the Pig schema information and the sort column information, so Zebra, or any other storage function that needs to know about sort columns, will get that information from the ResourceSchema passed to setSchema().

Today, however, the Pig runtime already conveys the Pig schema to store functions through StoreConfig. We need a separate way to pass sort information, since the Pig Schema cannot carry it. Once the load/store interfaces are rewritten, this problem will be solved through setSchema(), so whatever solution we come up with in this JIRA will need to be rewritten anyway.

It is therefore cleaner to keep only sort column information in SortColInfo and to have an array of SortColInfo in SortInfo. If we used ResourceSchema instead, StoreConfig would hold both a Pig Schema and a ResourceSchema, which would be confusing to callers.

In short, since this piece of code will need a rewrite later, it is better not to make it generic now and to address only the immediate needs. The rewrite should then remove the multiple representations of schema/sort information.
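To make the proposed shape concrete, here is a minimal sketch of the two classes, not the contents of the attached patch. Only the names SortColInfo and SortInfo come from the comment above. The fields (column name, column index, sort order), the Order enum, and the accessor names are assumptions for illustration, and a List stands in for the array mentioned above so that the contents can be copied defensively.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sort information for a single column. The field layout is an assumption. */
final class SortColInfo {
    /** Hypothetical enum for the direction in which the column is sorted. */
    enum Order { ASCENDING, DESCENDING }

    private final String colName; // may be null when the schema has no alias
    private final int colIndex;   // zero-based position of the column in the stored tuple
    private final Order order;

    SortColInfo(String colName, int colIndex, Order order) {
        if (colIndex < 0) {
            throw new IllegalArgumentException("colIndex must be >= 0");
        }
        if (order == null) {
            throw new IllegalArgumentException("order must not be null");
        }
        this.colName = colName;
        this.colIndex = colIndex;
        this.order = order;
    }

    String getColName() { return colName; }
    int getColIndex() { return colIndex; }
    Order getOrder() { return order; }
}

/** Ordered collection of sort columns, most significant sort key first. */
final class SortInfo {
    private final List<SortColInfo> sortCols;

    SortInfo(List<SortColInfo> sortCols) {
        // Defensive copy so later changes to the caller's list do not leak in.
        this.sortCols = Collections.unmodifiableList(new ArrayList<>(sortCols));
    }

    List<SortColInfo> getSortColInfoList() { return sortCols; }
}
```

Under these assumptions, a store function that cares about sort order would read the SortInfo next to the Pig Schema it already gets from StoreConfig, and neither class would need to know anything about ResourceSchema.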
> Enable merge join in pig to work with loaders and store functions which can
> internally index sorted data
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-953
>                 URL: https://issues.apache.org/jira/browse/PIG-953
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: PIG-953-2.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index
> on sorted data and use of that index to seek into the "right input" to
> efficiently perform the join operation. Some loaders (notably the zebra
> loader) internally implement an index on sorted data and can perform this
> seek efficiently using their index. So the use of the index needs to be
> abstracted in such a way that when the loader supports indexing, pig uses it
> (indirectly through the loader) and does not construct an index.