[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pradeep Kamath updated PIG-953:
-------------------------------

    Attachment: PIG-953.patch

Attached patch has the changes in pig to support interacting with Loaders which can construct the index internally. The main change is to hide the index access behind an interface. So a new interface called IndexableLoadFunc, which extends LoadFunc, has been introduced with the following two methods:
{code}
    /**
     * This method is called by the pig runtime to indicate
     * to the LoadFunc to position its underlying input stream
     * near the keys supplied as the argument. Specifically:
     * 1) if the keys are present in the input stream, the loadfunc
     * implementation should position its read position to
     * a record where the key(s) is/are the biggest key(s) less than
     * the key(s) supplied in the argument OR to the record with the
     * first occurrence of the key(s) supplied.
     * 2) if the key(s) are absent in the input stream, the implementation
     * should position its read position to a record where the key(s)
     * is/are the biggest key(s) less than the key(s) supplied OR to the
     * first record where the key(s) is/are the smallest key(s) greater
     * than the key(s) supplied.
     * The description above holds for descending order data in
     * a similar manner with "biggest" and "less than" replaced with
     * "smallest" and "greater than" and vice versa.
     *
     * @param keys Tuple with join keys (which are a prefix of the sort
     * keys of the input data). For example if the data is sorted on
     * columns in position 2,4,5 any of the following Tuples are
     * valid as an argument value:
     * (fieldAt(2))
     * (fieldAt(2), fieldAt(4))
     * (fieldAt(2), fieldAt(4), fieldAt(5))
     *
     * The following are some invalid cases:
     * (fieldAt(4))
     * (fieldAt(2), fieldAt(5))
     * (fieldAt(4), fieldAt(5))
     *
     * @throws IOException When the loadFunc is unable to position
     * to the required point in its input stream
     */
    public void seekNear(Tuple keys) throws IOException;

    /**
     * A method called by the pig runtime to give an opportunity
     * for implementations to perform cleanup actions like closing
     * the underlying input stream. This is necessary since while
     * performing a join the pig run time may determine that no further
     * join is possible with remaining records and may indicate to the
     * IndexableLoader to cleanup by calling this method.
     *
     * @throws IOException if the loadfunc is unable to perform
     * its close actions.
     */
    public void close() throws IOException;
{code}

The idea is that POMergeJoin will use seekNear to tell the loader to position itself at the correct point in the right input. To keep the POMergeJoin implementation simple, in the default case (where the loader, for example PigStorage, does not implement IndexableLoadFunc) a DefaultIndexableLoader will be used. It encapsulates the real loader and provides the implementation of IndexableLoadFunc's methods. In this case an index will be created as it is done currently, and DefaultIndexableLoader will use that index to implement IndexableLoadFunc's methods.

A SortInfo class containing the names of the sort columns and ascending/descending information is also introduced and will be available through StoreConfig. This will be useful for ZebraStore to determine whether the data it is writing out is sorted and to create an index appropriately.
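To make the seekNear contract above concrete, here is a minimal, self-contained sketch. It is not code from the patch: the class SeekNearSketch, the Seekable interface, SortedKeyLoader and firstIndexAtOrAfter are hypothetical names, keys are plain ints instead of Pig Tuples, and the data is an ascending in-memory array rather than an input stream. Of the two positions the javadoc allows, the sketch picks the "first occurrence" option for present keys and the "smallest key greater than" option for absent keys.

```java
import java.io.IOException;

class SeekNearSketch {

    /** Simplified stand-in for the two methods described above (int keys, not Tuples). */
    interface Seekable {
        void seekNear(int key) throws IOException;
        void close() throws IOException;
    }

    /**
     * Binary search over ascending keys: index of the first element >= key,
     * or sortedKeys.length when every element is smaller than key.
     */
    static int firstIndexAtOrAfter(int[] sortedKeys, int key) {
        int lo = 0;
        int hi = sortedKeys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedKeys[mid] < key) {
                lo = mid + 1;   // everything up to mid is too small
            } else {
                hi = mid;       // mid could be the answer
            }
        }
        return lo;
    }

    /** Hypothetical loader over ascending-sorted in-memory keys. */
    static class SortedKeyLoader implements Seekable {
        private final int[] keys;
        private int pos = 0;
        private boolean closed = false;

        SortedKeyLoader(int[] sortedKeys) {
            this.keys = sortedKeys.clone();
        }

        @Override
        public void seekNear(int key) throws IOException {
            if (closed) {
                throw new IOException("cannot seek: loader is closed");
            }
            pos = firstIndexAtOrAfter(keys, key);
        }

        /** Returns the next key, or null once the input is exhausted. */
        Integer next() throws IOException {
            if (closed) {
                throw new IOException("cannot read: loader is closed");
            }
            if (pos >= keys.length) {
                return null;
            }
            return keys[pos++];
        }

        @Override
        public void close() {
            // The join may call this early, once no further matches are possible.
            closed = true;
        }
    }

    public static void main(String[] args) throws IOException {
        SortedKeyLoader loader = new SortedKeyLoader(new int[] {1, 3, 3, 7, 9});
        loader.seekNear(3);
        System.out.println(loader.next()); // 3: first occurrence of a present key
        loader.seekNear(5);
        System.out.println(loader.next()); // 7: smallest key greater than an absent key
        loader.close();
    }
}
```

A join driver in this style would call seekNear once per left-side key range, read forward with next() until the keys pass the range, and call close() as soon as the left side is exhausted.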
> Enable merge join in pig to work with loaders which can internally index sorted data
> -------------------------------------------------------------------------------------
>
>                 Key: PIG-953
>                 URL: https://issues.apache.org/jira/browse/PIG-953
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index
> on sorted data and use of that index to seek into the "right input" to
> efficiently perform the join operation. Some loaders (notably the zebra
> loader) internally implement an index on sorted data and can perform this
> seek efficiently using their index. So the use of the index needs to be
> abstracted in such a way that when the loader supports indexing, pig uses it
> (indirectly through the loader) and does not construct an index.