[ https://issues.apache.org/jira/browse/PHOENIX-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269181#comment-14269181 ]
Gabriel Reid commented on PHOENIX-1561:
---------------------------------------

I'm not overly familiar with Pig internals, but I get the feeling that a number of additional constraints come into play for using merge joins on Phoenix tables or, more generally, for fulfilling the requirements of OrderedLoadFunc and/or CollectableLoadFunc.

As I understand it, the intention of the OrderedLoadFunc interface is to give Pig the guarantee that incoming records will be sorted within an input split, as well as to provide a method for sorting input splits. I think that these constraints will only be fulfilled if:

* a leading subset (or the full set) of the primary key fields is being read
* salt buckets are not being used on the table being read

PhoenixHBaseLoader can actually check these conditions, so maybe it would be better (if possible) to throw an exception (somewhere) if the OrderedLoadFunc functionality is being used when a leading subset of the primary key fields is not being read or the table has salt buckets.

I also noticed that the javadoc of CollectableLoadFunc#ensureAllKeyInstancesInSameSplit states

{quote}
When this method is called, Pig is communicating to the Loader that it must load data such that all instances of a key are in same split.
{quote}

I assume that the "keys" being referred to here are the key fields being used in a join, and not the key being read via the input format. If this is not the case, then there's probably another problem, as the key used in PhoenixHBaseLoader is NullWritable (meaning that the same key would be distributed over multiple splits).
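To make the two conditions above concrete, here is a minimal, self-contained sketch of the kind of check being suggested. It is not Phoenix or Pig code: the class name MergeJoinPreconditions, both method names, and the convention that a salt bucket count of 0 means "unsalted" are assumptions made for illustration. It also reads "leading subset" as "the key columns being read are exactly the first k primary key columns, in order", which may be stricter than what is actually required.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Phoenix or Pig.
final class MergeJoinPreconditions {

    private MergeJoinPreconditions() {
    }

    /**
     * Returns true when keyColumnsRead is non-empty and equals the first
     * keyColumnsRead.size() entries of primaryKeyColumns, in the same order.
     */
    static boolean isLeadingPrefix(List<String> primaryKeyColumns, List<String> keyColumnsRead) {
        if (keyColumnsRead.isEmpty() || keyColumnsRead.size() > primaryKeyColumns.size()) {
            return false;
        }
        return primaryKeyColumns.subList(0, keyColumnsRead.size()).equals(keyColumnsRead);
    }

    /**
     * Throws if either condition from the comment is violated.
     * Assumption: saltBuckets == 0 means the table is not salted.
     */
    static void checkOrderedLoadAllowed(List<String> primaryKeyColumns,
                                        List<String> keyColumnsRead,
                                        int saltBuckets) {
        if (saltBuckets > 0) {
            throw new IllegalStateException(
                    "Ordered load requested on a table with " + saltBuckets + " salt buckets");
        }
        if (!isLeadingPrefix(primaryKeyColumns, keyColumnsRead)) {
            throw new IllegalStateException(
                    "Columns " + keyColumnsRead + " are not a leading subset of primary key "
                            + primaryKeyColumns);
        }
    }

    public static void main(String[] args) {
        List<String> pk = Arrays.asList("HOST", "DOMAIN", "TS");

        // Leading subset of the primary key on an unsalted table: passes silently.
        checkOrderedLoadAllowed(pk, Arrays.asList("HOST", "DOMAIN"), 0);

        // Skipping the first primary key column: rejected.
        try {
            checkOrderedLoadAllowed(pk, Arrays.asList("DOMAIN"), 0);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Where such a check would live in PhoenixHBaseLoader, and how the loader would obtain the primary key column list and the salt bucket count, is left open here, as it is in the comment itself.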
> Pig optimized joins
> -------------------
>
> Key: PHOENIX-1561
> URL: https://issues.apache.org/jira/browse/PHOENIX-1561
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 4.2
> Reporter: Brian Johnson
> Assignee: Brian Johnson
> Attachments: 0001-PHOENIX-1561-Optimizing-Joins.patch, patch
>
>
> PhoenixHBaseLoader should implement both OrderedLoadFunc and
> CollectableLoadFunc just like HBaseStorage. There is nothing special that
> needs to be done other than implementing a single method. As in HBaseStorage,
> it is up to the user to ensure that the required constraints are not
> violated.
> {code:java}
> public void ensureAllKeyInstancesInSameSplit() throws IOException {
>     /**
>      * no-op because hbase keys are unique
>      * This will also work with things like DelimitedKeyPrefixRegionSplitPolicy
>      * if you need a partial key match to be included in the split
>      */
>     LOG.debug("ensureAllKeyInstancesInSameSplit");
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)