[ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557479#action_12557479 ]

Charlie Groves commented on PIG-55:
-----------------------------------

Ahh, I left it under impl because it's something that only matters to the 
MapReduce portions of Pig, but I guess all user-implementable interfaces are 
going to be in the top-level pig package?

I can see what you're saying about PigSplitFactory getting too much 
information, but my next step after this was going to be figuring out how to 
expose the fields actually used on the loaded values to the split factory.  
Since my data is broken out by columns, if I know the accessed fields I can 
load only the data necessary for those fields, which will be a huge speedup.  
I was thinking I could extract that information from groupbySpec and evalSpec.  
Is there a better way to do this?
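
Just to make that concrete, here's roughly the shape I'm picturing for the 
factory; all of the names and signatures below are illustrative, not what's 
in the attached patch:

    import java.io.IOException;
    import java.util.List;

    // Illustrative only -- none of these names/signatures are in the patch.
    // The idea: hand the factory just the fields the script actually touches,
    // so a column-per-file loader can skip the unused columns entirely.
    public interface PigSplitFactory {
        /**
         * @param fileName       the input the LOAD statement names
         * @param accessedFields positions of the fields the script actually
         *                       reads, pulled out of evalSpec/groupbySpec
         * @return splits that only need to materialize the listed fields
         */
        List<PigSplit> makeSplits(String fileName, int[] accessedFields)
                throws IOException;
    }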

Regardless of that, the JobConf can be accessed from the PigContext, and the 
index value isn't relevant outside of Pig's internals, so I can drop those 
parameters.  I can also remove the getEvalSpec, getGroupbySpec and getIndex 
methods from the PigSplit interface and handle that internally without 
encumbering user-created splits.  However, the PigSplit interface can't go 
away altogether, because the PigSplitFactory has to be able to return the 
actual splits so they can implement getLength and getLocations appropriately 
for the HDFS files they're loading, and so they can create the actual 
RecordReader with makeReader.  Since that's particular to the style of 
loading the split factory is implementing, there's no way to do it 
generically from Pig.
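
To spell out what would be left on the interface after that trimming, it'd be 
something along these lines (again, a sketch, the exact signatures in the 
patch may differ):

    import java.io.IOException;
    import org.apache.hadoop.mapred.RecordReader;

    // Sketch only -- with getEvalSpec/getGroupbySpec/getIndex handled
    // internally, all that's left is what a user-written split genuinely
    // has to provide: sizing and placement for the HDFS files it covers,
    // plus construction of its own reader.
    public interface PigSplit {
        /** Number of bytes this split covers, used for scheduling. */
        long getLength() throws IOException;

        /** Hosts holding the underlying HDFS blocks, for locality. */
        String[] getLocations() throws IOException;

        /** Create the RecordReader that produces this split's records. */
        RecordReader makeReader() throws IOException;
    }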

Another patch forthcoming along these lines.

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to 
> access from Pig.  This means I can't use LoadFunc to get at the data, as it 
> only allows the loader access to a single input stream at a time.  To handle 
> this usage, I've broken the existing split-creation code out into a few 
> classes and interfaces, and allowed user-specified load functions to be used 
> in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
