[
https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557479#action_12557479
]
Charlie Groves commented on PIG-55:
-----------------------------------
Ahh, I left it under impl because it's something that only matters to the
mapreduce portions of pig, but I guess all user-implementable interfaces are
going to be in the top level pig package?
I can see what you're saying about PigSplitFactory getting too much
information, but my next step after this was going to be to figure out how to
expose the actual fields used on the loaded values to the split factory. Since
my data is broken out by columns, if I know the accessed fields, I can load only
the data necessary for those fields, which will be a huge speedup. I was
thinking I could extract that data from groupbySpec and evalSpec. Is there a
better way to do this?
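To make the column idea concrete, here's a rough sketch of what I mean. The ColumnPruning class, the filesFor method and the "col-<index>" file naming are all made up for illustration; none of it is in pig or in the attached patch, and it assumes the accessed field indexes have already been worked out somehow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: not part of pig or of the attached patch.
final class ColumnPruning {
    /**
     * Returns the per-column files needed for the given field indexes.
     * Assumes one file per column named <dir>/col-<index>, an invented layout.
     */
    static List<String> filesFor(String dir, Set<Integer> accessedFields) {
        List<String> files = new ArrayList<>();
        // Sort the indexes so the result order doesn't depend on the Set type.
        for (int field : new TreeSet<>(accessedFields)) {
            files.add(dir + "/col-" + field);
        }
        return files;
    }
}
```

With a ten column dataset and a script that only touches fields 0 and 3, the split factory would open two files instead of ten.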
Regardless of that, the JobConf can be accessed from the PigContext, and the
index value doesn't bear any relevance outside of pig's internals, so I can
drop those parameters. I can also remove the getEvalSpec, getGroupbySpec and
getIndex methods from the PigSplit interface and handle that internally without
encumbering user-created splits. However, the PigSplit interface can't go away
altogether because the PigSplitFactory has to be able to return the actual
splits so they can handle the getLength and getLocations methods appropriately
for the hdfs files they're loading, and so they can create the actual
RecordReader with makeReader. Since that's particular to the style of
loading the split factory is implementing, there's no way to do it generically
from pig.
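Roughly, the trimmed-down shape I have in mind looks like the sketch below. Only the method names getLength, getLocations and makeReader come from the discussion above. The Sketch* type names, the parameter lists, the getSplits method and the FixedSplit class are placeholders I invented so this compiles without Hadoop on the classpath, so don't read them as the signatures the next patch will have.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Stand-in for Hadoop's RecordReader so the sketch compiles on its own.
interface SketchRecordReader {
    boolean next() throws IOException;
}

// What a user-created split would still have to provide (signatures assumed).
interface SketchPigSplit {
    long getLength() throws IOException;
    String[] getLocations() throws IOException;
    SketchRecordReader makeReader() throws IOException;
}

// The factory hands back the actual splits; getSplits is a hypothetical name.
interface SketchPigSplitFactory {
    List<SketchPigSplit> getSplits(String location) throws IOException;
}

// Toy split with a fixed length and host list, just to exercise the interfaces.
final class FixedSplit implements SketchPigSplit {
    private final long length;
    private final String[] hosts;

    FixedSplit(long length, String... hosts) {
        this.length = length;
        this.hosts = hosts;
    }

    public long getLength() { return length; }

    public String[] getLocations() { return hosts.clone(); }

    public SketchRecordReader makeReader() {
        // A reader with no records; a real one would open the hdfs files.
        return () -> false;
    }
}

// Toy factory returning one split per call, for illustration only.
final class SingleSplitFactory implements SketchPigSplitFactory {
    public List<SketchPigSplit> getSplits(String location) {
        List<SketchPigSplit> splits = new ArrayList<>();
        splits.add(new FixedSplit(128L, "host-a", "host-b"));
        return splits;
    }
}
```

The point is that getEvalSpec, getGroupbySpec and getIndex are gone from what the user implements, while the length, location and reader pieces stay with the split because only it knows how its files are laid out.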
Another patch forthcoming along these lines.
> Allow user control over split creation
> --------------------------------------
>
> Key: PIG-55
> URL: https://issues.apache.org/jira/browse/PIG-55
> Project: Pig
> Issue Type: Improvement
> Reporter: Charlie Groves
> Attachments: replaceable_PigSplit.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to
> access from pig. This means I can't use LoadFunc to get at the data as it
> only allows the loader access to a single input stream at a time. To handle
> this usage, I've broken the existing split creation code out into a few
> classes and interfaces, and allowed user specified load functions to be used
> in place of the existing code.