[ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559177#action_12559177 ]
Charlie Groves commented on PIG-55:
-----------------------------------
The openDFS removal was accidental. I should be able to add it back by adding
something to get at the JobConf from PigSplitWrapper.

I don't like that split returns a RecordReader where the first field is unused
either, but the dependency on Hadoop is locked in deeper than that.
PigSplitFactory needs to take a JobConf so its subclasses can get at the right
HDFS to look up the files in the location it needs to split. PigSplit itself
extends InputSplit, another Hadoop class, so if we're removing any references
to Hadoop, we'd need to make an interface like InputSplit that exposes
getLength and getLocations, since those things can't be figured out externally
from the split. We'd also need to have some concept like Writable so the split
can be sent over the wire. The RecordReader interface returned by the split
has the same problem: getPos, close, and getProgress need to be handled by
user code and can't be inferred by Pig.

I feel like the complexity added to make interfaces that are really similar to
Hadoop's is worse than the loss of generality from using Hadoop's interfaces,
especially when the outermost layer of code, PigSplitFactory, is going to need
access to one Hadoop class no matter what.
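
To make the cost concrete, here is a minimal sketch of what Hadoop-free
stand-ins might look like. The names PortableSplit, PortableReader, and
ColumnFileSplit are hypothetical and do not appear in the attached patches;
the method names simply mirror the ones mentioned above (getLength,
getLocations, getPos, close, getProgress) plus a Writable-style write and
readFields pair.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for InputSplit plus Writable: only the split itself
// knows its size, its preferred hosts, and how to serialize itself.
interface PortableSplit {
    long getLength() throws IOException;
    String[] getLocations() throws IOException;
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical stand-in for the bookkeeping half of RecordReader;
// close() is inherited from Closeable.
interface PortableReader extends Closeable {
    long getPos() throws IOException;
    float getProgress() throws IOException;
}

// Example split describing one column file (assumed layout, for illustration).
final class ColumnFileSplit implements PortableSplit {
    private String path = "";
    private long length;
    private String[] locations = new String[0];

    // No-arg constructor so a receiver can create an empty instance
    // and then call readFields.
    ColumnFileSplit() {}

    ColumnFileSplit(String path, long length, String[] locations) {
        this.path = path;
        this.length = length;
        this.locations = locations.clone();
    }

    String getPath() { return path; }

    @Override public long getLength() { return length; }

    @Override public String[] getLocations() { return locations.clone(); }

    @Override public void write(DataOutput out) throws IOException {
        out.writeUTF(path);
        out.writeLong(length);
        out.writeInt(locations.length);
        for (String host : locations) {
            out.writeUTF(host);
        }
    }

    @Override public void readFields(DataInput in) throws IOException {
        path = in.readUTF();
        length = in.readLong();
        locations = new String[in.readInt()];
        for (int i = 0; i < locations.length; i++) {
            locations[i] = in.readUTF();
        }
    }

    // Serializes a split to bytes and reads it back, simulating the wire trip.
    static ColumnFileSplit roundTrip(ColumnFileSplit split) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            split.write(new DataOutputStream(bytes));
            ColumnFileSplit copy = new ColumnFileSplit();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Even this small sketch ends up duplicating the shape of InputSplit, Writable,
and RecordReader almost method for method, which is the duplication the
comment above argues against.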
> Allow user control over split creation
> --------------------------------------
>
> Key: PIG-55
> URL: https://issues.apache.org/jira/browse/PIG-55
> Project: Pig
> Issue Type: Improvement
> Reporter: Charlie Groves
> Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to
> access from Pig. This means I can't use LoadFunc to get at the data, as it
> only allows the loader access to a single input stream at a time. To handle
> this usage, I've broken the existing split creation code out into a few
> classes and interfaces, and allowed user-specified load functions to be used
> in place of the existing code.