[ 
https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559177#action_12559177
 ] 

Charlie Groves commented on PIG-55:
-----------------------------------

The openDFS removal was accidental.  I should be able to add it back by adding 
something to get at the JobConf from PigSplitWrapper.

I don't like that split returns a RecordReader whose first field is unused 
either, but the dependency on Hadoop is locked in deeper than that.  
PigSplitFactory needs to take a JobConf so its subclasses can get at the right 
HDFS to look up the files in the location it needs to split.  PigSplit itself 
extends InputSplit, another Hadoop class, so if we're removing all references 
to Hadoop, we'd need to make an interface like InputSplit that exposes 
getLength and getLocations, since those things can't be figured out externally 
from the split.  We'd also need some concept like Writable so the split can be 
sent over the wire.  The RecordReader interface returned by the split has the 
same problem:  getPos, close, and getProgress need to be handled by user code 
and can't be inferred by Pig.  I feel like the complexity added by making 
interfaces that are really similar to Hadoop's is worse than the loss of 
generality from using Hadoop's interfaces, especially when the outermost layer 
of code, PigSplitFactory, is going to need access to one Hadoop class no matter 
what.
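
For a sense of what that would involve, here is a rough sketch of Hadoop-free 
stand-ins.  Every name in it (PigWritable, PigInputSplit, PigRecordReader, 
FileRangeSplit, SplitDemo) is hypothetical and appears in neither the attached 
patches nor Pig itself; it only illustrates how closely such interfaces would 
end up mirroring Writable, InputSplit and RecordReader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for Writable: lets a split be sent over the wire.
interface PigWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical stand-in for InputSplit: only the split knows these answers.
interface PigInputSplit extends PigWritable {
    long getLength() throws IOException;
    String[] getLocations() throws IOException;
}

// Hypothetical stand-in for RecordReader: position, progress and cleanup
// have to come from user code.
interface PigRecordReader<T> extends Closeable {
    boolean next(T value) throws IOException;
    long getPos() throws IOException;
    float getProgress() throws IOException;
    @Override
    void close() throws IOException;
}

// Toy split covering a byte range of a single file (illustration only).
class FileRangeSplit implements PigInputSplit {
    private String path = "";
    private long start;
    private long length;
    private String[] hosts = new String[0];

    // No-arg constructor so the receiving side can call readFields.
    FileRangeSplit() {}

    FileRangeSplit(String path, long start, long length, String[] hosts) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.hosts = hosts.clone();
    }

    String getPath() { return path; }
    long getStart() { return start; }

    @Override
    public long getLength() { return length; }

    @Override
    public String[] getLocations() { return hosts.clone(); }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(path);
        out.writeLong(start);
        out.writeLong(length);
        out.writeInt(hosts.length);
        for (String host : hosts) {
            out.writeUTF(host);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        path = in.readUTF();
        start = in.readLong();
        length = in.readLong();
        hosts = new String[in.readInt()];
        for (int i = 0; i < hosts.length; i++) {
            hosts[i] = in.readUTF();
        }
    }
}

class SplitDemo {
    // Serializes a split to bytes and reads it back, as a remote task would.
    static FileRangeSplit roundTrip(FileRangeSplit split) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            split.write(new DataOutputStream(bytes));
            FileRangeSplit copy = new FileRangeSplit();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Even this toy version reproduces most of the Hadoop method signatures 
one-for-one, which is the duplication I'd rather avoid.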

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to 
> access from pig.  This means I can't use LoadFunc to get at the data as it 
> only allows the loader access to a single input stream at a time.  To handle 
> this usage, I've broken the existing split creation code out into a few 
> classes and interfaces, and allowed user-specified load functions to be used 
> in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
