[ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561856#action_12561856 ]
Benjamin Reed commented on PIG-55:
----------------------------------
Sorry to take so long to comment. I was hoping to take a swipe at this, but I
haven't been able to find the time.
We cannot expose Hadoop classes in Pig. Pig runs on other backends as well, and
we don't want to pull all of Hadoop along with us.
Antonio has a generalized file access layer (PIG-32) that we should integrate
with. PigSplit is an internal class specific to Hadoop, so we shouldn't expose
it.
At a higher level, there is something else I would like to be able to do as
well: multi-file splits. The notion that a split never spans more than one file
is problematic when files are small. It seems like we should be more flexible in
that area. We also need fileless splits for load functions that generate tuples
"from thin air".
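To make the multi-file and fileless cases concrete, here is a rough sketch of
what a backend-neutral split could look like. All of the names below
(FileSlice, GenericSplit, SplitPacker) are hypothetical and exist in neither Pig
nor Hadoop, and the greedy packing rule is only an assumption about how small
files might be grouped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One contiguous byte range of one file. Hypothetical; not a Pig or Hadoop class. */
final class FileSlice {
    final String path;   // location in whatever storage layer the backend uses
    final long start;    // offset of the first byte to read
    final long length;   // number of bytes to read

    FileSlice(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }
}

/** A split made of zero or more slices, with no reference to Hadoop types. */
final class GenericSplit {
    private final List<FileSlice> slices;

    GenericSplit(List<FileSlice> slices) {
        // Defensive copy so later changes to the caller's list do not leak in.
        this.slices = Collections.unmodifiableList(new ArrayList<>(slices));
    }

    List<FileSlice> slices() {
        return slices;
    }

    /** A fileless split has no slices; the load function generates its own tuples. */
    boolean isFileless() {
        return slices.isEmpty();
    }

    long totalLength() {
        long total = 0;
        for (FileSlice s : slices) {
            total += s.length;
        }
        return total;
    }
}

/** Greedily groups whole files into splits of roughly targetBytes each. */
final class SplitPacker {
    private SplitPacker() {}

    static List<GenericSplit> pack(List<FileSlice> files, long targetBytes) {
        List<GenericSplit> splits = new ArrayList<>();
        List<FileSlice> current = new ArrayList<>();
        long currentBytes = 0;
        for (FileSlice f : files) {
            // Close the current split if adding this file would exceed the target.
            if (!current.isEmpty() && currentBytes + f.length > targetBytes) {
                splits.add(new GenericSplit(current));
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(f);
            currentBytes += f.length;
        }
        if (!current.isEmpty()) {
            splits.add(new GenericSplit(current));
        }
        return splits;
    }
}
```

With files of 10, 20, 50 and 100 bytes and a 64-byte target, pack() groups the
first two files into one split and gives each of the other two its own split. A
file larger than the target is kept whole here; cutting it into several slices
would be a separate step. A fileless split is simply a GenericSplit built from
an empty list.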
> Allow user control over split creation
> --------------------------------------
>
> Key: PIG-55
> URL: https://issues.apache.org/jira/browse/PIG-55
> Project: Pig
> Issue Type: Improvement
> Reporter: Charlie Groves
> Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to
> access from pig. This means I can't use LoadFunc to get at the data as it
> only allows the loader access to a single input stream at a time. To handle
> this usage, I've broken the existing split creation code out into a few
> classes and interfaces, and allowed user specified load functions to be used
> in place of the existing code.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.