[
https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Charlie Groves updated PIG-55:
------------------------------
Attachment: pig_chunker_split_v3.patch
pig_chunker_split_v3.patch adds a simple test for the new functionality. I've
also renamed the new interfaces from Chunk and Chunker to Slice and Slicer. I
like the new names better since slice is actually a verb, so Slicer makes more
sense as a noun than Chunker.
I've also made a couple of changes to file handling in pig to allow the Slicer
to fully control the loading process. First, I moved the existence check on the
location passed to a LOAD statement from the parser to the new default loading
code, PigSlicer. I don't think it makes much sense for the parser to be
checking the filesystem, and this allows the creation of Slicers that don't
have any files on the filesystem, like the one I made for the test. Code using
the existing LOAD mechanisms will still fail with an exception if the file
doesn't exist, before the map reduce job actually starts. Second, I removed
the absolutizing of file paths in MapreducePlanCompiler. There's no need for
it to be done there, since DataStorage will correctly absolutize a location
whenever it's given a relative path. It also doesn't make sense to prepend a
file path to a location that isn't on the file system, so the decision to
absolutize needs to be left up to the Slicer.
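To illustrate why the existence check belongs with the Slicer rather than the
parser, here is a rough sketch. The interfaces and class names below
(Slice.describe, Slicer.validate, FileCheckingSlicer, RangeSlicer) are
simplified, hypothetical stand-ins and not the signatures in the patch; the
"filesystem" is just a set of known paths. The point is only that each Slicer
decides for itself what a valid location is.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified interfaces; NOT the ones defined in the patch.
interface Slice {
    String describe();
}

interface Slicer {
    // Throws IllegalArgumentException if the location cannot be loaded.
    void validate(String location);

    List<Slice> slice(String location);
}

// Stand-in for a default file-backed slicer: the location must exist.
// The set of existing paths is an assumption that replaces a real filesystem.
class FileCheckingSlicer implements Slicer {
    private final Set<String> existingFiles;

    FileCheckingSlicer(Set<String> existingFiles) {
        this.existingFiles = existingFiles;
    }

    public void validate(String location) {
        if (!existingFiles.contains(location)) {
            throw new IllegalArgumentException(location + " does not exist");
        }
    }

    public List<Slice> slice(String location) {
        validate(location);
        // One slice covering the whole file, for simplicity.
        return Collections.<Slice>singletonList(() -> "file " + location);
    }
}

// A slicer with no files at all: the "location" is a count, and each
// slice stands for one value in the range [0, count).
class RangeSlicer implements Slicer {
    public void validate(String location) {
        // parseInt throws NumberFormatException (an IllegalArgumentException)
        // when the location is not a number.
        int count = Integer.parseInt(location);
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
    }

    public List<Slice> slice(String location) {
        validate(location);
        int count = Integer.parseInt(location);
        List<Slice> slices = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            final int value = i;
            slices.add(() -> "value " + value);
        }
        return slices;
    }
}
```

With a parser-level existence check, a location like "3" for RangeSlicer would
be rejected before the Slicer ever saw it. With the check inside the
file-backed Slicer, a missing file is still rejected up front.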
> Allow user control over split creation
> --------------------------------------
>
> Key: PIG-55
> URL: https://issues.apache.org/jira/browse/PIG-55
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.0.0
> Reporter: Charlie Groves
> Fix For: 0.1.0
>
> Attachments: pig_chunker_split.patch, pig_chunker_split_v2.patch,
> pig_chunker_split_v3.patch, replaceable_PigSplit.diff,
> replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored as one file per column that I'd like
> to access from pig. This means I can't use LoadFunc to get at the data, as it
> only allows the loader access to a single input stream at a time. To handle
> this usage, I've broken the existing split creation code out into a few
> classes and interfaces, and allowed user-specified load functions to be used
> in place of the existing code.