[
https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567085#action_12567085
]
Benjamin Reed commented on PIG-55:
----------------------------------
I just went over this with Antonio this morning. I think the functionality is
very important, but there are a couple of things that bother me.
a) The biggest one is the dependence on Hadoop classes. I think that is the
easiest to fix.
b) Do Split factories also need to figure out where to schedule things? It
seems very platform specific. In general it seems that specifying the files
used by the split will allow Pig to figure out the best way to place the
processing.
c) Another one is the binding to load functions. It's reasonable to say that
LoadFunctions should know how to split, but the binding seems tight. For
example, if you have a file of URLs separated by line feeds, do you want to
have to write a new LoadFunction just so that you can split it in a different
way (finer cuts, for example)?
d) Do you also want to put all the logic to handle compressed files in each
split factory? Potentially you may want to chain splits together: one stage
chops at block/compression boundaries, followed by another that chops even
finer or merges splits.
I'm not sure how to address c) and d), but for a) and b) I think we can tweak
your proposal slightly:
{noformat}
class FileChunk {
    long length;
    long offset;
    String filename;
}

interface Split extends Serializable {
    FileChunk[] getChunks();
}

interface SplitFactory {
    Split[] getSplits(String input);
}
{noformat}
We would include the logic you propose: check whether the LoadFunction
implements SplitFactory and, if it does, use it to generate the splits. I
think this is generic. The FileChunks let us do placement without requiring
the splits to worry about block locations or DFS-specific details.
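That check could look something like the following sketch. SplitPlanner and defaultSplits are made-up names just to illustrate the dispatch, not actual Pig code; the FileChunk/Split/SplitFactory shapes are repeated from the proposal above so the sketch is self-contained.

```java
import java.io.Serializable;

// Shapes from the proposal above, repeated so this compiles on its own.
class FileChunk implements Serializable {
    long length;
    long offset;
    String filename;

    FileChunk(String filename, long offset, long length) {
        this.filename = filename;
        this.offset = offset;
        this.length = length;
    }
}

interface Split extends Serializable {
    FileChunk[] getChunks();
}

interface SplitFactory {
    Split[] getSplits(String input);
}

// Hypothetical planner: if the load function also implements SplitFactory,
// ask it for the splits; otherwise fall back to the existing splitter.
class SplitPlanner {
    static Split[] splitsFor(Object loadFunc, String input) {
        if (loadFunc instanceof SplitFactory) {
            return ((SplitFactory) loadFunc).getSplits(input);
        }
        return defaultSplits(input);
    }

    // Placeholder standing in for the current block-boundary split logic.
    static Split[] defaultSplits(String input) {
        return new Split[0];
    }
}
```

The point is that the load function opts in by implementing one extra interface; everything else keeps working unchanged.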
I'm wondering about conveying splittability information for compressed files.
We can split bzipped files, and soon we will be able to split some kinds of
gzipped files, so we need a nice way of conveying that information to the
SplitFactory.
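One way to convey it (the names SplitHints, HintedSplitFactory, and BoundarySplitFactory are purely hypothetical, just thinking about the shape of the API) would be for the framework to compute the legal cut points per file and hand them to the factory:

```java
import java.io.Serializable;

// Same chunk/split shapes as in the proposal above.
class FileChunk implements Serializable {
    long length;
    long offset;
    String filename;

    FileChunk(String filename, long offset, long length) {
        this.filename = filename;
        this.offset = offset;
        this.length = length;
    }
}

interface Split extends Serializable {
    FileChunk[] getChunks();
}

// Hypothetical: the framework tells the factory whether a file can be split
// at all, and at which offsets a cut is legal (e.g. bzip2 block boundaries).
interface SplitHints {
    boolean isSplittable(String filename);
    long[] safeBoundaries(String filename); // sorted, includes 0 and file length
}

interface HintedSplitFactory {
    Split[] getSplits(String input, SplitHints hints);
}

// A factory that only cuts where the hints say it is safe to.
class BoundarySplitFactory implements HintedSplitFactory {
    public Split[] getSplits(String input, SplitHints hints) {
        long[] cuts = hints.safeBoundaries(input);
        if (!hints.isSplittable(input)) {
            // Unsplittable file: a single chunk spanning the whole file.
            cuts = new long[] { cuts[0], cuts[cuts.length - 1] };
        }
        Split[] splits = new Split[cuts.length - 1];
        for (int i = 0; i < splits.length; i++) {
            final FileChunk chunk =
                new FileChunk(input, cuts[i], cuts[i + 1] - cuts[i]);
            splits[i] = new Split() {
                public FileChunk[] getChunks() {
                    return new FileChunk[] { chunk };
                }
            };
        }
        return splits;
    }
}
```

That keeps the compression knowledge in one place instead of duplicating it in every split factory, which is really what d) is complaining about.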
I am a bit stuck on separating splitting from parsing. I'm not proposing the
following, but rather thinking out loud:
{noformat}
A = chop 'filename' using ChopFunction();
B = load A using ParseFunction();
C = group B by $1;
store C into 'blah';
{noformat}
or simply
{noformat}
store (group (load (chop 'filename' using ChopFunction()) using ParseFunction()) by $1) into 'blah';
{noformat}
(We would need to use "chop" since "split" is already a keyword.)
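The chaining in d), which the chop/parse separation above is really a syntax for, could take a shape like this. ChainedSplitFactory and Refiner are invented names, and this is only a sketch of the composition, not a concrete proposal:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Minimal shapes from the proposal, repeated so this compiles on its own.
class FileChunk implements Serializable {
    long length;
    long offset;
    String filename;

    FileChunk(String filename, long offset, long length) {
        this.filename = filename;
        this.offset = offset;
        this.length = length;
    }
}

interface Split extends Serializable {
    FileChunk[] getChunks();
}

interface SplitFactory {
    Split[] getSplits(String input);
}

// Hypothetical second stage: takes one coarse split and cuts it finer
// (or merges chunks; the interface does not care which).
interface Refiner {
    Split[] refine(Split coarse);
}

// Runs a coarse factory (e.g. block/compression boundaries), then lets a
// refiner rework each split, so neither stage needs to know about the other.
class ChainedSplitFactory implements SplitFactory {
    private final SplitFactory coarse;
    private final Refiner finer;

    ChainedSplitFactory(SplitFactory coarse, Refiner finer) {
        this.coarse = coarse;
        this.finer = finer;
    }

    public Split[] getSplits(String input) {
        List<Split> out = new ArrayList<Split>();
        for (Split s : coarse.getSplits(input)) {
            for (Split refined : finer.refine(s)) {
                out.add(refined);
            }
        }
        return out.toArray(new Split[0]);
    }
}
```

With something like this, the block/compression chopper and the finer chopper stay separate pieces, and the LoadFunction never has to know about either.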
> Allow user control over split creation
> --------------------------------------
>
> Key: PIG-55
> URL: https://issues.apache.org/jira/browse/PIG-55
> Project: Pig
> Issue Type: Improvement
> Reporter: Charlie Groves
> Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to
> access from pig. This means I can't use LoadFunc to get at the data as it
> only allows the loader access to a single input stream at a time. To handle
> this usage, I've broken the existing split creation code out into a few
> classes and interfaces, and allowed user specified load functions to be used
> in place of the existing code.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.