[ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567085#action_12567085 ]

Benjamin Reed commented on PIG-55:
----------------------------------

I just went over this with Antonio this morning. I think the functionality is 
very important, but there are a couple of things that bother me.

a) The biggest one is the dependence on Hadoop classes. I think that is the 
easiest to fix.
b) Do Split factories also need to figure out where to schedule things? That 
seems very platform-specific. In general, specifying the files used by the 
split should let Pig figure out the best way to place the processing.
c) Another one is the binding to load functions. It's reasonable to say that 
LoadFunctions should know how to split, but the binding seems tight. For 
example, if you have a set of URLs in a file separated with line feeds do you 
want to have to write a new LoadFunction just so that you can split it in a 
different way (maybe finer cuts for example)?
d) Do you also want to put all the logic to handle compressed files in each 
split factory? Potentially you may want to combine split factories: one chops 
at block/compression boundaries, followed by another that chops even finer or 
perhaps merges splits together.
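To make the combining idea in d) concrete, here is a rough sketch (all class names are made up, not a proposal): a refining factory wraps a boundary-aware factory and subdivides each of its splits into pieces of at most a given length.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class FileChunk {
    final String filename;
    final long offset;
    final long length;
    FileChunk(String filename, long offset, long length) {
        this.filename = filename; this.offset = offset; this.length = length;
    }
}

interface Split extends Serializable {
    FileChunk[] getChunks();
}

interface SplitFactory {
    Split[] getSplits(String input);
}

// A simple Split holding a single chunk.
class ChunkSplit implements Split {
    private final FileChunk chunk;
    ChunkSplit(FileChunk chunk) { this.chunk = chunk; }
    public FileChunk[] getChunks() { return new FileChunk[] { chunk }; }
}

// Wraps another factory and cuts each of its chunks into pieces of
// at most maxLength bytes (hypothetical; not in the patch).
class RefiningSplitFactory implements SplitFactory {
    private final SplitFactory coarse;
    private final long maxLength;
    RefiningSplitFactory(SplitFactory coarse, long maxLength) {
        this.coarse = coarse; this.maxLength = maxLength;
    }
    public Split[] getSplits(String input) {
        List<Split> out = new ArrayList<>();
        for (Split s : coarse.getSplits(input)) {
            for (FileChunk c : s.getChunks()) {
                for (long off = 0; off < c.length; off += maxLength) {
                    long len = Math.min(maxLength, c.length - off);
                    out.add(new ChunkSplit(new FileChunk(c.filename, c.offset + off, len)));
                }
            }
        }
        return out.toArray(new Split[0]);
    }
}

public class ComposeSplits {
    public static void main(String[] args) {
        // Coarse factory: one 256-byte split per input (made-up numbers,
        // standing in for block/compression boundaries).
        SplitFactory blocks = input -> new Split[] {
            new ChunkSplit(new FileChunk(input, 0, 256))
        };
        SplitFactory finer = new RefiningSplitFactory(blocks, 100);
        Split[] splits = finer.getSplits("urls.txt");
        System.out.println(splits.length); // 100 + 100 + 56 bytes -> 3 splits
    }
}
```

The point is only that splitting stages compose cleanly when each one consumes and produces the same Split abstraction.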

I'm not sure how to address c) and d), but for a) and b) I think we can tweak 
your proposal slightly:

{noformat}
class FileChunk {
    long length;
    long offset;
    String filename;
}

interface Split extends Serializable {
    FileChunk[] getChunks();
}

interface SplitFactory {
    Split[] getSplits(String input);
}

{noformat}

We would include the logic that you propose: check whether the LoadFunction 
implements SplitFactory, and if so use it to generate the splits. I think this 
is generic. The FileChunk list lets us do placement without requiring the 
splits to worry about block locations or DFS-specific stuff.
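The dispatch could be as simple as the following sketch (the fallback factory and class names are hypothetical, just to show the `instanceof` check; not code from the patch):

```java
import java.io.Serializable;

class FileChunk {
    final String filename;
    final long offset;
    final long length;
    FileChunk(String filename, long offset, long length) {
        this.filename = filename; this.offset = offset; this.length = length;
    }
}

interface Split extends Serializable {
    FileChunk[] getChunks();
}

interface SplitFactory {
    Split[] getSplits(String input);
}

// Stand-in for a LoadFunction that also knows how to split its input.
class SplittingLoadFunc implements SplitFactory {
    public Split[] getSplits(String input) {
        // Trivial example: one split covering the whole input.
        final FileChunk whole = new FileChunk(input, 0, Long.MAX_VALUE);
        return new Split[] {
            new Split() {
                public FileChunk[] getChunks() { return new FileChunk[] { whole }; }
            }
        };
    }
}

public class SplitDispatch {
    // Use the load function's own splits if it is a SplitFactory,
    // otherwise fall back to a default factory.
    static Split[] splitsFor(Object loadFunc, SplitFactory fallback, String input) {
        if (loadFunc instanceof SplitFactory) {
            return ((SplitFactory) loadFunc).getSplits(input);
        }
        return fallback.getSplits(input);
    }

    public static void main(String[] args) {
        SplitFactory fallback = input -> new Split[0];
        Split[] splits = splitsFor(new SplittingLoadFunc(), fallback, "part-0000");
        System.out.println(splits.length);                     // 1
        System.out.println(splits[0].getChunks()[0].filename); // part-0000
    }
}
```

Since Split only exposes FileChunks, the placement logic stays on Pig's side and the factory never touches Hadoop or DFS classes.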

I'm wondering about conveying splittable information about compressed files. We 
can split bzipped files and soon we will be able to do some kinds of gzipped 
files, so we need a nice way of conveying that information to the SplitFactory.

I am a bit stuck on separating splitting from parsing. I'm not proposing the 
following, but rather thinking out loud:

{noformat}
A = chop 'filename' using ChopFunction();
B = load A using ParseFunction();
C = group B by $1;
store C into 'blah';
{noformat}

or simply
{noformat}
store (group (load (chop 'filename' using ChopFunction()) using ParseFunction()) 
by $1) into 'blah';
{noformat}

(We would need to use "chop" since "split" is already a keyword.)

> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: replaceable_PigSplit.diff, replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to 
> access from pig.  This means I can't use LoadFunc to get at the data as it 
> only allows the loader access to a single input stream at a time.  To handle 
> this usage, I've broken the existing split creation code out into a few 
> classes and interfaces, and allowed user specified load functions to be used 
> in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
