[ 
https://issues.apache.org/jira/browse/PIG-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Fouse updated PIG-3258:
----------------------------

    Attachment: MultiStorageMultiIndex.patch

Thanks; attaching now.  This includes the modifications to MultiStorage as well 
as TestMultiStorage.

In a nutshell, the idea is to let the second parameter, splitFieldIndex, 
take a value such as "2/1/3", which would mean: create the first-level 
subdirectory from the value at index 2, the second level from the value at 
index 1, and the third level from the value at index 3.  In the process I 
also cleaned up and expanded the class-level javadoc to make it more 
readable and to introduce the new capability.

One potential change I haven't made yet, but would like feedback on, is the 
resulting filename pattern.  With regular PigStorage output, the files look 
like /path/to/files/part-r-0001.  MultiStorage, however, currently uses the 
value in the field specified by splitFieldIndex both to create the subfolder 
and to name the file, e.g. /path/to/files/a1/a1-0001.  That's okay for a 
single level, but once it supports several levels of indexes and folders, you 
could end up with a path like 
/path/to/files/Monday/breakfast/red/apples/Monday-breakfast-red-apples-0000.  
The more levels the user splits on, the more unwieldy the filename gets.  
Does it make sense to keep including those values in the filename, or should 
it (as I would prefer) behave like PigStorage and simply name the files 
something like "part-r-[taskid]"?  Or do the output filenames need to be 
unique across the whole job?  That might make some unfortunate sense.  I 
would appreciate any insight on this.
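
For comparison, the two candidate naming schemes can be put side by side in 
a small sketch.  Again this is not taken from the patch: the class and method 
names are hypothetical, and the four-digit task suffix simply mirrors the 
examples above rather than any real Hadoop or Pig format.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: compares two candidate naming schemes.
// Neither method is taken from MultiStorage or from the patch.
final class OutputNameSketch {

    // Value-based style: the split values joined with '-' plus a task suffix.
    static String valueBasedName(List<String> levels, int taskId) {
        return String.join("-", levels) + "-" + String.format("%04d", taskId);
    }

    // PigStorage-like style: the name does not depend on the field values.
    static String partStyleName(int taskId) {
        return "part-r-" + String.format("%04d", taskId);
    }

    // Full output path: base directory, one folder per level, then the file.
    static String fullPath(String base, List<String> levels, String fileName) {
        return base + "/" + String.join("/", levels) + "/" + fileName;
    }

    public static void main(String[] args) {
        List<String> levels = Arrays.asList("Monday", "breakfast", "red", "apples");
        String base = "/path/to/files";
        // prints /path/to/files/Monday/breakfast/red/apples/Monday-breakfast-red-apples-0000
        System.out.println(fullPath(base, levels, valueBasedName(levels, 0)));
        // prints /path/to/files/Monday/breakfast/red/apples/part-r-0000
        System.out.println(fullPath(base, levels, partStyleName(0)));
    }
}
```

In this sketch the full paths stay distinct under either scheme because the 
directory part differs; whether anything in the job needs the bare filenames 
themselves to be unique is the open question, and the sketch does not answer 
it.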
                
> Patch to allow MultiStorage to use more than one index to generate output tree
> ------------------------------------------------------------------------------
>
>                 Key: PIG-3258
>                 URL: https://issues.apache.org/jira/browse/PIG-3258
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Joel Fouse
>            Priority: Minor
>              Labels: piggybank
>         Attachments: MultiStorageMultiIndex.patch
>
>
> I have made a patch to enable MultiStorage to handle multiple tuple indexes, 
> rather than only one, for generating the output directory structure.  Before 
> I submit it, though, I need to know if I should generate the patch from 
> /contrib/piggybank/java where I've been compiling and unit testing, or back 
> at the project root.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
