[
https://issues.apache.org/jira/browse/PIG-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joel Fouse updated PIG-3258:
----------------------------
Attachment: MultiStorageMultiIndex.patch
Thanks; attaching now. This includes the modifications to MultiStorage as well
as TestMultiStorage.
In a nutshell, the idea is to let the second parameter, splitFieldIndex,
accept a value such as "2/1/3", which would mean: create the first-level
subdirectory from the value in index 2, the second level from the value in
index 1, and the third level from the value in index 3. In the process I also
cleaned up and expanded the class-level javadoc to make it more readable and
to introduce the new capability.
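To illustrate the "2/1/3" idea, here is a minimal Java sketch. It is not taken
from the attached patch: the class and method names are hypothetical, and it
assumes the indexes are zero-based positions into a record's fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not code from the patch.
class SplitPathSketch {

    // Parse a spec such as "2/1/3" into the list of field indexes [2, 1, 3].
    static List<Integer> parseSplitFieldIndexes(String spec) {
        List<Integer> indexes = new ArrayList<>();
        for (String part : spec.split("/")) {
            indexes.add(Integer.parseInt(part.trim()));
        }
        return indexes;
    }

    // Build the relative subdirectory for one record: one level per index,
    // in the order the indexes were given (assumed zero-based).
    static String subdirectoryFor(List<String> fields, List<Integer> indexes) {
        StringBuilder path = new StringBuilder();
        for (int index : indexes) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(fields.get(index));
        }
        return path.toString();
    }

    public static void main(String[] args) {
        List<String> record = Arrays.asList("apples", "breakfast", "Monday", "red");
        List<Integer> indexes = parseSplitFieldIndexes("2/1/3");
        // prints "Monday/breakfast/red"
        System.out.println(subdirectoryFor(record, indexes));
    }
}
```

A single index such as "2" yields a one-level path in this sketch, which
corresponds to the single-index behavior described above.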
One potential change I haven't made yet but would like feedback on is the
resulting filename pattern. When using regular PigStorage for output, the
files look like /path/to/files/part-r-0001. Currently, though, MultiStorage
uses the value in the field specified by splitFieldIndex both to create the
subfolder and to name the file, e.g. /path/to/files/a1/a1-0001. That's okay,
but now that it supports multiple index levels and nested folders, you could
end up with a file path like
/path/to/files/Monday/breakfast/red/apples/Monday-breakfast-red-apples-0000.
Depending on how many levels the user wants to break things out into, the
filename could start to get rather unwieldy. Does it make sense to continue to
include those values in the filename, or should it (as I would prefer) exhibit
the same behavior as PigStorage and simply name the files as something like
"part-r-[taskid]"? Or is it that the output filenames need to be unique within
the context of the whole job? That might make some unfortunate sense. I
appreciate any insight in this regard.
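The two naming options can be compared side by side with a small Java sketch.
This is an illustration only: the class, the method names, and the four-digit
task suffix (copied from the examples above) are assumptions, not MultiStorage
or PigStorage code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the two naming options under discussion.
class OutputNameSketch {

    // Current style extended to several levels: the split values appear
    // both in the directory tree and in the file name.
    static String valuePrefixedPath(String base, List<String> values, int taskId) {
        return base + "/" + String.join("/", values)
                + "/" + String.join("-", values)
                + String.format("-%04d", taskId);
    }

    // PigStorage-like style: the split values appear only in the directory
    // tree, and the file name is "part-r-" plus the task id.
    static String partStylePath(String base, List<String> values, int taskId) {
        return base + "/" + String.join("/", values)
                + String.format("/part-r-%04d", taskId);
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("Monday", "breakfast", "red", "apples");
        // prints "/path/to/files/Monday/breakfast/red/apples/Monday-breakfast-red-apples-0000"
        System.out.println(valuePrefixedPath("/path/to/files", values, 0));
        // prints "/path/to/files/Monday/breakfast/red/apples/part-r-0000"
        System.out.println(partStylePath("/path/to/files", values, 0));
    }
}
```

In the second style the bare file name repeats across subdirectories, so only
the full path distinguishes the files. Whether that is acceptable is exactly
the open question raised above.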
> Patch to allow MultiStorage to use more than one index to generate output tree
> ------------------------------------------------------------------------------
>
> Key: PIG-3258
> URL: https://issues.apache.org/jira/browse/PIG-3258
> Project: Pig
> Issue Type: Improvement
> Components: piggybank
> Reporter: Joel Fouse
> Priority: Minor
> Labels: piggybank
> Attachments: MultiStorageMultiIndex.patch
>
>
> I have made a patch to enable MultiStorage to handle multiple tuple indexes,
> rather than only one, for generating the output directory structure. Before
> I submit it, though, I need to know if I should generate the patch from
> /contrib/piggybank/java where I've been compiling and unit testing, or back
> at the project root.