[jira] [Commented] (AVRO-1130) MapReduce Jobs can output write SortedKeyValueFiles directly

Steven Willis (JIRA) Fri, 06 Jun 2014 06:16:22 -0700

    [ https://issues.apache.org/jira/browse/AVRO-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019825#comment-14019825 ]


Steven Willis commented on AVRO-1130:
-------------------------------------

I'd very much like this as well. How do you imagine the output being 
structured? A normal {{SortedKeyValueFile}} is a single directory containing 
exactly two files, {{data}} and {{index}}. A MapReduce job with multiple 
reducers writes one part per reducer, so I wonder how this should look.

Maybe:

{noformat}
output_path/data-part-00000
output_path/data-part-00001
output_path/data-part-00002
output_path/index-part-00000
output_path/index-part-00001
output_path/index-part-00002
{noformat}

But then if you wanted to treat {{output_path}} as a {{SortedKeyValueFile}}, 
you'd have to modify the code to allow for multiple data and index files. 
Perhaps any directory containing exactly the same number of {{data*}} and 
{{index*}} files could be treated as a {{SortedKeyValueFile}} ({{SKVF}}), as 
long as the trailing portion of each {{data}} filename matches that of exactly 
one {{index}} filename.
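
The pairing rule above could be checked roughly like this. This is only a sketch: {{FlatSkvfLayout}} and {{pairedSuffixes}} are made-up names, not part of Avro, and it works on a plain list of file names rather than a Hadoop {{FileSystem}}. I'm also assuming that files starting with neither {{data}} nor {{index}} (for example {{_SUCCESS}}) are simply ignored.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not part of Avro) for the flat data/index part layout. */
public class FlatSkvfLayout {
    private static final String DATA_PREFIX = "data";
    private static final String INDEX_PREFIX = "index";

    /**
     * Returns the sorted suffixes (e.g. "-part-00000") shared by the data and
     * index files, or throws if the names do not pair up one-to-one.
     */
    public static List<String> pairedSuffixes(Collection<String> fileNames) {
        Set<String> dataSuffixes = new TreeSet<>();
        Set<String> indexSuffixes = new TreeSet<>();
        for (String name : fileNames) {
            if (name.startsWith(DATA_PREFIX)) {
                dataSuffixes.add(name.substring(DATA_PREFIX.length()));
            } else if (name.startsWith(INDEX_PREFIX)) {
                indexSuffixes.add(name.substring(INDEX_PREFIX.length()));
            }
            // Assumption: anything else (e.g. _SUCCESS) is ignored.
        }
        // Equal suffix sets imply equal counts and a one-to-one match.
        if (dataSuffixes.isEmpty() || !dataSuffixes.equals(indexSuffixes)) {
            throw new IllegalArgumentException(
                "data and index files do not pair up: data=" + dataSuffixes
                    + ", index=" + indexSuffixes);
        }
        return new ArrayList<>(dataSuffixes);
    }
}
```

Note that a plain {{data}} / {{index}} pair yields a single empty suffix, so an existing single-writer {{SKVF}} directory would still pass the same check.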

Or would something like this be better:

{noformat}
output_path/part-00000/data
output_path/part-00000/index
output_path/part-00001/data
output_path/part-00001/index
output_path/part-00002/data
output_path/part-00002/index
{noformat}

That way, each part is a {{SKVF}} and works with the existing code. But then 
you wouldn't be able to treat {{output_path}} itself as a {{SKVF}}. Maybe the 
new {{SKVFInputFormat}} would allow the input path to be either a {{SKVF}} 
directory or a directory containing {{SKVF}} directories.
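
To make the "either a {{SKVF}} directory or a directory of {{SKVF}} directories" idea concrete, here is a rough sketch of how an input path might be classified. Again, {{SkvfInputKind}} and {{classify}} are hypothetical names of my own, and the input is a list of paths relative to the input directory rather than a real Hadoop listing. I'm assuming a strict rule: each {{SKVF}} directory holds exactly {{data}} and {{index}} and nothing else.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper (not part of Avro) for the nested part-N/data layout. */
public class SkvfInputKind {
    public enum Kind { SINGLE_SKVF, DIRECTORY_OF_SKVFS, NEITHER }

    private static final Set<String> SKVF_FILES =
        new TreeSet<>(Arrays.asList("data", "index"));

    /** Classifies a listing of file paths given relative to the input path. */
    public static Kind classify(Collection<String> relativePaths) {
        Set<String> topLevelFiles = new TreeSet<>();
        Map<String, Set<String>> filesBySubdir = new TreeMap<>();
        for (String path : relativePaths) {
            int slash = path.indexOf('/');
            if (slash < 0) {
                topLevelFiles.add(path);
            } else {
                filesBySubdir
                    .computeIfAbsent(path.substring(0, slash), k -> new TreeSet<>())
                    .add(path.substring(slash + 1));
            }
        }
        if (filesBySubdir.isEmpty()) {
            // No subdirectories: must be exactly data + index.
            return topLevelFiles.equals(SKVF_FILES) ? Kind.SINGLE_SKVF : Kind.NEITHER;
        }
        if (!topLevelFiles.isEmpty()) {
            return Kind.NEITHER; // Assumption: no mixing of files and SKVF dirs.
        }
        for (Set<String> files : filesBySubdir.values()) {
            if (!files.equals(SKVF_FILES)) {
                return Kind.NEITHER;
            }
        }
        return Kind.DIRECTORY_OF_SKVFS;
    }
}
```

A real input format would probably need to be more lenient (for instance about a top-level {{_SUCCESS}} marker), which this strict version rejects.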

I think I'd lean towards the first approach myself.

> MapReduce Jobs can output write SortedKeyValueFiles directly
> ------------------------------------------------------------
>
>                 Key: AVRO-1130
>                 URL: https://issues.apache.org/jira/browse/AVRO-1130
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>    Affects Versions: 1.7.1
>            Reporter: Jeremy Lewi
>            Assignee: Harsh J
>            Priority: Minor
>
> It would be nice if MapReduce jobs could write directly to 
> SortedKeyValueFiles.
> See harsh@'s response on this thread http://goo.gl/OT1rN for some more 
> information on what needs to be done.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
