[ 
https://issues.apache.org/jira/browse/AVRO-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158139#comment-14158139
 ] 

Steven Willis commented on AVRO-570:
------------------------------------

I've been working on a more minimal patch that does away with many of the 
whitespace changes and code changes that seem unrelated or unnecessary for 
tethered Python. Would it be helpful to see that? I'm torn between how much 
I should change from the original patchset, which seemed to be thoroughly 
code reviewed, and how much change is needed to bring it up to HEAD.

I was also reading through:

https://cwiki.apache.org/confluence/display/AVRO/Using+AVRO+To+Run+Python+Map+Reduce+Jobs

And I saw that the code calls {{reduce(record, collector)}} for each {{record}} 
({{record}} includes both key and value) and then {{reduceFlush(record, 
collector)}} once for each key after all the values for the key have been sent 
to {{reduce}}. Wouldn't it be a better design to have {{reduce(key, values, 
collector)}} called once for each {{key}}, where {{values}} is an iterable over 
all the values for that key? You could even drop the collector and just have 
the user {{yield}} the records for output (same for the mapper).

I'm imagining something like:

{noformat}
class WordCount(TetherTask):
  def __init__(self, *args, **kwargs):
    map_input_schema = '{"type": "string"}'
    map_output_schema = '''{"type": "record", "name": "Pair",
      "namespace": "org.apache.avro.mapred", "fields": [
      {"name": "key", "type": "string"},
      {"name": "value", "type": "long"}]}'''
    output_schema = map_output_schema

    super(WordCount, self).__init__(
        map_input_schema, map_output_schema, output_schema)

  def map(self, text):
    for word in text.split():
      yield word, 1

    # I think this would also work:
    # return ((word, 1) for word in text.split())

  def reduce(self, word, counts):
    yield word, sum(counts)
{noformat}

This is how [luigi|https://github.com/spotify/luigi] does it.
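
To make the once-per-key idea concrete, here is a minimal standalone sketch of how the framework side might drive such a reducer. It assumes the pairs reaching the reducer are already sorted by key. The names {{run_reduce}}, {{word_count_map}} and {{word_count_reduce}} are hypothetical, invented for illustration, and are not part of the existing tether code or the patch:

```python
from itertools import groupby
from operator import itemgetter


def word_count_map(text):
    # Mapper as a generator: yield (key, value) pairs, no collector needed.
    for word in text.split():
        yield word, 1


def word_count_reduce(word, counts):
    # Called once per key; counts is an iterable over that key's values.
    yield word, sum(counts)


def run_reduce(sorted_pairs, reduce_fn):
    # Hypothetical driver. Assumes sorted_pairs is sorted by key, so that
    # groupby produces exactly one group per distinct key.
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = (value for _, value in group)
        for output in reduce_fn(key, values):
            yield output


if __name__ == "__main__":
    lines = ["a b a", "b c"]
    # Stand-in for the shuffle/sort phase between map and reduce.
    pairs = sorted(pair for line in lines for pair in word_count_map(line))
    print(list(run_reduce(pairs, word_count_reduce)))
```

With this shape the per-key bookkeeping that {{reduceFlush}} currently requires lives in the driver, and the user's reducer only sees a key and an iterable of its values.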

> python implementation of mapreduce connector
> --------------------------------------------
>
>                 Key: AVRO-570
>                 URL: https://issues.apache.org/jira/browse/AVRO-570
>             Project: Avro
>          Issue Type: New Feature
>          Components: python
>    Affects Versions: 1.7.0
>            Reporter: Doug Cutting
>            Assignee: Jeremy Lewi
>            Priority: Critical
>              Labels: hadoop
>             Fix For: 1.8.0
>
>         Attachments: AVRO-570.patch, AVRO-570.patch, AVRO-570.patch, 
> AVRO-570.patch, AVRO-570.patch, AVRO-570.patch, AVRO-570.patch, AVRO-570.patch
>
>
> AVRO-512 defines protocols for implementing mapreduce tasks.  It would be 
> good to have a Python implementation of this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
