[ 
https://issues.apache.org/jira/browse/AVRO-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Lewi updated AVRO-570:
-----------------------------

    Attachment: AVRO-570.patch

Here is an initial patch. There are a number of issues which I list below. 
There is also an example/test of how a  
python tethered job works in  
{noformat} 
lang/py/test/test_tether_word_count.py 
{noformat} 
Unfortunately, this test won't work for anyone but me because the paths for the 
avro-tools.jar is currently hard-coded. One solution might be to do string 
substitution as part of the build process like we do for some of the strings 
representing the various hand shake protocols. 
 
Here is a list of additional issues 
# Tests are a bit fragile -  for example to run the tests for avro tools you 
need to first install avro-mapred in your local repository 
# There are checkstyle violations 
# The tests for mapred fail if the hadoop version is 0.20.2-cdh3u0, but works 
for 0.20.203.0 
# Hadoop 0.20.2 won't work for this version of the patch see AVRO-852 
# The shade plugin for avro-tools interferes with my use of the assembly plugin 
to build a *-job.jar file suitable for hadoop submission (described more 
below). 
# In the python task, messages logged using logger appear to be going to the 
log for stderr not the log file for stdout 
# Combine reduceFlush and reduce (described more below) 
# Excess comments - I've commented changes to existing code with AVRO-570 to 
facilitate review. 
 
shade and assembly plugin conflict: 
For avro-tools  I modified the pom.xml file so that we build an 
avro-tools-VERSION-job.jar using the assembly plugin. The intention was to 
build a single jar that contains all of the dependent jars in the directory 
"lib" inside the jar file. Unfortunately, the shade plugin caused problems. In 
particular, before removing the shade plugin from the pom.xml file, the 
assembled avro-tools-*-job.jar would end up containing  unpacked versions of 
all the dependency jars (in addition to storing the jars in ./lib).  
 
I removed the shade plugin configuration as a temporary work around. Hopefuly 
some maven wizards out there can help me resolve this.  
 
I think the jar plugin is building two jars one without the dependencies and 
one with the dependencies unpacked. The shade plugin then seems to be renaming 
one of these jars and I think that renaming may be part of the problem. 
 
 
combining reduceFlush and reduce: 
Currently, to implement the reducer in python a user would implement the 
methods "reduce" and "reduceFlush" in a subclass of TetherTask. "reduce" is 
invoked once for each tuple in the reduce phase. "reduceFlush" is invoked once 
after the last tuple in each group.  
 
I would like to replace reduce and reduceFlush with a single reduce function 
that would take a record generator as opposed to a record as its argument. So 
the implementation would look something like 
{noformat} 
def reduce(self, recgen, collector): 
    for  rec in recgen: 
          … process each record in this group 
 
     ... finalize this group... 
{noformat} 
 
The way I'm thinking of implementing this is by having reduce run in a separate 
thread from the server. The generator, recgen, would then suspend the thread 
when no records were available. When the server recieves a record from the 
parent process, it would wake up the thread running reduce. 

I've been working off of a public fork checked into bitbucket
https://bitbucket.org/jlewi/avro




> python implementation of mapreduce connector
> --------------------------------------------
>
>                 Key: AVRO-570
>                 URL: https://issues.apache.org/jira/browse/AVRO-570
>             Project: Avro
>          Issue Type: New Feature
>          Components: python
>            Reporter: Doug Cutting
>            Priority: Critical
>         Attachments: AVRO-570.patch
>
>
> AVRO-512 defines protocols for implementing mapreduce tasks.  It would be 
> good to have a Python implementation of this.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to