[ https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Drew Farris updated MAHOUT-274:
-------------------------------

    Attachment: mahout-avro-examples.tar.bz

Status update with a new tarball containing a Maven project (mvn clean install should do the trick). README.txt is included; the relevant portions appear below:

Provided are two versions of AvroInputFormat/AvroOutputFormat, compatible with the mapred (pre-0.20) and mapreduce (0.20+) APIs. They are based on code provided as part of MAPREDUCE-815 and other patches. Also provided are backports of the SerializationBase/AvroSerialization classes from the current hadoop-core trunk.

When writing a job using the pre-0.20 APIs:

Add the Avro serializations:
{code}
conf.setStrings("io.serializations", new String[] {
    WritableSerialization.class.getName(),
    AvroSpecificSerialization.class.getName(),
    AvroReflectSerialization.class.getName(),
    AvroGenericSerialization.class.getName() });
{code}

Set up the input and output formats:
{code}
conf.setInputFormat(AvroInputFormat.class);
conf.setOutputFormat(AvroOutputFormat.class);
AvroInputFormat.setAvroInputClass(conf, AvroDocument.class);
AvroOutputFormat.setAvroOutputClass(conf, AvroDocument.class);
{code}

AvroInputFormat provides the specified class as the key and a LongWritable file offset as the value. AvroOutputFormat expects the specified class as the key and a NullWritable as the value.

If an Avro-serializable class is passed between the map and reduce phases, it is also necessary to set the following:
{code}
AvroComparator.setSchema(AvroDocument._SCHEMA);
conf.setClass("mapred.output.key.comparator.class",
    AvroComparator.class, RawComparator.class);
{code}

So far I've been using Avro 'specific' serialization, which compiles an Avro schema into a Java class; see src/main/schemata/org/apache/mahout/avro/AvroDocument.avsc.
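Putting the snippets above together, a complete pre-0.20 job driver would look roughly like the following. This is an illustrative sketch, not code from the tarball: the driver class name, mapper/reducer classes (MyAvroMapper/MyAvroReducer), and the argument-derived paths are hypothetical placeholders, and it assumes the backported Avro classes described above are on the classpath (their imports are omitted):
{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.serializer.WritableSerialization;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class AvroDocumentJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(AvroDocumentJobDriver.class);

    // Register the Avro serializations alongside the default Writable one.
    conf.setStrings("io.serializations", new String[] {
        WritableSerialization.class.getName(),
        AvroSpecificSerialization.class.getName(),
        AvroReflectSerialization.class.getName(),
        AvroGenericSerialization.class.getName() });

    // Wire up the Avro input/output formats for the compiled AvroDocument class.
    conf.setInputFormat(AvroInputFormat.class);
    conf.setOutputFormat(AvroOutputFormat.class);
    AvroInputFormat.setAvroInputClass(conf, AvroDocument.class);
    AvroOutputFormat.setAvroOutputClass(conf, AvroDocument.class);

    // Needed only when an Avro-serializable class crosses the map/reduce boundary.
    AvroComparator.setSchema(AvroDocument._SCHEMA);
    conf.setClass("mapred.output.key.comparator.class",
        AvroComparator.class, RawComparator.class);

    // Hypothetical mapper/reducer and paths -- substitute your own.
    conf.setMapperClass(MyAvroMapper.class);
    conf.setReducerClass(MyAvroReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
{code}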
This is currently compiled into the classes o.a.m.avro.document.(AvroDocument|AvroField) using o.a.m.avro.util.AvroDocumentCompiler (eventually to be replaced by a Maven plugin; generated sources are currently checked in). Helper classes for AvroDocument and AvroField include o.a.m.avro.document.Avro(Document|Field)Builder and o.a.m.avro(Document|Field)Reader.

This seems to work ok here, but I'm not certain that this is the best pattern to use, especially when there are many pre-existing classes (such as in the case of Vector). Avro also provides reflection-based serialization and schema-based serialization; both should be supported by the infrastructure that has been backported here, but that's something else to explore.

Examples: these are quick and dirty and need much cleanup work before they can be taken out to the dance. See o.a.m.avro.text, o.a.m.avro.text.mapred and o.a.m.avro.text.mapreduce:

* AvroDocumentsFromDirectory: quick and dirty port of SequenceFilesFromDirectory to use AvroDocuments. Writes a file containing documents in Avro format; the file content is stored in a single field named 'content', in the originalText portion of that field.
* AvroDocumentsDumper: dumps an Avro documents file to standard output.
* AvroDocumentsWordCount: performs a wordcount on an Avro document input file.
* AvroDocumentProcessor: tokenizes the text found in the input document file, reading from the originalText of the field named 'content', and writes the original document plus tokens to the output file.
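To make the document model concrete, a schema along these lines could drive the specific-serialization compiler. This is an illustrative sketch, not the actual contents of AvroDocument.avsc from the tarball; apart from the 'originalText' field and the nested-field structure described above, the field names and types are assumptions:
{code}
{
  "type": "record",
  "name": "AvroDocument",
  "namespace": "org.apache.mahout.avro.document",
  "fields": [
    {"name": "fields", "type": {"type": "array", "items": {
      "type": "record",
      "name": "AvroField",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "originalText", "type": "string"},
        {"name": "tokens", "type": {"type": "array", "items": "string"}}
      ]
    }}}
  ]
}
{code}
Compiling a schema like this yields the AvroDocument and AvroField classes referenced throughout the job-configuration snippets above.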
Running the examples (haven't tested with the hadoop driver yet):
{code}
mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsFromDirectory \
  -Dexec.args='--parent /home/drew/mahout/20news-18828 \
  --outputDir /home/drew/mahout/20news-18828-example \
  --charset UTF-8'

mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.mapred.AvroDocumentProcessor \
  -Dexec.args='/home/drew/mahout/20news-18828-example /home/drew/mahout/20news-18828-processed'

mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsDumper \
  -Dexec.args='/home/drew/mahout/20news-18828-processed/.avro-r-00000' > foobar.txt
{code}

The Wikipedia stuff is in there, but isn't working yet.

Many thanks (and apologies) to Robin for providing the starting point for much of this code, and for my hacking it to pieces so badly.

> Use avro for serialization of structured documents.
> ---------------------------------------------------
>
>                 Key: MAHOUT-274
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-274
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Drew Farris
>            Priority: Minor
>         Attachments: mahout-avro-examples.tar.bz, mahout-avro-examples.tar.gz
>
>
> Explore the intersection between Writables and Avro to see how serialization
> can be improved within Mahout.
> An intermediate goal is to provide a structured document format that can be
> serialized using Avro as an Input/OutputFormat and Writable

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.