Hi, that looks like cool stuff! Does it support arbitrary avro-serializable types?
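For instance, would a plain class with no generated bindings work via the reflect serialization mentioned in the README below? Something like this, a hypothetical class just to make the question concrete:

{code}
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

// A plain Java class with no generated Avro bindings -- the kind of
// "arbitrary" type I mean. Purely illustrative.
public class TokenCount {
  public String token;
  public int count;

  public static void main(String[] args) {
    // Avro's reflect support derives a schema from the class itself, so no
    // .avsc file or compiled AvroDocument-style class would be needed.
    Schema schema = ReflectData.get().getSchema(TokenCount.class);
    System.out.println(schema.toString(true));
  }
}
{code}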
Thanks,
Markus

On Mon, Feb 15, 2010 at 7:54 PM, Drew Farris (JIRA) <j...@apache.org> wrote:

> [ https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Drew Farris updated MAHOUT-274:
> -------------------------------
>
>     Attachment: mahout-avro-examples.tar.bz
>
> Status update w/ a new tarball containing a maven project (mvn clean install should do the trick).
>
> README.txt is included; the relevant portions follow:
>
> Provided are two different versions of AvroInputFormat/AvroOutputFormat that are compatible with the mapred (pre-0.20) and mapreduce (0.20+) apis. They are based on code provided as part of MAPREDUCE-815 and other patches. Also provided are backports of the SerializationBase/AvroSerialization classes from the current hadoop-core trunk.
>
> When writing a job using the pre-0.20 apis:
>
> Add the serializations:
>
> {code}
> conf.setStrings("io.serializations",
>     new String[] {
>         WritableSerialization.class.getName(),
>         AvroSpecificSerialization.class.getName(),
>         AvroReflectSerialization.class.getName(),
>         AvroGenericSerialization.class.getName()
>     });
> {code}
>
> Set up the input and output formats:
>
> {code}
> conf.setInputFormat(AvroInputFormat.class);
> conf.setOutputFormat(AvroOutputFormat.class);
>
> AvroInputFormat.setAvroInputClass(conf, AvroDocument.class);
> AvroOutputFormat.setAvroOutputClass(conf, AvroDocument.class);
> {code}
>
> AvroInputFormat provides the specified class as the key and a LongWritable file offset as the value. AvroOutputFormat expects the specified class as the key and a NullWritable as the value.
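If I'm reading that contract right, a pre-0.20 mapper would be parameterized roughly as follows. This is an untested sketch on my part, reusing the AvroDocument type from the examples:

{code}
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.mahout.avro.document.AvroDocument;

// Identity mapper matching the contracts above: AvroInputFormat supplies
// (AvroDocument key, LongWritable offset) pairs, and AvroOutputFormat
// consumes (AvroDocument key, NullWritable value) pairs.
public class AvroDocumentIdentityMapper extends MapReduceBase
    implements Mapper<AvroDocument, LongWritable, AvroDocument, NullWritable> {

  public void map(AvroDocument doc, LongWritable offset,
      OutputCollector<AvroDocument, NullWritable> output, Reporter reporter)
      throws IOException {
    // Pass each document through unchanged; a real job would transform it here.
    output.collect(doc, NullWritable.get());
  }
}
{code}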
> If an avro-serializable class is passed between the map and reduce phases, it is necessary to set the following:
>
> {code}
> AvroComparator.setSchema(AvroDocument._SCHEMA);
> conf.setClass("mapred.output.key.comparator.class",
>     AvroComparator.class, RawComparator.class);
> {code}
>
> So far I've been using avro 'specific' serialization, which compiles an avro schema into a Java class; see src/main/schemata/org/apache/mahout/avro/AvroDocument.avsc. This is currently compiled into the classes o.a.m.avro.document.(AvroDocument|AvroField) using o.a.m.avro.util.AvroDocumentCompiler (eventually to be replaced by a maven plugin; the generated sources are currently checked in).
>
> Helper classes for AvroDocument and AvroField include o.a.m.avro.document.Avro(Document|Field)Builder and o.a.m.avro(Document|Field)Reader. This seems to work ok here, but I'm not certain it is the best pattern to use, especially when there are many pre-existing classes (such as there are in the case of Vector).
>
> Avro also provides reflection-based serialization and schema-based serialization; both should be supported by the infrastructure backported here, but that's something else to explore.
>
> Examples:
>
> These are quick and dirty and need much cleanup work before they can be taken out to the dance.
>
> See o.a.m.avro.text, o.a.m.avro.text.mapred and o.a.m.avro.text.mapreduce:
>
> * AvroDocumentsFromDirectory: a quick and dirty port of SequenceFilesFromDirectory to use AvroDocuments. Writes a file containing documents in avro format; each source file's contents are stored in a single field named 'content', with the text kept in the originalText portion of that field.
> * AvroDocumentsDumper: dumps an avro documents file to standard output.
> * AvroDocumentsWordCount: performs a wordcount on an avro document input file.
> * AvroDocumentProcessor: tokenizes the text found in the input document file, reading from the originalText of the field named 'content', and writes the original document plus tokens to the output file.
>
> Running the examples:
>
> (I haven't tested with the hadoop driver yet.)
>
> {code}
> mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsFromDirectory \
>   -Dexec.args='--parent /home/drew/mahout/20news-18828 \
>   --outputDir /home/drew/mahout/20news-18828-example \
>   --charset UTF-8'
>
> mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.mapred.AvroDocumentProcessor \
>   -Dexec.args='/home/drew/mahout/20news-18828-example /home/drew/mahout/20news-18828-processed'
>
> mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsDumper \
>   -Dexec.args='/home/drew/mahout/20news-18828-processed/.avro-r-00000' > foobar.txt
> {code}
>
> The Wikipedia stuff is in there, but it isn't working yet. Many thanks (and apologies) to Robin for providing the starting point for much of this code, and for my hacking it to pieces so badly.
>
> > Use avro for serialization of structured documents.
> > ---------------------------------------------------
> >
> >                 Key: MAHOUT-274
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-274
> >             Project: Mahout
> >          Issue Type: Improvement
> >            Reporter: Drew Farris
> >            Priority: Minor
> >         Attachments: mahout-avro-examples.tar.bz, mahout-avro-examples.tar.gz
> >
> > Explore the intersection between Writables and Avro to see how serialization can be improved within Mahout.
> > An intermediate goal is to provide a structured document format that can be serialized using Avro as an Input/OutputFormat and Writable.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
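One more question after reading through the README: it only shows setup for the pre-0.20 api. For the 0.20+ (mapreduce) variants that are also in the tarball, I'd guess the wiring looks roughly like the sketch below. The package name and the assumption that the static setters mirror the mapred versions are guesses on my part, not something pulled from the tarball:

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.avro.document.AvroDocument;
// Hypothetical package for the 0.20+ format variants; the actual location
// in the tarball may differ.
import org.apache.mahout.avro.mapreduce.AvroInputFormat;
import org.apache.mahout.avro.mapreduce.AvroOutputFormat;

public class AvroJobSetup {
  public static Job buildJob(Configuration conf, Path in, Path out) throws IOException {
    // Register the same io.serializations entries as in the pre-0.20
    // snippet above before constructing the job.
    Job job = new Job(conf, "avro-document-job");
    job.setJarByClass(AvroJobSetup.class);
    job.setInputFormatClass(AvroInputFormat.class);
    job.setOutputFormatClass(AvroOutputFormat.class);
    FileInputFormat.addInputPath(job, in);
    FileOutputFormat.setOutputPath(job, out);
    // Assuming the static setters mirror the mapred versions shown above:
    AvroInputFormat.setAvroInputClass(job.getConfiguration(), AvroDocument.class);
    AvroOutputFormat.setAvroOutputClass(job.getConfiguration(), AvroDocument.class);
    return job;
  }
}
{code}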