Hi, that looks like cool stuff! Does it support arbitrary avro-serializable types?
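For instance, would a plain class with no generated bindings work via the reflect serialization mentioned in the README below? Something like this, a hypothetical class just to make the question concrete:

{code}
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

// A plain Java class with no generated Avro bindings -- the kind of
// "arbitrary" type I mean. Purely illustrative.
public class TokenCount {
  public String token;
  public int count;

  public static void main(String[] args) {
    // Avro's reflect support derives a schema from the class itself, so no
    // .avsc file or compiled AvroDocument-style class would be needed.
    Schema schema = ReflectData.get().getSchema(TokenCount.class);
    System.out.println(schema.toString(true));
  }
}
{code}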
Thanks,
Markus

On Mon, Feb 15, 2010 at 7:54 PM, Drew Farris (JIRA) <j...@apache.org> wrote:

> [ https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Drew Farris updated MAHOUT-274:
> -------------------------------
>
>     Attachment: mahout-avro-examples.tar.bz
>
> Status update w/ a new tarball containing a maven project (mvn clean install should do the trick).
>
> README.txt is included; the relevant portions follow:
>
> Provided are two different versions of AvroInputFormat/AvroOutputFormat that are compatible with the mapred (pre-0.20) and mapreduce (0.20+) apis. They are based on code provided as part of MAPREDUCE-815 and other patches. Also provided are backports of the SerializationBase/AvroSerialization classes from the current hadoop-core trunk.
>
> When writing a job using the pre-0.20 apis:
>
> Add the serializations:
>
> {code}
> conf.setStrings("io.serializations",
>     new String[] {
>         WritableSerialization.class.getName(),
>         AvroSpecificSerialization.class.getName(),
>         AvroReflectSerialization.class.getName(),
>         AvroGenericSerialization.class.getName()
>     });
> {code}
>
> Set up the input and output formats:
>
> {code}
> conf.setInputFormat(AvroInputFormat.class);
> conf.setOutputFormat(AvroOutputFormat.class);
>
> AvroInputFormat.setAvroInputClass(conf, AvroDocument.class);
> AvroOutputFormat.setAvroOutputClass(conf, AvroDocument.class);
> {code}
>
> AvroInputFormat provides the specified class as the key and a LongWritable file offset as the value. AvroOutputFormat expects the specified class as the key and a NullWritable as the value.
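If I'm reading that contract right, a pre-0.20 mapper would be parameterized roughly as follows. This is an untested sketch on my part, reusing the AvroDocument type from the examples:

{code}
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.mahout.avro.document.AvroDocument;

// Identity mapper matching the contracts above: AvroInputFormat supplies
// (AvroDocument key, LongWritable offset) pairs, and AvroOutputFormat
// consumes (AvroDocument key, NullWritable value) pairs.
public class AvroDocumentIdentityMapper extends MapReduceBase
    implements Mapper<AvroDocument, LongWritable, AvroDocument, NullWritable> {

  public void map(AvroDocument doc, LongWritable offset,
      OutputCollector<AvroDocument, NullWritable> output, Reporter reporter)
      throws IOException {
    // Pass each document through unchanged; a real job would transform it here.
    output.collect(doc, NullWritable.get());
  }
}
{code}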
> If an avro-serializable class is passed between the map and reduce phases, it is necessary to set the following:
>
> {code}
> AvroComparator.setSchema(AvroDocument._SCHEMA);
> conf.setClass("mapred.output.key.comparator.class",
>     AvroComparator.class, RawComparator.class);
> {code}
>
> So far I've been using avro 'specific' serialization, which compiles an avro schema into a Java class; see src/main/schemata/org/apache/mahout/avro/AvroDocument.avsc. This is currently compiled into the classes o.a.m.avro.document.(AvroDocument|AvroField) using o.a.m.avro.util.AvroDocumentCompiler (eventually to be replaced by a maven plugin; the generated sources are currently checked in).
>
> Helper classes for AvroDocument and AvroField include o.a.m.avro.document.Avro(Document|Field)Builder and o.a.m.avro(Document|Field)Reader. This seems to work ok here, but I'm not certain it is the best pattern to use, especially when there are many pre-existing classes (such as there are in the case of Vector).
>
> Avro also provides reflection-based serialization and schema-based serialization; both should be supported by the infrastructure backported here, but that's something else to explore.
>
> Examples:
>
> These are quick and dirty and need much cleanup work before they can be taken out to the dance.
>
> See o.a.m.avro.text, o.a.m.avro.text.mapred and o.a.m.avro.text.mapreduce:
>
> * AvroDocumentsFromDirectory: a quick and dirty port of SequenceFilesFromDirectory to use AvroDocuments. Writes a file containing documents in avro format; each source file's contents are stored in a single field named 'content', with the text kept in the originalText portion of that field.
> * AvroDocumentsDumper: dumps an avro documents file to standard output.
> * AvroDocumentsWordCount: performs a wordcount on an avro document input file.
> * AvroDocumentProcessor: tokenizes the text found in the input document file, reading from the originalText of the field named 'content', and writes the original document plus tokens to the output file.
>
> Running the examples:
>
> (I haven't tested with the hadoop driver yet.)
>
> {code}
> mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsFromDirectory \
>   -Dexec.args='--parent /home/drew/mahout/20news-18828 \
>   --outputDir /home/drew/mahout/20news-18828-example \
>   --charset UTF-8'
>
> mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.mapred.AvroDocumentProcessor \
>   -Dexec.args='/home/drew/mahout/20news-18828-example /home/drew/mahout/20news-18828-processed'
>
> mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsDumper \
>   -Dexec.args='/home/drew/mahout/20news-18828-processed/.avro-r-00000' > foobar.txt
> {code}
>
> The Wikipedia stuff is in there, but it isn't working yet. Many thanks (and apologies) to Robin for providing the starting point for much of this code, and for my hacking it to pieces so badly.
>
> > Use avro for serialization of structured documents.
> > ---------------------------------------------------
> >
> >                 Key: MAHOUT-274
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-274
> >             Project: Mahout
> >          Issue Type: Improvement
> >            Reporter: Drew Farris
> >            Priority: Minor
> >         Attachments: mahout-avro-examples.tar.bz, mahout-avro-examples.tar.gz
> >
> > Explore the intersection between Writables and Avro to see how serialization can be improved within Mahout.
> > An intermediate goal is to provide a structured document format that can be serialized using Avro as an Input/OutputFormat and Writable.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
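One more question after reading through the README: it only shows setup for the pre-0.20 api. For the 0.20+ (mapreduce) variants that are also in the tarball, I'd guess the wiring looks roughly like the sketch below. The package name and the assumption that the static setters mirror the mapred versions are guesses on my part, not something pulled from the tarball:

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.avro.document.AvroDocument;
// Hypothetical package for the 0.20+ format variants; the actual location
// in the tarball may differ.
import org.apache.mahout.avro.mapreduce.AvroInputFormat;
import org.apache.mahout.avro.mapreduce.AvroOutputFormat;

public class AvroJobSetup {
  public static Job buildJob(Configuration conf, Path in, Path out) throws IOException {
    // Register the same io.serializations entries as in the pre-0.20
    // snippet above before constructing the job.
    Job job = new Job(conf, "avro-document-job");
    job.setJarByClass(AvroJobSetup.class);
    job.setInputFormatClass(AvroInputFormat.class);
    job.setOutputFormatClass(AvroOutputFormat.class);
    FileInputFormat.addInputPath(job, in);
    FileOutputFormat.setOutputPath(job, out);
    // Assuming the static setters mirror the mapred versions shown above:
    AvroInputFormat.setAvroInputClass(job.getConfiguration(), AvroDocument.class);
    AvroOutputFormat.setAvroOutputClass(job.getConfiguration(), AvroDocument.class);
    return job;
  }
}
{code}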