Putting data objects in the Configuration is a bit of a misuse (it has been the subject of an argument on the Hadoop mailing lists for a long time now).
I would leave this use in place for now and later refactor to read from HDFS. That has more legs in any case when it comes to using the clustering on new data without retraining.

On Sun, Jan 16, 2011 at 2:59 AM, Sean Owen (JIRA) <[email protected]> wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982278#action_12982278]
>
> Sean Owen commented on MAHOUT-510:
> ----------------------------------
>
> (BTW I'm not committing this for some time.)
>
> I've managed to take out almost all the usages. The only real usage of it
> is in the dirichlet implementation, which uses it to serialize a
> ModelDistribution and pass it as a string to Hadoop workers via the
> Configuration object.
>
> Now, per the issue description, we could re-do serialization here to use
> Writable. That's not hard and makes it possible to write these things out to
> HDFS later in a more Hadoop-ish way later. But that gives you a
> serialization to bytes, not String. I could Base64-encode it; it's not huge.
>
> That's starting to get a little weird. Is the better answer to look at
> writing the ModelDistribution to HDFS? or just leave this use of JSON?
>
> > Standardize serialization mechanisms
> > ------------------------------------
> >
> > Key: MAHOUT-510
> > URL: https://issues.apache.org/jira/browse/MAHOUT-510
> > Project: Mahout
> > Issue Type: Task
> > Affects Versions: 0.4
> > Reporter: Sean Owen
> > Fix For: 0.5
> >
> > Attachments: MAHOUT-510.patch
> >
> >
> > At the moment this is tracking a broader concern: to standardize as much
> > as possible how we approach serialization. The long-term goal is notionally
> > to use the following "encodings" as the input/output of Mahout stuff, and by
> > extension, probably internally too.
> > - Text
> > - Vector Writable
> > - (maybe Avro)
> > not
> > - Serializable
> > - GSON / JSON
>
> --
> This message is automatically generated by JIRA.
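P.S. For reference, the Base64 route mentioned in the quoted comment could look roughly like the sketch below. It is only an illustration under stated assumptions: SimpleWritable is a hypothetical stand-in with the same shape as Hadoop's Writable, ToyModel is a made-up two-field model (not Mahout's ModelDistribution), the key "toy.model" is invented, and a plain Map plays the part of the Configuration so that the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class Base64ConfSketch {

  // Hypothetical stand-in that mirrors the shape of Hadoop's Writable.
  interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  // Made-up model with two parameters; NOT Mahout's ModelDistribution.
  static class ToyModel implements SimpleWritable {
    double mean;
    double stdDev;

    ToyModel() {}

    ToyModel(double mean, double stdDev) {
      this.mean = mean;
      this.stdDev = stdDev;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeDouble(mean);
      out.writeDouble(stdDev);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      mean = in.readDouble();
      stdDev = in.readDouble();
    }
  }

  // Serialize to bytes, then Base64-encode so the result fits in a String value.
  static String encode(SimpleWritable w) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      w.write(out);
    }
    return Base64.getEncoder().encodeToString(bytes.toByteArray());
  }

  // Reverse of encode(): Base64-decode, then let the target read its own fields.
  static void decode(String encoded, SimpleWritable target) throws IOException {
    byte[] raw = Base64.getDecoder().decode(encoded);
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
      target.readFields(in);
    }
  }

  public static void main(String[] args) throws IOException {
    // A plain Map stands in for the job Configuration (assumption).
    Map<String, String> conf = new HashMap<>();
    conf.put("toy.model", encode(new ToyModel(1.5, 0.25)));

    // What a worker would do on its side: read the string back and rebuild the model.
    ToyModel restored = new ToyModel();
    decode(conf.get("toy.model"), restored);
    System.out.println(restored.mean + " " + restored.stdDev);
  }
}
```

This keeps the "pass it as a string" shape of the current code while moving the payload to a Writable-style byte format. It does not address the size concern or the broader point that the Configuration is not really meant for data objects, which is why reading from HDFS still looks like the better long-term option.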
