[ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806873#action_12806873 ]
Jeff Eastman commented on MAHOUT-270:
-------------------------------------

In the beginning, vectors, canopies and clusters needed a serialization mechanism and asFormatString() was invented. Also invented but not expressed in interfaces were their deserialization counterparts, static methods decodeCanopy(), decodeCluster() and decodeVector(). These ad-hoc encodings worked adequately for a time but were soon replaced by standard Json encodings as newer entities embodied more complicated state and the ad-hoc methods became unworkable.

Shortly after that, the quest for speed (and improvements in Hadoop support) led to the adoption of Writable encodings and SequenceFiles by all Mahout entities. Of course, binary encodings are impossible to use for debugging so some clustering entities use asFormatString() as their toString() implementations and also as a human-readable option for final output.

As more kinds of clustering were implemented some refactoring was indicated and ClusterBase was invented to abstract out the center and centroid calculations common among them. Then came Dirichlet which has no notion of centers, nor centroid calculations so it makes little sense to generalize them under ClusterBase. DirichletClusters have only a domain-specific Model and totalCount and these are serialized/deserialized entirely using Writable (asFormatString() only prints the model's toString() output and there is no decode() static method).

Even more recently, users doing text clustering needed better sparse vector implementations and utilities for working with term vectors. ClusterDumper and VectorHelper utilities were added to meet these needs. ClusterDumper can output either a Json encoding of the center of a cluster or a VectorHelper.vectorToString() representation which can include a term dictionary to make the output more human-readable.

It should now be obvious to all that making ClusterDumper dump DirichletClusters too will take some serious refactoring.
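To make the two kinds of encoding in that history concrete, here is a minimal sketch of an entity that carries both a Writable-style binary form and an asFormatString()-style text form. It is an illustration only: SimpleCluster, toBytes() and fromBytes() are hypothetical names, not Mahout classes, and the write()/readFields() pair only mirrors the shape of Hadoop's Writable contract using plain java.io so that it runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical stand-in for a clustering entity; NOT a Mahout class.
class SimpleCluster {
    private int id;
    private double[] center = new double[0];

    // No-arg constructor so an empty instance can be filled by readFields().
    SimpleCluster() { }

    SimpleCluster(int id, double[] center) {
        this.id = id;
        this.center = center.clone();
    }

    // Binary encoding; same shape as Writable.write(DataOutput).
    void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeInt(center.length);
        for (double v : center) {
            out.writeDouble(v);
        }
    }

    // Binary decoding; same shape as Writable.readFields(DataInput).
    void readFields(DataInput in) throws IOException {
        id = in.readInt();
        center = new double[in.readInt()];
        for (int i = 0; i < center.length; i++) {
            center[i] = in.readDouble();
        }
    }

    // Human-readable, output-only form (assumed format, for illustration).
    String asFormatString() {
        return "C" + id + ": " + Arrays.toString(center);
    }

    @Override
    public String toString() {
        return asFormatString();
    }

    // Helper: serialize to a byte array using the binary encoding.
    static byte[] toBytes(SimpleCluster c) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            c.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper: rebuild an instance from the binary encoding.
    static SimpleCluster fromBytes(byte[] bytes) {
        try {
            SimpleCluster c = new SimpleCluster();
            c.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SimpleCluster original = new SimpleCluster(3, new double[] {1.5, 2.0});
        SimpleCluster copy = fromBytes(toBytes(original));
        System.out.println(copy); // prints "C3: [1.5, 2.0]"
    }
}
```

In this sketch the binary form round-trips the full state while the text form is output-only, which is roughly the split the Dirichlet clusters already have. An entity without a center would need its own text form, and that is where a dumper written around centers stops working.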
I have some thoughts about how to accomplish that, but it seems to be a good time to revisit the user requirements so we do not perpetuate unnecessary or obsolete stuff. Could I get some comments on the following requirements?

1. We need an efficient, binary encoding for serialization and deserialization. (I take this as a given and that Writable is it, but feel free to disagree)
2. We need a Json encoding for serialization and deserialization.
3. We need a complete, human-readable encoding for output only. (Json qualifies here)
4. We need a human-readable encoding for output only. (Json qualifies here too but others may be more usable)
5. We need a human-readable toString() encoding for debugging only.

> Make ClusterDumper dump Dirichlet clusters too
> ----------------------------------------------
>
> Key: MAHOUT-270
> URL: https://issues.apache.org/jira/browse/MAHOUT-270
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.2
> Reporter: Jeff Eastman
> Assignee: Jeff Eastman
>
> Given the binary representation of models/clusters in Dirichlet, extend the
> ClusterDumper utility to dump out a printable representation of them too.