[ 
https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806873#action_12806873
 ] 

Jeff Eastman commented on MAHOUT-270:
-------------------------------------

In the beginning, vectors, canopies and clusters needed a serialization 
mechanism and asFormatString() was invented. Also invented but not expressed in 
interfaces were their deserialization counterparts, static methods 
decodeCanopy(), decodeCluster() and decodeVector(). These ad-hoc encodings 
worked adequately for a time but were soon replaced by standard Json encodings 
as newer entities embodied more complicated state and the ad-hoc methods became 
unworkable. Shortly after that, the quest for speed (and improvements in Hadoop 
support) led to the adoption of Writable encodings and SequenceFiles by all 
Mahout entities.

Of course, binary encodings are impossible to use for debugging, so some 
clustering entities use asFormatString() as their toString() implementations 
and also as a human-readable option for final output. As more kinds of 
clustering were implemented some refactoring was indicated and ClusterBase was 
invented to abstract out the center and centroid calculations common among 
them. Then came Dirichlet, which has no notion of centers or centroid 
calculations, so it makes little sense to generalize them under ClusterBase. 
DirichletClusters have only a domain-specific Model and totalCount and these 
are serialized/deserialized entirely using Writable (asFormatString() only 
prints the model's toString() output and there is no decode() static method).

Even more recently, users doing text clustering needed better sparse vector 
implementations and utilities for working with term vectors. ClusterDumper and 
VectorHelper utilities were added to meet these needs. ClusterDumper can output 
either a Json encoding of the center of a cluster or a 
VectorHelper.vectorToString() representation which can include a term 
dictionary to make the output more human-readable.
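
As a rough illustration of the dictionary idea, here is a minimal sketch in 
plain Java. It is not the VectorHelper code: the class TermVectorFormatter, 
its format() method and the "{term:weight}" layout are all hypothetical, and 
the sparse vector is assumed to be a simple index-to-weight map.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: prints a sparse vector and,
// when a term dictionary is supplied, shows terms instead of indices.
public class TermVectorFormatter {

  // vector: index -> weight; dictionary: index -> term (may be null).
  public static String format(Map<Integer, Double> vector, String[] dictionary) {
    StringJoiner joiner = new StringJoiner(", ", "{", "}");
    // Sort by index so the output is deterministic.
    for (Map.Entry<Integer, Double> e : new TreeMap<>(vector).entrySet()) {
      int index = e.getKey();
      boolean known = dictionary != null
          && index >= 0
          && index < dictionary.length
          && dictionary[index] != null;
      // Fall back to the numeric index when no term is available.
      String label = known ? dictionary[index] : String.valueOf(index);
      joiner.add(label + ":" + e.getValue());
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    Map<Integer, Double> vector = new TreeMap<>();
    vector.put(0, 1.0);
    vector.put(3, 2.5);
    String[] dictionary = {"apple", "pear", "plum", "fig"};
    System.out.println(format(vector, dictionary)); // {apple:1.0, fig:2.5}
    System.out.println(format(vector, null));       // {0:1.0, 3:2.5}
  }
}
```

The same vector is readable with or without the dictionary; the dictionary 
only changes the labels.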

It should now be obvious to all that making ClusterDumper dump 
DirichletClusters too will take some serious refactoring. I have some thoughts 
about how to accomplish that, but it seems to be a good time to revisit the 
user requirements so we do not perpetuate unnecessary or obsolete stuff. Could 
I get some comments on the following requirements?

1. We need an efficient, binary encoding for serialization and deserialization. 
(I take this as a given and that Writable is it, but feel free to disagree)
2. We need a Json encoding for serialization and deserialization. 
3. We need a complete, human-readable encoding for output only. (Json qualifies 
here)
4. We need a human-readable encoding for output only. (Json qualifies here too 
but others may be more usable)
5. We need a human-readable toString() encoding for debugging only.
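
To make the distinction between requirements 1 and 5 concrete, here is a 
minimal sketch using only java.io. SketchCluster is a hypothetical stand-in, 
not a Mahout class. Its write() and readFields() methods have the same shape 
as those of Hadoop's Writable interface, but the class does not implement 
Writable, so the example runs without Hadoop. The field layout and the 
toString() format are assumptions chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical stand-in for a clustering entity; not a Mahout class.
public class SketchCluster {
  private int id;
  private double[] center;

  public SketchCluster() {
    this(0, new double[0]);
  }

  public SketchCluster(int id, double[] center) {
    this.id = id;
    this.center = center.clone();
  }

  // Requirement 1: compact binary encoding (Writable-style method shape).
  public void write(DataOutput out) throws IOException {
    out.writeInt(id);
    out.writeInt(center.length);
    for (double v : center) {
      out.writeDouble(v);
    }
  }

  // Requirement 1: the matching binary decoder.
  public void readFields(DataInput in) throws IOException {
    id = in.readInt();
    center = new double[in.readInt()];
    for (int i = 0; i < center.length; i++) {
      center[i] = in.readDouble();
    }
  }

  // Requirement 5: output-only, human-readable form for debugging.
  @Override
  public String toString() {
    return "C" + id + Arrays.toString(center);
  }

  // Convenience wrappers so the round trip can be tried in memory.
  public byte[] toBytes() {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      write(new DataOutputStream(buffer));
      return buffer.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static SketchCluster fromBytes(byte[] bytes) {
    try {
      SketchCluster cluster = new SketchCluster();
      cluster.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
      return cluster;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    SketchCluster original = new SketchCluster(7, new double[] {1.5, 2.0});
    SketchCluster copy = fromBytes(original.toBytes());
    System.out.println(copy); // C7[1.5, 2.0]
  }
}
```

In this sketch only the binary form has a decoder; toString() is one-way, 
which is the split that requirements 3 to 5 describe.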

> Make ClusterDumper dump Dirichlet clusters too
> ----------------------------------------------
>
>                 Key: MAHOUT-270
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-270
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.2
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>
> Given the binary representation of models/clusters in Dirichlet, extend the 
> ClusterDumper utility to dump out a printable representation of them too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.