Hi Delroy,
You did not say whether you are using 0.3 or trunk. I suggest trunk, since it
has recently been better integrated with Dirichlet. Looking at your code
fragment and comparing it with what the ClusterDumper is (now) doing:
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Writable key = (Writable) reader.getKeyClass().newInstance();
    Writable value = (Writable) reader.getValueClass().newInstance();
    while (reader.next(key, value)) {
        Cluster cluster = (Cluster) value;
        String fmtStr = useJSON ? cluster.asJsonString()
                                : cluster.asFormatString(dictionary);
... it kinda looks like you are not actually reading in the clusters and
their models; you are only creating a new, empty instance of
DirichletCluster (the value class). That alone will not read in the model
or any of the cluster state, which is why you see a null model. You should
be able to just run the ClusterDumper by pointing it at your cluster
directory, as in TestClusterDumper.testDirichlet.
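
To illustrate why a freshly created instance has no model, here is a minimal
sketch that uses only the JDK. ToyCluster, its fields, and the helper methods
are hypothetical stand-ins for illustration, not Mahout's DirichletCluster or
Hadoop's Writable. The fields stay empty until readFields() is fed the bytes
of an actual record:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class EmptyInstanceDemo {

  // Hypothetical stand-in for a cluster value class (not Mahout's DirichletCluster).
  public static class ToyCluster {
    public String model;       // stays null until readFields() fills it in
    public double totalCount;

    public ToyCluster() {}

    public void write(DataOutput out) throws IOException {
      out.writeUTF(model);
      out.writeDouble(totalCount);
    }

    public void readFields(DataInput in) throws IOException {
      model = in.readUTF();
      totalCount = in.readDouble();
    }
  }

  // Serialize one populated cluster to bytes, the way a writer would.
  public static byte[] serialize(String model, double count) throws IOException {
    ToyCluster c = new ToyCluster();
    c.model = model;
    c.totalCount = count;
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    c.write(new DataOutputStream(bytes));
    return bytes.toByteArray();
  }

  // Reflective construction only runs the no-arg constructor; no data is read.
  public static ToyCluster freshInstance() throws Exception {
    return ToyCluster.class.getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    byte[] record = serialize("NormalModel", 42.0);
    ToyCluster cluster = freshInstance();
    System.out.println("after newInstance: model=" + cluster.model);
    cluster.readFields(new DataInputStream(new ByteArrayInputStream(record)));
    System.out.println("after readFields:  model=" + cluster.model);
  }
}
```

In the real code it is reader.next(key, value) that does the equivalent of
the readFields() call for each record in the sequence file.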
If you really want to write your own code for reading the clusters, I
suggest copying the above and remembering to create a new value object
on each pass through your loop; otherwise the reader will reuse the first
instance and you will end up with all your clusters being identical.
Something like this:
    while (reader.next(key, value)) {
        DirichletCluster cluster = (DirichletCluster) value;
        String fmtStr = useJSON ? cluster.asJsonString()
                                : cluster.asFormatString(dictionary);
        // <save the cluster in some data structure>
        value = (Writable) reader.getValueClass().newInstance();
    }
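
If it helps to see the reuse problem in isolation, here is a small sketch
that needs only the JDK. Holder and ToyReader are hypothetical classes that
mimic how reader.next(key, value) overwrites the object it is handed; they
are not part of Hadoop or Mahout:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {

  // Hypothetical mutable value object; stands in for a Writable.
  public static class Holder {
    public String name;
  }

  // Hypothetical reader: like reader.next(key, value), it fills the passed object in place.
  public static class ToyReader {
    private final Iterator<String> records;

    public ToyReader(List<String> records) {
      this.records = records.iterator();
    }

    public boolean next(Holder value) {
      if (!records.hasNext()) {
        return false;
      }
      value.name = records.next();
      return true;
    }
  }

  // Pitfall: the same Holder is stored every time, so every entry shows the last record.
  public static List<Holder> collectReusing(List<String> data) {
    ToyReader reader = new ToyReader(data);
    List<Holder> out = new ArrayList<>();
    Holder value = new Holder();
    while (reader.next(value)) {
      out.add(value);
    }
    return out;
  }

  // Fix: hand the reader a fresh Holder after saving the current one.
  public static List<Holder> collectFresh(List<String> data) {
    ToyReader reader = new ToyReader(data);
    List<Holder> out = new ArrayList<>();
    Holder value = new Holder();
    while (reader.next(value)) {
      out.add(value);
      value = new Holder();
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> data = Arrays.asList("C-0", "C-1", "C-2");
    for (Holder h : collectReusing(data)) {
      System.out.print(h.name + " ");   // prints C-2 three times
    }
    System.out.println();
    for (Holder h : collectFresh(data)) {
      System.out.print(h.name + " ");   // prints C-0 C-1 C-2
    }
    System.out.println();
  }
}
```

The list built by collectReusing holds three references to one object, which
is the "all clusters identical" symptom; collectFresh keeps them distinct.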
Let me know how it goes,
Jeff
On 5/4/10 5:54 PM, Delroy Cameron wrote:
so i've run Dirichlet Clustering using Mahout and i'm trying to see the
clusterdump. Of course i'm using a combination of ClusterDumper,
DirichletOutputState and DirichletCluster and TestL1ModelClustering to help
with the output.
so far i've successfully read each file in each state-x output folder. The
issue is that the vectors appear to be serialized as <Text,
DirichletCluster> pairs in each binary dump, which is fine. However, after
debugging it turns out that the model for each DirichletCluster is
null....and this makes sense, since i'm reading from the dump file as
follows:
    SequenceFile.Reader reader = new SequenceFile.Reader(fileSystem, inputPath, conf);
    Text key = (Text) reader.getKeyClass().newInstance();
    DirichletCluster cluster = (DirichletCluster) reader.getValueClass().newInstance();
i tried to set the fields for the DirichletCluster by using the following
method readFields(DataInput in);
    DataInput istream = new DataInputStream(new FileInputStream(new File(fileName)));
    cluster.readFields(istream);
and i get a null pointer exception...
can i have a few suggestions on how to proceed here...
-----
--cheers
Delroy