Does the IndexFiles class store term vectors for the contents field?
If not, that could be the problem.
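
For context, a term vector is roughly a per-document map from each term in a field to its frequency in that document, and it is what the Lucene-to-Mahout driver has to read back out of the index. The plain-Java sketch below only illustrates that idea. It is not Lucene or Mahout code: the class name, method name, and naive tokenizer are all made up for the example. As far as I recall, in Lucene versions of that era the equivalent switch is the Field.TermVector argument passed to the Field constructor when the contents field is added, but check that against the version you are running.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Lucene or Mahout.
class TermVectorSketch {

    // Builds a term -> frequency map for one document's field text,
    // which is roughly the information a stored term vector carries.
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> freqs = new TreeMap<String, Integer>();
        // Naive tokenizer (an assumption for this sketch): lowercase,
        // then split on anything that is not a letter or a digit.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            Integer old = freqs.get(token);
            freqs.put(token, old == null ? 1 : old + 1);
        }
        return freqs;
    }

    public static void main(String[] args) {
        // Print the term/frequency pairs for a small sample string.
        System.out.println(termVector("The cat saw the other cat"));
    }
}
```

If the index holds nothing like this for the contents field, there is little for the driver to turn into vectors, which would be consistent with a tiny output file.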

Also, you can try dumping the vector file using
o.a.m.utils.vectors.VectorDumper in mahout-utils and taking a look to
see what's in there.

Failing that, in mahout-examples, you can run ./bin/build-reuters.sh.
That will generate a known-good set of vectors, and you can try
running clustering on that. There's no need to let build-reuters.sh run
to completion: watch stdout and kill it once the vectors are done,
because it then starts running LDA, which you're not really interested
in at this point. Once this has run, the vectors themselves can be found
in work/vectors and the dictionary in work/dict.txt (relative to the
mahout-examples directory).

On Sat, Dec 19, 2009 at 7:41 PM, Benson Margulies <[email protected]> wrote:
> So,
>
> I took the stock Lucene 'IndexFiles' class. I modified it to read
> UTF-8. I ran it.
>
> I ran the following:
>
> java -cp $cp org.apache.mahout.utils.vectors.lucene.Driver \
>   --dir he_lucene_index \
>   --output he_mahout_vector --field contents --dictOut he_mahout_dict \
>   --idField path
>
> and am rewarded with a tiny file of vectors. Clearly I'm messing something up.
>
