So, I took the stock Lucene 'IndexFiles' class, modified it to read UTF-8, and ran it to build the index.
I then ran the following:

    java -cp $cp org.apache.mahout.utils.vectors.lucene.Driver --dir he_lucene_index \
      --output he_mahout_vector --field contents --dictOut he_mahout_dict \
      --idField path

and was rewarded with a tiny file of vectors. Clearly I'm messing something up.
