I don't know if it helps, but I have a sparse vector file that is based on a 1.8 MB Lucene index, and it takes up only 143 KB. Earlier, I had a Lucene index that was over 20 MB, and the vectors only took 1 MB.
Have you tried debugging? If I can finish up my chapter tonight, I will try to take a closer look.

On Jan 10, 2010, at 6:45 AM, Robin Anil wrote:

> I have been testing out the DictionaryVectorizer on the 20news dataset. It's
> writing out 2 GB vector files for the 38 MB dataset.
>
> This is what I am doing. Tell me where I am going wrong.
>
> First I create an effectively infinite-dimensional vector (cardinality
> Integer.MAX_VALUE) with an initial capacity of 10:
>
>   SparseVector vector = new SparseVector(key.toString(), Integer.MAX_VALUE, 10);
>
>   for each (word => int id) in dictionary:
>       vector.setQuick(dictionary.get(word), weight);
>
>   output.write(docid, vector);
>
> Robin

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
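For reference, the intended pattern is that the on-disk size should track the number of non-zero entries per document, not the declared cardinality. Here is a minimal self-contained sketch of that idea in plain Java (not Mahout's actual SparseVector API; the class and method names below are made up for illustration), storing only id => weight pairs so a huge cardinality costs nothing:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a dictionary-backed sparse document vector.
// Only non-zero term weights are stored, so memory and serialized size
// grow with the document's distinct terms, not with the cardinality.
public class SparseDocVector {
    private final Map<Integer, Double> weights = new HashMap<>();

    // Mirrors the setQuick(index, weight) call in the quoted code:
    // record a weight for a dictionary id, skipping zeros.
    public void setQuick(int index, double weight) {
        if (weight != 0.0) {
            weights.put(index, weight);
        }
    }

    public int numNonZero() {
        return weights.size();
    }

    // Mirrors the quoted loop: for each word in the document, look up its
    // dictionary id and set its weight. Words missing from the dictionary
    // are skipped.
    public static SparseDocVector vectorize(Map<String, Integer> dictionary,
                                            Map<String, Double> termWeights) {
        SparseDocVector v = new SparseDocVector();
        for (Map.Entry<String, Double> e : termWeights.entrySet()) {
            Integer id = dictionary.get(e.getKey());
            if (id != null) {
                v.setQuick(id, e.getValue());
            }
        }
        return v;
    }
}
```

If the output files are growing with the cardinality rather than with the non-zero count, that points at the serialization path writing a dense representation, which is where I would start debugging.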