I don't know if it helps, but I have a sparse vector file that is based on a 
1.8 MB Lucene index and it takes up 143 KB.  Earlier, I had a Lucene index that 
was several megabytes (20+) and the vectors only took 1 MB.

Have you tried debugging?  If I can finish up my chapter tonight, I will try to 
take a closer look.

On Jan 10, 2010, at 6:45 AM, Robin Anil wrote:

> I have been testing out the DictionaryVectorizer on the 20news dataset. It's
> writing out 2 GB vector files for the 38 MB dataset.
> 
> This is what I am doing. Tell me where I am going wrong:
> 
> First, I create an effectively infinite-dimensional vector with an initial
> size of 10:
> 
> SparseVector vector =
>     new SparseVector(key.toString(), Integer.MAX_VALUE, 10);
> 
> Then, for each term in the dictionary, I set its weight by id:
> 
> for (Map.Entry<String, Integer> entry : dictionary.entrySet())
>   vector.setQuick(entry.getValue(), weight);
> 
> output.write(docid, vector);
> Robin
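
One thing that might be worth ruling out: bound the cardinality to the
dictionary size instead of Integer.MAX_VALUE and see whether the file size
changes.  Below is a minimal sketch of what I mean, assuming dictionary is a
Map<String, Integer> from term to id and that weight is already computed per
term; I haven't verified that the cardinality is what is inflating your
files, so treat it as something to try while debugging rather than a fix.

// cardinality = vocabulary size; assumes ids are contiguous in [0, dictionary.size())
SparseVector vector =
    new SparseVector(key.toString(), dictionary.size(), 10);

for (Map.Entry<String, Integer> entry : dictionary.entrySet()) {
  // same setQuick() call as in your snippet, minus the extra dictionary.get() lookup
  vector.setQuick(entry.getValue(), weight);
}

output.write(docid, vector);

If the files are still huge with the bounded cardinality, then I would look
at how the vectors are being serialized on write, and that is where I would
start debugging.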

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
