On Jul 22, 2009, at 11:50 AM, Robin Anil wrote:
On Wed, Jul 22, 2009 at 8:55 PM, Grant Ingersoll <[email protected]> wrote:
On Jul 22, 2009, at 10:38 AM, Robin Anil wrote:
Dear Grant, could you post some stats like the number of labels and features that you have, and the number of unique <label, feature> pairs?
labels: history and science
Docs trained on: chunks 1 - 60, generated using the Wikipedia Splitter with the WikipediaAnalyzer (MAHOUT-146), with chunk size set to 64.
Where are the <label,feature> values stored?
tf-Idf Folder part-****
That's 1.28 GB. Count: 31216595
(FYI, I modified the SequenceFileDumper to spit out counts from a SeqFile.)
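For reference, something along these lines (a standalone sketch, not the exact SequenceFileDumper change; it just walks the file with the standard Hadoop SequenceFile.Reader API and tallies records):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch only, not the actual SequenceFileDumper patch: count the key/value
// records in a single SequenceFile by iterating it once.
public class SeqFileCounter {
  public static long count(Path path, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long count = 0;
    while (reader.next(key, value)) {
      count++;
    }
    reader.close();
    return count;
  }
}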
Both Naive Bayes and Complementary Naive Bayes use the same data, except for the Sigma_j set.
So, why do I need to load it or even calculate it if I am using Bayes? I think I would like to have the choice. That is, if I plan on using both, then I can calculate/load both. At a minimum, when classifying with Bayes, we should not be loading it, even if we did calculate it. Thoughts on this? Can I disable it for Bayes?
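For illustration, the kind of switch I have in mind (the class, flag, and helper names here are hypothetical, not the current Mahout trainer/classifier code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: only pull the Sigma_j sums into memory when the
// complementary variant actually needs them.
public class SigmaJGate {
  static double[] maybeLoadSigmaJ(boolean complementary, Path modelBase, Configuration conf)
      throws Exception {
    if (!complementary) {
      return null; // plain Naive Bayes: skip the ~1 GB Sigma_j load entirely
    }
    return loadSigmaJ(new Path(modelBase, "Sigma_j"), conf);
  }

  static double[] loadSigmaJ(Path path, Configuration conf) {
    // the actual SequenceFile read would go here
    throw new UnsupportedOperationException("illustrative stub");
  }
}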
Could you add some writeup on http://cwiki.apache.org/MAHOUT/bayesian.html about the steps that are taken? Also, I've read the CNB paper, but do you have a reference for the NB part using many of these values?
Sure, I will ASAP.
But regardless, the matrix stored is sparse. I am not surprised that the memory limit was crossed with a larger set like the one you have taken. Another thing: the number of unique terms in Wikipedia is quite large. So the best choice for you right now is to use the HBase solution; the large matrix is stored easily on it. I am currently writing the distributed version of the HBase classification for parallelizing.
HBase isn't an option right now, as it isn't committed and I'm putting together a demo on current capabilities.
Robin
On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll <[email protected]> wrote:
The other thing is, I don't think Sigma_J is even used for Bayes, only Complementary Bayes.
On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J directory under the model. For me, this file is 1.04 GB. The values in this file are loaded into a List of Doubles (which brings with it a whole lot of auto-boxing, too). It seems like that should fit in memory, especially since it is the first thing loaded, AFAICT. I have not yet looked into the structure of the file itself.
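Just to put a rough number on the boxing cost: a boxed Double is around 16 bytes of object plus a 4-8 byte reference per list slot, versus a flat 8 bytes per entry in a primitive double[]. Here is a sketch of reading the same values into a primitive array instead; it assumes the Sigma_J file is a SequenceFile with DoubleWritable values, which I haven't verified:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Illustrative only: accumulate the weights in a growable primitive array so
// no value gets boxed into a java.lang.Double.  The DoubleWritable value type
// is an assumption about the file layout.
public class PrimitiveSigmaJLoader {
  public static double[] load(Path path, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    DoubleWritable value = new DoubleWritable();
    double[] values = new double[1024];
    int n = 0;
    while (reader.next(key, value)) {
      if (n == values.length) {
        values = Arrays.copyOf(values, n * 2); // grow geometrically
      }
      values[n++] = value.get();
    }
    reader.close();
    return Arrays.copyOf(values, n); // trim to the actual count
  }
}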
I guess I will have to dig deeper; this code has changed a lot from when I first wrote it as a very simple naive bayes model to one that now appears to be weighted by TF-IDF, normalization, etc., and I need to understand it better.
On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
This is kind of surprising. It would seem that this model shouldn't have more than a few doubles per unique term, and there should be < half a million terms. Even with pretty evil data structures, this really shouldn't be more than a few hundred megs for the model alone.
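To make that concrete, a quick back-of-the-envelope; the term count and doubles-per-term below are guesses, just to show the order of magnitude:

// Rough sanity check on the sizes above; every input here is an assumption.
public class ModelSizeEstimate {
  public static void main(String[] args) {
    long terms = 500000;        // "< half a million terms" (upper bound)
    int doublesPerTerm = 3;     // "a few doubles per unique term"
    long flat = terms * doublesPerTerm * 8L;                // plain double[]: ~12 MB
    long perEntryOverhead = 100;                            // generous HashMap node + boxing per value
    long boxed = terms * doublesPerTerm * perEntryOverhead; // still only ~150 MB
    System.out.println("flat: " + (flat / 1000000) + " MB, boxed/hashed: " + (boxed / 1000000) + " MB");
  }
}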
Sparsity *is* a virtue with these models, and I always try to eliminate terms that might as well have zero value, but that doesn't sound like the root problem here.
Regarding strings or Writables, strings have the wonderful characteristic that they cache their hashed value. This means that hash maps are nearly as fast as arrays, because you wind up indexing to nearly the right place and then do a few (or one) integer compares to find the right value. Custom data types rarely do this and thus wind up slow.
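A tiny example of the caching trick being described; it mirrors what java.lang.String does internally, and the class itself is made up for illustration:

import java.util.Arrays;

// Illustrative key type that, like java.lang.String, computes its hash once
// and reuses it on every map lookup.  Not an existing Mahout or Hadoop class.
public final class CachedHashKey {
  private final byte[] bytes;
  private int hash; // 0 means "not computed yet", same convention as String

  public CachedHashKey(byte[] bytes) {
    this.bytes = bytes.clone();
  }

  @Override
  public int hashCode() {
    int h = hash;
    if (h == 0) {
      for (byte b : bytes) {
        h = 31 * h + b;
      }
      hash = h;
    }
    return h;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof CachedHashKey)) {
      return false;
    }
    return Arrays.equals(bytes, ((CachedHashKey) o).bytes);
  }
}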
On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <[email protected]> wrote:
I trained on a couple of categories (history and science) on quite a few docs, but now the model is so big I can't load it, even with almost 3 GB of memory.
--
Ted Dunning, CTO
DeepDyve
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search