On Jul 22, 2009, at 11:50 AM, Robin Anil wrote:
On Wed, Jul 22, 2009 at 8:55 PM, Grant Ingersoll <[email protected]> wrote:
On Jul 22, 2009, at 10:38 AM, Robin Anil wrote:
Dear Grant, could you post some stats like the number of labels and features that you have, and the number of unique <label, feature> pairs?
labels: history and science
Docs trained on: chunks 1 - 60, generated using the Wikipedia Splitter with the WikipediaAnalyzer (MAHOUT-146), with chunk size set to 64.
Where are the <label,feature> values stored?
tf-Idf Folder part-****
That's 1.28 GB. Count: 31216595
(FYI, I modified the SequenceFileDumper to spit out counts from a SeqFile.)
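For reference, something along these lines (a standalone sketch, not the exact SequenceFileDumper change; it just walks the file with the standard Hadoop SequenceFile.Reader API and tallies records):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch only, not the actual SequenceFileDumper patch: count the key/value
// records in a single SequenceFile by iterating it once.
public class SeqFileCounter {
  public static long count(Path path, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long count = 0;
    while (reader.next(key, value)) {
      count++;
    }
    reader.close();
    return count;
  }
}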
Both Naive Bayes and Complementary Naive Bayes use the same data, except for the Sigma_j set.
So, why do I need to load it or even calculate it if I am using Bayes? I think I would like to have the choice. That is, if I plan on using both, then I can calculate/load both. At a minimum, when classifying with Bayes, we should not be loading it, even if we did calculate it. Thoughts on this? Can I disable it for Bayes?
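For illustration, the kind of switch I have in mind (the class, flag, and helper names here are hypothetical, not the current Mahout trainer/classifier code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: only pull the Sigma_j sums into memory when the
// complementary variant actually needs them.
public class SigmaJGate {
  static double[] maybeLoadSigmaJ(boolean complementary, Path modelBase, Configuration conf)
      throws Exception {
    if (!complementary) {
      return null; // plain Naive Bayes: skip the ~1 GB Sigma_j load entirely
    }
    return loadSigmaJ(new Path(modelBase, "Sigma_j"), conf);
  }

  static double[] loadSigmaJ(Path path, Configuration conf) {
    // the actual SequenceFile read would go here
    throw new UnsupportedOperationException("illustrative stub");
  }
}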
Could you add some writeup on http://cwiki.apache.org/MAHOUT/bayesian.html about the steps that are taken? Also, I've read the CNB paper, but do you have a reference for the NB part using many of these values?
Sure, I will ASAP.
But regardless, the matrix stored is sparse. I am not surprised that the memory limit was crossed with a larger set like the one you have taken. Another thing: the number of unique terms in Wikipedia is quite large. So the best choice for you right now is to use the HBase solution; the large matrix is stored easily on it. I am currently writing the distributed version of the HBase classification for parallelizing.
HBase isn't an option right now, as it isn't committed and I'm putting together a demo on current capabilities.
Robin
On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll <[email protected]> wrote:
The other thing is, I don't think Sigma_J is even used for Bayes, only Complementary Bayes.
On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J directory under the model. For me, this file is 1.04 GB. The values in this file are loaded into a List of Doubles (which brings with it a whole lot of auto-boxing, too). It seems like that should fit in memory, especially since it is the first thing loaded, AFAICT. I have not yet looked into the structure of the file itself.
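Just to put a rough number on the boxing cost: a boxed Double is around 16 bytes of object plus a 4-8 byte reference per list slot, versus a flat 8 bytes per entry in a primitive double[]. Here is a sketch of reading the same values into a primitive array instead; it assumes the Sigma_J file is a SequenceFile with DoubleWritable values, which I haven't verified:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Illustrative only: accumulate the weights in a growable primitive array so
// no value gets boxed into a java.lang.Double.  The DoubleWritable value type
// is an assumption about the file layout.
public class PrimitiveSigmaJLoader {
  public static double[] load(Path path, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    DoubleWritable value = new DoubleWritable();
    double[] values = new double[1024];
    int n = 0;
    while (reader.next(key, value)) {
      if (n == values.length) {
        values = Arrays.copyOf(values, n * 2); // grow geometrically
      }
      values[n++] = value.get();
    }
    reader.close();
    return Arrays.copyOf(values, n); // trim to the actual count
  }
}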
I guess I will have to dig deeper; this code has changed a lot from when I first wrote it as a very simple naive bayes model to one that now appears to be weighted by TF-IDF, normalization, etc., and I need to understand it better.
On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
This is kind of surprising. It would seem that this model shouldn't have more than a few doubles per unique term, and there should be < half a million terms. Even with pretty evil data structures, this really shouldn't be more than a few hundred megs for the model alone.
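To make that concrete, a quick back-of-the-envelope; the term count and doubles-per-term below are guesses, just to show the order of magnitude:

// Rough sanity check on the sizes above; every input here is an assumption.
public class ModelSizeEstimate {
  public static void main(String[] args) {
    long terms = 500000;        // "< half a million terms" (upper bound)
    int doublesPerTerm = 3;     // "a few doubles per unique term"
    long flat = terms * doublesPerTerm * 8L;                // plain double[]: ~12 MB
    long perEntryOverhead = 100;                            // generous HashMap node + boxing per value
    long boxed = terms * doublesPerTerm * perEntryOverhead; // still only ~150 MB
    System.out.println("flat: " + (flat / 1000000) + " MB, boxed/hashed: " + (boxed / 1000000) + " MB");
  }
}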
Sparsity *is* a virtue with these models, and I always try to eliminate terms that might as well have zero value, but that doesn't sound like the root problem here.
Regarding strings or Writables, strings have the wonderful characteristic that they cache their hashed value. This means that hash maps are nearly as fast as arrays, because you wind up indexing to nearly the right place and then do a few (or one) integer compares to find the right value. Custom data types rarely do this and thus wind up slow.
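A tiny example of the caching trick being described; it mirrors what java.lang.String does internally, and the class itself is made up for illustration:

import java.util.Arrays;

// Illustrative key type that, like java.lang.String, computes its hash once
// and reuses it on every map lookup.  Not an existing Mahout or Hadoop class.
public final class CachedHashKey {
  private final byte[] bytes;
  private int hash; // 0 means "not computed yet", same convention as String

  public CachedHashKey(byte[] bytes) {
    this.bytes = bytes.clone();
  }

  @Override
  public int hashCode() {
    int h = hash;
    if (h == 0) {
      for (byte b : bytes) {
        h = 31 * h + b;
      }
      hash = h;
    }
    return h;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof CachedHashKey)) {
      return false;
    }
    return Arrays.equals(bytes, ((CachedHashKey) o).bytes);
  }
}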
On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <[email protected]> wrote:
I trained on a couple of categories (history and science) on quite a few docs, but now the model is so big I can't load it, even with almost 3 GB of memory.
--
Ted Dunning, CTO
DeepDyve
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search