Hi,
Thanks for responding. I will fill in some details in addition to those
already provided by my colleague Loek. This applies to Mahout release 0.3.
Robin Anil wrote:
On Tue, Mar 30, 2010 at 10:45 PM, Loek Cleophas
<loek.cleop...@kalooga.com>wrote:
Hi,
after my initial experiments with Mahout's Bayes and CBayes implementations
on my company's dataset, we're now trying to integrate Mahout to classify
our data in a production environment. However, we are running into two odd
issues, after having successfully trained a classifier (using CBayes).
We're loading the trained model into an InMemoryBayesDataStore, and are
able to get classification results (i.e. categories plus weights). However,
we're seeing two odd issues:
1) it turns out the classifier's memory use grows with every document it
classifies; as a result, after classifying a number of documents, we run
into memory issues.
2) somehow, classification is not consistent: e.g. if we classify texts 1,
2, 3, and then text 1 again, the second time text 1 is fed we get slightly
different weights. The difference is not large, but too large to dismiss as
floating-point rounding. Yet if we classify text 1 and then text 1 again,
without any intermediate classification of other texts, the weights do not
change.
This shouldn't be happening. I mean there is nothing getting changed in
there.
I believe this is not the case. The field featureDictionary in
InMemoryBayesDatastore is the cause of both the severe memory growth and
the inconsistent classification results. This is how I noticed it: in a
small test case, I load a classifier and feed it a few documents. The
algorithm is CBayesAlgorithm and the datastore is
InMemoryBayesDatastore. After the classifier is fully loaded, that is,
after initialize() has been called, I start debugging the Mahout code. It
turns out that features are put into the featureDictionary DURING
classification. Here's a stack trace:
Thread [main] (Suspended (breakpoint at line 143 in InMemoryBayesDatastore))
    InMemoryBayesDatastore.getFeatureID(String) line: 143
    InMemoryBayesDatastore.getWeight(String, String, String) line: 106
    CBayesAlgorithm.featureWeight(Datastore, String, String) line: 93
    CBayesAlgorithm$1.apply(String, int) line: 136
    CBayesAlgorithm$1.apply(Object, int) line: 131
    OpenObjectIntHashMap<T>.forEachPair(ObjectIntProcedure<T>) line: 186
    CBayesAlgorithm.documentWeight(Datastore, String, String[]) line: 131
    CBayesAlgorithm.classifyDocument(String[], Datastore, String, int) line: 69
    ClassifierContext.classifyDocument(String[], String, int) line: 81
Line 143 is a 'put' into the featureDictionary. Of course, not every
feature in a document is newly added, as some are already present
(presumably loaded during the classifier's initialization stage). But
because the featureDictionary is used in the classification calculation,
results are not always consistent: the order of the documents, and thus
the order in which features first appear, seems to influence the label
scores for each document.
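To make the pattern concrete, here is a minimal sketch. This is not the
actual Mahout source; the class and any names besides getFeatureID are
invented for illustration. It shows a lookup that mutates its dictionary
on a miss, next to a read-only variant:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch (NOT the actual Mahout code) of the problematic
// pattern: a lookup that silently grows its dictionary on a miss.
class LeakyDictionary {
    private final Map<String, Integer> featureDictionary = new HashMap<>();

    // Mutating lookup: every unseen feature gets a fresh ID and is
    // stored forever, so classification (which should be read-only)
    // both grows memory and makes later results depend on the order
    // in which documents were seen.
    int getFeatureID(String feature) {
        Integer id = featureDictionary.get(feature);
        if (id == null) {
            id = featureDictionary.size();
            featureDictionary.put(feature, id); // mutation on the read path
        }
        return id;
    }

    // Read-only alternative: unknown features map to a sentinel ID
    // and the dictionary stays untouched during classification.
    int lookupFeatureID(String feature) {
        return featureDictionary.getOrDefault(feature, -1);
    }

    int size() {
        return featureDictionary.size();
    }
}
```

With the mutating variant, two identical documents classified at different
points in a run can observe different dictionary contents, which would be
consistent with the order-dependent weights we are seeing.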
My colleagues and I have looked at the Mahout code, and it seems that the
memory use increase is due to getLabelID in InMemoryBayesDatastore - which
adds a label to a dictionary if it's not in there yet, but never seems to
remove any labels from the dictionary. Could this be the source of the
memory issue? I can imagine that if you're adding words that were not in the
model but occur in text to be classified, this might increase memory use but
probably shouldn't be happening (as it's classification, not training).
The number of labels is fixed, right? Two or three, up to a few hundred or
a few thousand, not more. I don't see why that should be a reason for
alarm. If there is a leak, it is probably caused by something else.
The number of labels is indeed fixed. Therefore, on second thought, the
labelDictionary is not really the problem. (At least not solely the
problem).
Any thoughts on these two issues, whether they're related, and what to do
about them?
Robin, I suspect/hope you're able to help here?
Can you tell me the model size in memory after first load, the dictionary
size, the label size, and the increment in memory usage after a
classification?
We trained a classifier using about 37000 input docs (10MB textual data)
and about 10 labels. The resulting classifier is 130MB on disk.
Loading the classifier into memory (after initializing) requires 600MB.
During classifying, the memory footprint slowly increases. The exact
growth after each document classification is not always the same and
difficult to measure accurately. Therefore I'll resort to totals.
The number of input tokens per document is the main factor in how many
documents we are able to classify before a "GC overhead limit exceeded"
error kicks in. In this particular case we're using a 1GB heap size.
Sometimes we are able to classify 100,000 documents, but with heavy
documents the limit can be as low as 20,000 documents. To be certain we're
dealing with a memory leak in Mahout, I reran the classifying code with a
simple change: instead of using a single classifier, I reload the
classifier every 1000 documents (by creating a new InMemoryBayesDatastore
and discarding the old one). This allows the garbage collector to reclaim
the old InMemoryBayesDatastore. With this change we can classify all the
documents that would previously have failed.
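The reload workaround can be sketched roughly as follows. This is a
simplified illustration, not our production code: the Classifier interface
and the loader are hypothetical stand-ins for our actual
ClassifierContext/InMemoryBayesDatastore setup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the application's classifier setup
// (ClassifierContext plus InMemoryBayesDatastore in our case).
interface Classifier {
    String classify(String[] document);
}

class PeriodicReload {
    // Classify all documents, rebuilding the classifier every
    // `reloadEvery` documents so the garbage collector can reclaim
    // the old datastore and its grown featureDictionary.
    static int classifyAll(List<String[]> documents,
                           Supplier<Classifier> loader,
                           int reloadEvery) {
        Classifier classifier = loader.get();
        int loads = 1;
        int count = 0;
        for (String[] doc : documents) {
            classifier.classify(doc);
            if (++count % reloadEvery == 0) {
                classifier = loader.get(); // drop old instance; GC reclaims it
                loads++;
            }
        }
        return loads; // number of times the classifier was (re)loaded
    }
}
```

Obviously this only masks the leak by bounding how long any one datastore
lives; fixing the mutating dictionary lookup would be the real solution.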
Regards,
Loek
Robin
Regards, Ferdy