Hi,
Thanks for responding. I will fill in some details in addition to those
already provided by my colleague Loek. This applies to Mahout release 0.3.
Robin Anil wrote:
On Tue, Mar 30, 2010 at 10:45 PM, Loek Cleophas
<loek.cleop...@kalooga.com>wrote:
Hi,
after my initial experiments with Mahout's Bayes and CBayes implementations
on my company's dataset, we're now trying to integrate Mahout to classify
our data in a production environment. However, we are running into two odd
issues, after having successfully trained a classifier (using CBayes).
We're loading the trained model into an InMemoryBayesDataStore, and are
able to get classification results (i.e. categories plus weights). However,
we're seeing two odd issues:
1) it turns out the classifier's memory use grows with every document it
classifies; as a result, after classifying a number of documents, we run
into memory issues.
2) somehow, classification is not consistent: e.g. if we classify texts 1,
2, 3, and then text 1 again, the second time text 1 is fed we get slightly
different weights. The difference is not large, but too large to dismiss as
floating-point rounding. Yet if we classify text 1 and then text 1 again,
without any intermediate classification of other texts, the weights do not
change.
This shouldn't be happening. I mean there is nothing getting changed in
there.
I believe this is not the case. The field featureDictionary in
InMemoryBayesDatastore is the cause of both the severe memory growth and
the inconsistent classification results. This is how I noticed it: in a
small test case, I load a classifier and feed it a few documents. The
algorithm is CBayesAlgorithm and the datastore is
InMemoryBayesDatastore. After the classifier is fully loaded, that is,
after initialize() has been called, I start debugging the Mahout code. It
turns out that features are put into the featureDictionary DURING
classification. Here's a stack trace:
Thread [main] (Suspended (breakpoint at line 143 in InMemoryBayesDatastore))
    InMemoryBayesDatastore.getFeatureID(String) line: 143
    InMemoryBayesDatastore.getWeight(String, String, String) line: 106
    CBayesAlgorithm.featureWeight(Datastore, String, String) line: 93
    CBayesAlgorithm$1.apply(String, int) line: 136
    CBayesAlgorithm$1.apply(Object, int) line: 131
    OpenObjectIntHashMap<T>.forEachPair(ObjectIntProcedure<T>) line: 186
    CBayesAlgorithm.documentWeight(Datastore, String, String[]) line: 131
    CBayesAlgorithm.classifyDocument(String[], Datastore, String, int) line: 69
    ClassifierContext.classifyDocument(String[], String, int) line: 81
Line 143 is a 'put' into the featureDictionary. Of course, not every
feature in a document is newly added, as some are already present
(presumably loaded during the classifier's initialization stage). But
because the featureDictionary is used in the classification calculation,
results are not always consistent: the order of the documents, and thus
the order in which features first appear, seems to influence the label
scores for each document.
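To make the pattern concrete, here is a minimal sketch. This is not the
actual Mahout source; the class and any names besides getFeatureID are
invented for illustration. It shows a lookup that mutates its dictionary
on a miss, next to a read-only variant:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch (NOT the actual Mahout code) of the problematic
// pattern: a lookup that silently grows its dictionary on a miss.
class LeakyDictionary {
    private final Map<String, Integer> featureDictionary = new HashMap<>();

    // Mutating lookup: every unseen feature gets a fresh ID and is
    // stored forever, so classification (which should be read-only)
    // both grows memory and makes later results depend on the order
    // in which documents were seen.
    int getFeatureID(String feature) {
        Integer id = featureDictionary.get(feature);
        if (id == null) {
            id = featureDictionary.size();
            featureDictionary.put(feature, id); // mutation on the read path
        }
        return id;
    }

    // Read-only alternative: unknown features map to a sentinel ID
    // and the dictionary stays untouched during classification.
    int lookupFeatureID(String feature) {
        return featureDictionary.getOrDefault(feature, -1);
    }

    int size() {
        return featureDictionary.size();
    }
}
```

With the mutating variant, two identical documents classified at different
points in a run can observe different dictionary contents, which would be
consistent with the order-dependent weights we are seeing.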
My colleagues and I have looked at the Mahout code, and it seems that the
memory use increase is due to getLabelID in InMemoryBayesDatastore - which
adds a label to a dictionary if it's not in there yet, but never seems to
remove any labels from the dictionary. Could this be the source of the
memory issue? I can imagine that if you're adding words that were not in the
model but occur in text to be classified, this might increase memory use but
probably shouldn't be happening (as it's classification, not training).
The number of labels is fixed, right? Two or three, up to a few hundred or
a few thousand, not more. I don't see why that should be a reason for
alarm. If there is a leak, it is probably caused by something else.
The number of labels is indeed fixed. Therefore, on second thought, the
labelDictionary is not really the problem. (At least not solely the
problem).
Any thoughts on these two issues, whether they're related, and what to do
about them?
Robin, I suspect/hope you're able to help here?
Can you tell me the model size in memory after first load, the dictionary
size, the label size, and the increment in memory usage after a
classification?
We trained a classifier using about 37000 input docs (10MB textual data)
and about 10 labels. The resulting classifier is 130MB on disk.
Loading the classifier into memory (after initializing) requires 600MB.
During classifying, the memory footprint slowly increases. The exact
growth after each document classification is not always the same and
difficult to measure accurately. Therefore I'll resort to totals.
The number of input tokens per document is the main factor in how many
documents we are able to classify before a "GC overhead limit exceeded"
error kicks in. In this particular case we're using a 1GB heap size.
Sometimes we are able to classify 100,000 documents, but with heavy
documents the limit can be as low as 20,000 documents. To be certain we're
dealing with a memory leak in Mahout, I reran the classifying code with a
simple change: instead of using a single classifier, I reload the
classifier every 1000 documents (by creating a new InMemoryBayesDatastore
and discarding the old one). This allows the garbage collector to reclaim
the old InMemoryBayesDatastore. With this change we can classify all the
documents that would previously have failed.
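The reload workaround can be sketched roughly as follows. This is a
simplified illustration, not our production code: the Classifier interface
and the loader are hypothetical stand-ins for our actual
ClassifierContext/InMemoryBayesDatastore setup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the application's classifier setup
// (ClassifierContext plus InMemoryBayesDatastore in our case).
interface Classifier {
    String classify(String[] document);
}

class PeriodicReload {
    // Classify all documents, rebuilding the classifier every
    // `reloadEvery` documents so the garbage collector can reclaim
    // the old datastore and its grown featureDictionary.
    static int classifyAll(List<String[]> documents,
                           Supplier<Classifier> loader,
                           int reloadEvery) {
        Classifier classifier = loader.get();
        int loads = 1;
        int count = 0;
        for (String[] doc : documents) {
            classifier.classify(doc);
            if (++count % reloadEvery == 0) {
                classifier = loader.get(); // drop old instance; GC reclaims it
                loads++;
            }
        }
        return loads; // number of times the classifier was (re)loaded
    }
}
```

Obviously this only masks the leak by bounding how long any one datastore
lives; fixing the mutating dictionary lookup would be the real solution.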
Regards,
Loek
Robin
Regards, Ferdy