Multiple language identifier instances
--------------------------------------
Key: NUTCH-314
URL: http://issues.apache.org/jira/browse/NUTCH-314
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: OS: Linux RHEL 4
JDK: 1.5_07
Reporter: Enrico Triolo
In my application I often need to perform the inject -> generate -> .. -> index
loop multiple times, since users can 'suggest' new web pages to be crawled and
indexed.
I also need to enable the language identifier plugin.
Everything seems to work correctly, but after some time I get an
OutOfMemoryError. Elapsed time isn't the real trigger, though: the problem
arises when the user submits many URLs (~100). As I said, for each
submitted URL a new loop is performed (similar to the one in the Crawl.main
method).
Using a profiler (specifically, the NetBeans profiler) I found out that for
each submitted URL a new LanguageIdentifier instance is created and never
released. With the memory inspector tool I can see as many instances of
LanguageIdentifier and NGramProfile$NGramEntry as the number of fetched pages,
each of them occupying about 180 KB. Forcing garbage collection doesn't release
much memory.
Maybe we should cache its instance in the conf, as we do for many other objects
in Nutch.
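
To illustrate the caching idea, here is a minimal sketch. It is not a patch
against Nutch: Conf, its getObject/setObject methods, the cache key, and this
LanguageIdentifier class are all hypothetical stand-ins, and the real classes
may look different. The point is only that repeated lookups against the same
conf return one shared instance instead of building a new one per URL.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the configuration object; NOT Nutch's real API.
final class Conf {
    private final Map<String, Object> objects = new HashMap<>();

    synchronized Object getObject(String key) {
        return objects.get(key);
    }

    synchronized void setObject(String key, Object value) {
        objects.put(key, value);
    }
}

// Hypothetical stand-in for the expensive-to-build language identifier.
final class LanguageIdentifier {
    // Assumed cache key, chosen for this sketch only.
    private static final String KEY = LanguageIdentifier.class.getName();

    // Counts constructions so the effect of caching is observable.
    static int instancesCreated = 0;

    private LanguageIdentifier() {
        // The real class would load its n-gram profiles here.
        instancesCreated++;
    }

    // Returns the instance cached in conf, creating it on first use.
    static LanguageIdentifier get(Conf conf) {
        synchronized (conf) {
            LanguageIdentifier id = (LanguageIdentifier) conf.getObject(KEY);
            if (id == null) {
                id = new LanguageIdentifier();
                conf.setObject(KEY, id);
            }
            return id;
        }
    }
}

final class LanguageIdentifierCacheDemo {
    public static void main(String[] args) {
        Conf conf = new Conf();
        // Simulate 100 submitted URLs sharing one conf.
        for (int i = 0; i < 100; i++) {
            LanguageIdentifier.get(conf);
        }
        System.out.println("instances created: " + LanguageIdentifier.instancesCreated);
    }
}
```

With this shape, the instance lives exactly as long as the conf that holds it,
so the fix only helps if the loop reuses the same conf across submitted URLs.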