Multiple language identifier instances
--------------------------------------

         Key: NUTCH-314
         URL: http://issues.apache.org/jira/browse/NUTCH-314
     Project: Nutch
        Type: Bug

    Versions: 0.8-dev
 Environment: OS: Linux RHEL 4
              JDK: 1.5_07

    Reporter: Enrico Triolo


In my application I often need to run the inject -> generate -> ... -> index
loop multiple times, since users can 'suggest' new web pages to be crawled and
indexed.
I also need to enable the language identifier plugin.

Everything seems to work correctly, but after some time I get an
OutOfMemoryError. Elapsed time isn't actually the trigger: I noticed that
the problem arises when users submit many URLs (~100). As I said, each
submitted URL starts a new loop (similar to the one in the Crawl.main
method).

Using a profiler (specifically, the NetBeans profiler) I found that for each
submitted URL a new LanguageIdentifier instance is created and never released.
With the memory inspector tool I can see as many instances of
LanguageIdentifier and NGramProfile$NGramEntry as the number of fetched pages,
each of them occupying about 180 KB. Forcing garbage collection doesn't release
much memory.

Maybe we should cache the LanguageIdentifier instance in the conf, as we do for
many other objects in Nutch.
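
The caching idea could look roughly like the sketch below. It is only an
illustration, not the actual Nutch code: Conf and IdentifierStub are
hypothetical stand-ins for the configuration object and for
LanguageIdentifier, and IdentifierCache is a made-up helper. It keeps one
instance per configuration in a weak-keyed map rather than storing it inside
the conf itself.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical stand-in for the configuration object (not the real Nutch class).
class Conf {}

// Hypothetical stand-in for LanguageIdentifier. In the real plugin, building
// the n-gram profiles is the expensive part we want to do only once.
class IdentifierStub {
    static int created = 0;            // counts constructions, for illustration

    IdentifierStub(Conf conf) {
        created++;                     // the conf is deliberately not retained
    }
}

// Made-up helper: returns one shared identifier per configuration.
final class IdentifierCache {
    // Weak keys: once a Conf is unreachable, its cached identifier
    // becomes eligible for garbage collection as well.
    private static final Map<Conf, IdentifierStub> CACHE =
            new WeakHashMap<Conf, IdentifierStub>();

    private IdentifierCache() {}

    static synchronized IdentifierStub get(Conf conf) {
        IdentifierStub id = CACHE.get(conf);
        if (id == null) {
            id = new IdentifierStub(conf);
            CACHE.put(conf, id);
        }
        return id;
    }
}
```

With something like this, every iteration of the crawl loop that shares the
same conf would call IdentifierCache.get(conf) and reuse a single instance
instead of allocating a new one per submitted URL.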

