[ 
https://issues.apache.org/jira/browse/NUTCH-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081692#comment-13081692
 ] 

Lewis John McGibbney commented on NUTCH-314:
--------------------------------------------

As language identification is being delegated to Tika as per NUTCH-1075, is it 
fair to say that future releases of Nutch will not be concerned with caching 
NGramEntry instances?

Caching them seems like quite a lot of effort (if it is on the radar at all), 
considering the much more important issues facing Nutch at the moment.

I propose we close this issue as Won't Fix, as it has obviously not gained much 
backing during its 5-year existence. Happy belated 5th birthday, NUTCH-314 ;) 

> Multiple language identifier instances
> --------------------------------------
>
>                 Key: NUTCH-314
>                 URL: https://issues.apache.org/jira/browse/NUTCH-314
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: OS: Linux RHEL 4
> JDK: 1.5_07
>            Reporter: Enrico Triolo
>
> In my application I often need to perform the inject -> generate -> .. -> 
> index loop multiple times, since users can 'suggest' new web pages to be 
> crawled and indexed.
> I also need to enable the language identifier plugin.
> Everything seems to work correctly, but after some time I get an 
> OutOfMemoryError. The elapsed time isn't actually what matters: I noticed 
> that the problem arises when the user submits many URLs (~100). As I said, 
> for each submitted URL a new loop is performed (similar to the one in the 
> Crawl.main method).
> Using a profiler (specifically, the NetBeans profiler) I found out that for 
> each submitted URL a new LanguageIdentifier instance is created and never 
> released. With the memory inspector tool I can see as many instances of 
> LanguageIdentifier and NGramProfile$NGramEntry as the number of fetched 
> pages, each of them occupying about 180 KB. Forcing garbage collection 
> doesn't release much memory.
> Maybe we should cache its instance in the conf, as we do for many other 
> objects in Nutch.
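
For reference, the caching idea suggested in the description could look roughly 
like the sketch below. This is only an illustration under assumptions: Conf and 
LanguageIdentifier here are hypothetical stand-ins, not the real Nutch or Hadoop 
classes, and IdentifierCache is an invented helper name. It keeps one identifier 
per configuration object, so repeated loops that reuse the same configuration 
share a single instance.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical stand-in for a configuration object (not the real Nutch class).
final class Conf {}

// Hypothetical stand-in for an identifier that is expensive to build.
final class LanguageIdentifier {
    static int created = 0;          // counts how many instances were built

    LanguageIdentifier(Conf conf) {
        // Assumption: the n-gram profiles would be loaded here.
        // The conf is deliberately not stored, so the cached value never
        // keeps its own WeakHashMap key alive.
        created++;
    }
}

// Invented helper: one LanguageIdentifier per configuration object.
final class IdentifierCache {
    // Weak keys let an entry be collected once its Conf is unreachable.
    private static final Map<Conf, LanguageIdentifier> CACHE =
            new WeakHashMap<>();

    private IdentifierCache() {}

    static synchronized LanguageIdentifier get(Conf conf) {
        LanguageIdentifier id = CACHE.get(conf);
        if (id == null) {
            id = new LanguageIdentifier(conf);
            CACHE.put(conf, id);
        }
        return id;
    }
}
```

Calling IdentifierCache.get(conf) twice with the same conf returns the same 
object. Note that this only helps if the application reuses one configuration 
across loops; if every loop builds a fresh configuration, each gets its own 
identifier, although the weak keys at least allow old ones to be collected.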

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
