[ 
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503265
 ] 

Enzo Michelangeli commented on NUTCH-356:
-----------------------------------------

I have come to the conclusion that the right thing to do is to override the 
hashCode() method of org.apache.hadoop.conf.Configuration to make it comply 
with the specifications of the Map interface. See:

http://www.nabble.com/Re%3A-Loading-mechanism-of-plugin-classes-and-singleton-objects-tf3893584.html

Enzo
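
A minimal sketch of the idea in the comment above, under stated assumptions:
Conf is a hypothetical stand-in class (it is NOT
org.apache.hadoop.conf.Configuration), and the cache is a plain HashMap keyed
by it. With content-based equals() and hashCode(), two configurations holding
the same properties map to the same cache entry, so re-creating the
configuration does not add a new entry each time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ConfKeyDemo {
    /** Hypothetical stand-in for a configuration class; not Hadoop's Configuration. */
    static final class Conf {
        private final Map<String, String> props = new TreeMap<>();

        Conf set(String key, String value) {
            props.put(key, value);
            return this;
        }

        // Equality and hash code are derived from the properties,
        // delegating to the Map contract for equals()/hashCode().
        @Override
        public boolean equals(Object o) {
            return o instanceof Conf && props.equals(((Conf) o).props);
        }

        @Override
        public int hashCode() {
            return props.hashCode();
        }
    }

    /** Builds n equal-by-content configurations and returns the resulting cache size. */
    static int cacheSizeAfter(int n) {
        Map<Conf, Object> cache = new HashMap<>();
        for (int i = 0; i < n; i++) {
            Conf conf = new Conf().set("plugin.includes", "protocol-http");
            cache.computeIfAbsent(conf, k -> new Object());
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(cacheSizeAfter(1000)); // prints 1
    }
}
```

One caveat with this approach: a key whose hash code depends on mutable state
must not be modified while it sits in a hash-based map, or later lookups can
miss the entry.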


> Plugin repository cache can lead to memory leak
> -----------------------------------------------
>
>                 Key: NUTCH-356
>                 URL: https://issues.apache.org/jira/browse/NUTCH-356
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Enrico Triolo
>         Attachments: NutchTest.java, patch.txt
>
>
> While trying to solve a problem I reported a while ago (see NUTCH-314), 
> I found that the problem was actually related to the plugin cache used in 
> the PluginRepository class.
> As I said in NUTCH-314, I think I somewhat 'force' the way Nutch is meant to 
> work, since I need to frequently submit new URLs and append their contents to 
> the index; I don't (and can't) have a urls.txt file listing all the URLs I'm 
> going to fetch, so I recreate it each time a new URL is submitted.
> Thus, I think in most cases you won't have problems using Nutch 
> as-is, since the problem I found occurs only if Nutch is used in a way 
> similar to the one I use.
> To simplify your testing I'm attaching a class that does something similar 
> to what I need. It fetches and indexes some sample URLs; to avoid webmasters' 
> complaints I left the sample URL list empty, so you should modify the source 
> code and add some URLs.
> Then you only have to run it and watch the memory consumption with top. In 
> my experience I get an OutOfMemoryError after a couple of minutes, but it 
> clearly depends on your heap settings and on the plugins you are using (I'm 
> using 
> 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
> The problem is bound to the PluginRepository 'singleton' instance, since it 
> is never released. It seems that some class maintains a reference to it, and 
> that class is never released because it is cached somewhere in the 
> configuration.
> So I modified PluginRepository's 'get' method so that it never uses the 
> cache and always returns a new instance (see the attached patch). This way 
> the memory consumption stays stable and I no longer get an 
> OutOfMemoryError.
> Clearly this is not a proper solution, since I guess bypassing the cache 
> raises many performance issues, but for the moment it works.
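
The description above does not pin down the exact leak mechanism, so the
following is only a sketch of one plausible reading, not the actual Nutch
code: Conf and Repo are hypothetical stand-ins, and the cache is assumed to be
a map keyed by configuration objects that use the default identity-based
equals()/hashCode(). Under that assumption, every freshly created
configuration adds a new, never-released entry, while bypassing the cache (as
the workaround does) retains nothing.

```java
import java.util.HashMap;
import java.util.Map;

public class RepoCacheDemo {
    /** Hypothetical configuration with default (identity) equals/hashCode. */
    static final class Conf {}

    /** Hypothetical stand-in for a heavyweight plugin repository. */
    static final class Repo {
        final byte[] payload = new byte[64 * 1024];
    }

    /**
     * Simulates `rounds` submissions, each with a freshly created configuration,
     * and returns how many repositories remain referenced by the cache.
     */
    static int entriesAfter(int rounds, boolean useCache) {
        Map<Conf, Repo> cache = new HashMap<>();
        for (int i = 0; i < rounds; i++) {
            Conf conf = new Conf(); // a new configuration per submission
            Repo repo = useCache
                    ? cache.computeIfAbsent(conf, c -> new Repo()) // never hits: keys differ by identity
                    : new Repo();                                  // workaround: always a new instance
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(entriesAfter(50, true));  // prints 50
        System.out.println(entriesAfter(50, false)); // prints 0
    }
}
```

The uncached path avoids the growth at the cost of rebuilding the repository
on every call, which matches the performance concern raised in the
description.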

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
