[ 
https://issues.apache.org/jira/browse/NUTCH-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vyacheslav Pascarel updated NUTCH-2407:
---------------------------------------
    Attachment: second.txt
                started.txt
                first.txt

Here are the jmap outputs:
# After Nutch server was started -> started.txt
# After first execution -> first.txt
# After second execution -> second.txt

NOTE: Custom configurations and job history were cleaned and GC was run after 
each execution and before each jmap snapshot
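
To compare the snapshots, something like the following could be used. This is 
only a sketch: it assumes the attachments are class histograms in the 
`jmap -histo` row format (rank, instance count, byte count, class name), and 
the function names `parse_histo` and `top_growth` are made up for illustration.

```python
import re
from typing import Dict, List, Tuple

# Matches one data row of `jmap -histo` output, for example:
#    1:         52340        4187200  [C
# Header, separator and "Total" lines do not match and are skipped.
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S.*?)\s*$")


def parse_histo(text: str) -> Dict[str, Tuple[int, int]]:
    """Map class name -> (instances, bytes) for every histogram row."""
    rows: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows[match.group(3)] = (int(match.group(1)), int(match.group(2)))
    return rows


def top_growth(before: str, after: str, limit: int = 20) -> List[Tuple[str, int, int]]:
    """Return (class, instance delta, byte delta) for classes whose byte
    count grew between the two snapshots, largest growth first."""
    old, new = parse_histo(before), parse_histo(after)
    deltas = []
    for name, (instances, size) in new.items():
        old_instances, old_size = old.get(name, (0, 0))
        if size > old_size:
            deltas.append((name, instances - old_instances, size - old_size))
    deltas.sort(key=lambda d: d[2], reverse=True)
    return deltas[:limit]
```

Calling `top_growth(open("first.txt").read(), open("second.txt").read())` 
would then list the classes that grew between the first and second 
execution, which is where a per-run leak should show up.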

> Memory leak causing Nutch Server to run out of memory
> -----------------------------------------------------
>
>                 Key: NUTCH-2407
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2407
>             Project: Nutch
>          Issue Type: Bug
>          Components: nutch server
>    Affects Versions: 2.3.1
>         Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>            Reporter: Vyacheslav Pascarel
>         Attachments: first.txt, second.txt, started.txt
>
>
> My application performs continuous crawling using the Nutch REST 
> services. It injects a seed URL and then repeats the 
> GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times (each step 
> in the sequence is executed upon successful completion of the previous step, 
> then the whole sequence is repeated). Here is a brief description of 
> the job:
> * Number of GENERATE/FETCH/PARSE/UPDATEDB cycles per run: 50
> * 'topN' parameter value of GENERATE step in each cycle: 10
> * Seed URL: http://www.cnn.com
> * Regex URL filters for all jobs: 
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> To monitor the Nutch server I use Java VisualVM, which ships with the JDK. 
> After each run (50 cycles of GENERATE/FETCH/PARSE/UPDATEDB) I trigger a 
> garbage collection from VisualVM and check memory usage. My observation is 
> that the Nutch server leaks ~25MB per run.
> NOTES: I added custom HTTP DELETE services to clean the job history in 
> NutchServerPoolExecutor and to remove all custom configurations from 
> RAMConfManager after each run. The observed ~25MB leak is therefore what 
> remains after the job history and configurations have been cleaned up.
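
For anyone trying to reproduce this, the run described above can be sketched 
roughly as follows. The REST call itself is deliberately left out: `run_job` 
is a hypothetical callable that submits one job to the Nutch server and 
returns True once it has finished successfully. The `"url"` argument key and 
the choice to stop at the first failed step are assumptions made for this 
sketch, not details of the reporter's application.

```python
from typing import Callable, Dict

# One crawl cycle, in the order given in the issue description.
CYCLE_STEPS = ("GENERATE", "FETCH", "PARSE", "UPDATEDB")


def run_crawl(run_job: Callable[[str, Dict[str, object]], bool],
              seed_url: str, cycles: int = 50, top_n: int = 10) -> int:
    """Inject the seed, then repeat the cycle `cycles` times.

    Each step starts only after the previous one succeeded. Returns the
    number of fully completed cycles (assumption: abort on first failure).
    """
    # "url" is a placeholder key; the real INJECT arguments may differ.
    if not run_job("INJECT", {"url": seed_url}):
        return 0
    for completed in range(cycles):
        for step in CYCLE_STEPS:
            # Only GENERATE takes topN in the setup described above.
            args: Dict[str, object] = {"topN": top_n} if step == "GENERATE" else {}
            if not run_job(step, args):
                return completed
    return cycles
```

One run in the report corresponds to `run_crawl(run_job, "http://www.cnn.com")` 
with the defaults of 50 cycles and topN 10, followed by the job history and 
configuration cleanup before memory is measured.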



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
