Hany Shehata created NUTCH-2703:

             Summary: Boilerpipe should not run for non-(X)HTML pages
                 Key: NUTCH-2703
                 URL: https://issues.apache.org/jira/browse/NUTCH-2703
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.15
            Reporter: Hany Shehata
             Fix For: 1.16

Boilerpipe also runs for non-(X)HTML pages, which requires more resources.

In my test scenario, the websites contain large PDFs. With Boilerpipe 
enabled, I have to assign 8500 MB of Java heap for the crawl job to finish 
without issues.

Disabling Boilerpipe allows me to reduce the JVM heap to 500 MB with no issues.
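
As a rough illustration of the requested behaviour, the check could look something like the sketch below. This is only an assumption about one possible approach, not the actual Nutch or Tika code: the class BoilerpipeGuard, the method shouldRunBoilerpipe, and the list of MIME types treated as (X)HTML are all hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): decides whether Boilerpipe should run. */
public class BoilerpipeGuard {

  // Assumption: these two MIME types are what the report means by "(X)HTML".
  private static final Set<String> HTML_TYPES =
      new HashSet<>(Arrays.asList("text/html", "application/xhtml+xml"));

  /** Returns true only when the content type denotes an (X)HTML document. */
  public static boolean shouldRunBoilerpipe(String contentType) {
    if (contentType == null) {
      return false; // unknown type: skip the expensive extraction
    }
    // Drop parameters such as "; charset=UTF-8" before comparing.
    int semi = contentType.indexOf(';');
    String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
        .trim()
        .toLowerCase(Locale.ROOT);
    return HTML_TYPES.contains(base);
  }

  public static void main(String[] args) {
    System.out.println(shouldRunBoilerpipe("text/html; charset=UTF-8"));
    System.out.println(shouldRunBoilerpipe("application/pdf"));
  }
}
```

With a guard of this kind, a large PDF would bypass Boilerpipe entirely and only go through the regular parser, which is the behaviour this issue asks for.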

This message was sent by Atlassian JIRA
