Markus Jelsma created NUTCH-3100:
------------------------------------

             Summary: HostDB to support minimum records per host
                 Key: NUTCH-3100
                 URL: https://issues.apache.org/jira/browse/NUTCH-3100
             Project: Nutch
          Issue Type: Improvement
          Components: hostdb
    Affects Versions: 1.20
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.21


One of our crawls contains millions of hosts. Reading a HostDB of this size in 
the Generator consumes a lot of memory. We only use the HostDB to tune the 
Generator for large hosts, so if we limit the records written to the HostDB by 
a minimum number of URLs per host, our HostDB becomes considerably smaller.

 

Adds -urlLimit <N> to the UpdateHostDB tool. Only hosts having at least N 
records will be recorded.
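
The intended effect of the threshold can be illustrated with a minimal, 
standalone sketch. This is not the Nutch implementation: the class 
HostRecordFilter, the method filterByMinRecords, and the host names are 
hypothetical, and per-host record counts are modelled as a plain map rather 
than as HostDB entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Nutch.
public class HostRecordFilter {

    // Keeps only the hosts whose record count is at least urlLimit,
    // mirroring the "at least N records" rule described above.
    public static Map<String, Integer> filterByMinRecords(
            Map<String, Integer> recordsPerHost, int urlLimit) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : recordsPerHost.entrySet()) {
            if (entry.getValue() >= urlLimit) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up counts for two example hosts.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("big.example.org", 25000);
        counts.put("small.example.org", 3);

        // With a limit of 100, only the large host is kept.
        System.out.println(filterByMinRecords(counts, 100).keySet());
    }
}
```

Because small hosts are dropped before they are written, the memory needed to 
read the HostDB later depends on the number of large hosts rather than on the 
total number of hosts in the crawl.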



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
