Markus Jelsma created NUTCH-3100:
------------------------------------

             Summary: HostDB to support minimum records per host
                 Key: NUTCH-3100
                 URL: https://issues.apache.org/jira/browse/NUTCH-3100
             Project: Nutch
          Issue Type: Improvement
          Components: hostdb
    Affects Versions: 1.20
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.21

One of our crawls contains millions of hosts. Reading a HostDB this large in the Generator eats quite a bit of memory. We only use the HostDB to tune the Generator for large hosts, so if we record only hosts that have at least a minimum number of URLs, our HostDB gets considerably smaller.

Adds -urlLimit <N> to the UpdateHostDB tool. Only hosts having at least N records will be recorded.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
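
To illustrate the "at least N records" rule proposed for -urlLimit, here is a minimal standalone sketch in Java. It is not the Nutch implementation and does not use any Nutch classes; the class name HostRecordFilter, the method filterByMinimumRecords, and the in-memory map of per-host URL counts are all hypothetical and exist only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: keeps only hosts whose
// URL count reaches the configured minimum.
final class HostRecordFilter {

    // Returns a new map holding only the hosts with at least urlLimit records.
    static Map<String, Long> filterByMinimumRecords(
            Map<String, Long> urlCountsByHost, long urlLimit) {
        Map<String, Long> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : urlCountsByHost.entrySet()) {
            // "At least N": a host with exactly urlLimit records is kept.
            if (entry.getValue() >= urlLimit) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("big.example.org", 5000L);   // made-up sample data
        counts.put("edge.example.org", 100L);
        counts.put("small.example.org", 3L);

        // With a limit of 100, only the first two hosts are kept.
        System.out.println(filterByMinimumRecords(counts, 100L).keySet());
    }
}
```

With a limit of 100, the small host is dropped while the host with exactly 100 records is kept, which matches the "at least N" wording above.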