Markus Jelsma created NUTCH-3100:
------------------------------------
Summary: HostDB to support minimum records per host
Key: NUTCH-3100
URL: https://issues.apache.org/jira/browse/NUTCH-3100
Project: Nutch
Issue Type: Improvement
Components: hostdb
Affects Versions: 1.20
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.21
One of our crawls contains millions of hosts. Reading a HostDB this large in
the Generator consumes a lot of memory. We only use the HostDB to tune the
Generator for large hosts, so if we limit the HostDB to hosts that have at
least a minimum number of URLs, it becomes considerably smaller.
Adds a -urlLimit <N> option to the UpdateHostDB tool. Only hosts having at
least N records will be recorded.
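
As a rough illustration of the intended behaviour (not the actual Nutch
implementation), the sketch below counts URLs per host and keeps only hosts
with at least urlLimit records. The class HostRecordFilter and the methods
countPerHost and keepLargeHosts are hypothetical names made up for this
example. The real tool works on CrawlDB and HostDB data rather than an
in-memory list.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; none of these names exist in Nutch.
public class HostRecordFilter {

  // Count how many URLs belong to each host.
  static Map<String, Integer> countPerHost(List<String> urls) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String url : urls) {
      String host = URI.create(url).getHost();
      if (host != null) {
        counts.merge(host, 1, Integer::sum);
      }
    }
    return counts;
  }

  // Keep only hosts that have at least urlLimit records (assumed inclusive,
  // following "at least N records" in the issue description).
  static Map<String, Integer> keepLargeHosts(Map<String, Integer> counts, int urlLimit) {
    Map<String, Integer> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      if (entry.getValue() >= urlLimit) {
        kept.put(entry.getKey(), entry.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> urls = List.of(
        "http://a.example/1", "http://a.example/2", "http://a.example/3",
        "http://b.example/1");
    // With a limit of 2, only a.example (3 URLs) would be kept.
    System.out.println(keepLargeHosts(countPerHost(urls), 2));
  }
}
```

With a limit of 2 in this toy input, b.example has a single URL and is
dropped, which is the effect that shrinks the HostDB for crawls dominated by
small hosts.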