[
https://issues.apache.org/jira/browse/NUTCH-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-3100:
---------------------------------
Attachment: NUTCH-3100.patch
> HostDB to support minimum records per host
> ------------------------------------------
>
> Key: NUTCH-3100
> URL: https://issues.apache.org/jira/browse/NUTCH-3100
> Project: Nutch
> Issue Type: Improvement
> Components: hostdb
> Affects Versions: 1.20
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.21
>
> Attachments: NUTCH-3100.patch
>
>
> One of our crawls contains millions of hosts. Reading a HostDB this large in
> the Generator eats quite a bit of memory. We only use the HostDB to tune the
> Generator for large hosts, so if we limit the records written to the HostDB
> by a minimum number of URLs per host, our HostDB gets considerably smaller.
>
> Adds -urlLimit <N> to the UpdateHostDB tool. Only hosts having at least N
> records will be recorded.