[
https://issues.apache.org/jira/browse/NUTCH-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906745#comment-17906745
]
Markus Jelsma commented on NUTCH-3100:
--------------------------------------
Patch for master!
> HostDB to support minimum records per host
> ------------------------------------------
>
> Key: NUTCH-3100
> URL: https://issues.apache.org/jira/browse/NUTCH-3100
> Project: Nutch
> Issue Type: Improvement
> Components: hostdb
> Affects Versions: 1.20
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.21
>
> Attachments: NUTCH-3100.patch
>
>
> One of our crawls contains millions of hosts. Reading a HostDB this large in
> the Generator eats quite a bit of memory. We only tune the Generator using
> HostDB for large hosts, so if we limit the records written to the HostDB by a
> minimum number of URLs per host, our HostDB gets considerably smaller.
>
> Adds -urlLimit <N> to the UpdateHostDB tool. Only hosts having at least N
> records will be recorded.
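
The filtering rule described above (keep a host only if it has at least N records) can be sketched as follows. This is a minimal illustration, not the code from NUTCH-3100.patch: the class name HostRecordFilter, the method filterByMinRecords, and the example host names are all hypothetical, and treating a non-positive limit as "no limit" is an assumption rather than documented behaviour of the tool.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of Nutch: illustrates a "minimum records per host" filter.
public class HostRecordFilter {

  /**
   * Returns only those hosts whose record count is at least urlLimit.
   * Assumption: a non-positive urlLimit means "no limit", so every host is kept.
   */
  public static Map<String, Long> filterByMinRecords(
      Map<String, Long> recordsPerHost, long urlLimit) {
    Map<String, Long> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Long> entry : recordsPerHost.entrySet()) {
      if (urlLimit <= 0 || entry.getValue() >= urlLimit) {
        kept.put(entry.getKey(), entry.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Made-up per-host record counts, for illustration only.
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("big.example.org", 120000L);
    counts.put("medium.example.org", 500L);
    counts.put("tiny.example.org", 3L);

    // With a limit of 500, tiny.example.org is dropped; the other two are kept.
    System.out.println(filterByMinRecords(counts, 500L).keySet());
  }
}
```

In this sketch the comparison is inclusive, matching "at least N records": a host with exactly N records is still kept.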