[
https://issues.apache.org/jira/browse/NUTCH-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17911470#comment-17911470
]
Hudson commented on NUTCH-3100:
-------------------------------
SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #185 (See
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/185/])
NUTCH-3100 HostDB to support minimum records per host (markus:
[https://github.com/apache/nutch/commit/b52ec9025e40152b3a1dae7c78bb803c7ad298ce])
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDb.java
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
> HostDB to support minimum records per host
> ------------------------------------------
>
> Key: NUTCH-3100
> URL: https://issues.apache.org/jira/browse/NUTCH-3100
> Project: Nutch
> Issue Type: Improvement
> Components: hostdb
> Affects Versions: 1.20
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.21
>
> Attachments: NUTCH-3100.patch
>
>
> One of our crawls contains millions of hosts. Reading a HostDB this large in
> the Generator eats quite a bit of memory. We only tune the Generator using
> HostDB for large hosts, so if we limit the records written to the HostDB by a
> minimum number of URLs per host, our HostDB gets considerably smaller.
>
> Adds -urlLimit <N> to the UpdateHostDb tool. Only hosts having at least N
> records will be recorded.
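
The filtering idea described above can be sketched roughly as follows. This is
only an illustration, not the code from the attached patch: the class
HostRecordFilter, the method hostsToRecord and the example URLs are
hypothetical, and the real change lives in UpdateHostDb and
UpdateHostDbReducer.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Nutch.
final class HostRecordFilter {

  // Counts URLs per host, then drops every host that has fewer than
  // urlLimit URLs, so only "large" hosts remain to be recorded.
  static Map<String, Integer> hostsToRecord(List<String> urls, int urlLimit) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String url : urls) {
      String host = URI.create(url).getHost();
      if (host != null) {
        counts.merge(host, 1, Integer::sum);
      }
    }
    // Removing from the values view also removes the matching map entries.
    counts.values().removeIf(n -> n < urlLimit);
    return counts;
  }

  public static void main(String[] args) {
    List<String> urls = List.of(
        "https://big.example/a", "https://big.example/b",
        "https://big.example/c", "https://small.example/a");
    // With a limit of 2, only big.example (3 URLs) is kept.
    System.out.println(hostsToRecord(urls, 2));
  }
}
```

With a limit of 2, small.example (one URL) is dropped and only big.example
would be written, which is how a minimum of N records per host shrinks the
HostDB.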
--
This message was sent by Atlassian Jira
(v8.20.10#820010)