[
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382495#comment-14382495
]
Markus Jelsma commented on NUTCH-1325:
--------------------------------------
Hello Lewis - these are the new parameters:
+ public static final String HOSTDB_NUMERIC_FIELDS = "hostdb.numeric.fields";
+ public static final String HOSTDB_STRING_FIELDS = "hostdb.string.fields";
List the numeric CrawlDatum metadata fields you want statistics on in
hostdb.numeric.fields, and the string fields in hostdb.string.fields. Run
updatehostdb and you will get HostDatum metadata for the selected fields. A
newer version also supports median statistics on numeric fields. I am going to
use these within Memex soon!
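To illustrate the kind of per-host aggregation described above, here is a
minimal sketch in plain Java. It is not the code from the patch: the class
HostFieldStatsSketch, its methods, and the use of Map<String, String> as a
stand-in for CrawlDatum metadata are all hypothetical. The exact statistics
kept for string fields are not stated here, so the value counts below are an
assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Nutch code. Each page of a host is modelled as a
// plain Map<String, String> standing in for its CrawlDatum metadata.
class HostFieldStatsSketch {

    /** Summary of one numeric field across the pages of a single host. */
    static final class NumericStats {
        final int count;
        final double min, max, mean, median;

        /** Expects a non-empty list of values. */
        NumericStats(List<Double> values) {
            List<Double> sorted = new ArrayList<>(values);
            Collections.sort(sorted);
            count = sorted.size();
            min = sorted.get(0);
            max = sorted.get(count - 1);
            double sum = 0;
            for (double v : sorted) {
                sum += v;
            }
            mean = sum / count;
            int mid = count / 2;
            // Odd count: middle element. Even count: mean of the two middle elements.
            median = (count % 2 == 1)
                    ? sorted.get(mid)
                    : (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
        }
    }

    /** Aggregates the listed numeric fields; missing or unparsable values are skipped. */
    static Map<String, NumericStats> numericStats(
            List<Map<String, String>> pages, List<String> fields) {
        Map<String, NumericStats> out = new TreeMap<>();
        for (String field : fields) {
            List<Double> values = new ArrayList<>();
            for (Map<String, String> md : pages) {
                String raw = md.get(field);
                if (raw == null) {
                    continue;
                }
                try {
                    values.add(Double.parseDouble(raw.trim()));
                } catch (NumberFormatException e) {
                    // Not a number: leave this page out of the statistics.
                }
            }
            if (!values.isEmpty()) {
                out.put(field, new NumericStats(values));
            }
        }
        return out;
    }

    /** Counts how often each distinct value occurs for the listed string fields. */
    static Map<String, Map<String, Integer>> stringCounts(
            List<Map<String, String>> pages, List<String> fields) {
        Map<String, Map<String, Integer>> out = new TreeMap<>();
        for (String field : fields) {
            Map<String, Integer> counts = new TreeMap<>();
            for (Map<String, String> md : pages) {
                String value = md.get(field);
                if (value != null) {
                    counts.merge(value, 1, Integer::sum);
                }
            }
            if (!counts.isEmpty()) {
                out.put(field, counts);
            }
        }
        return out;
    }
}
```

In this sketch the two field lists play the role of the values configured in
hostdb.numeric.fields and hostdb.string.fields; the real job would read them
from the Nutch configuration and write the results into the HostDatum.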
I also plan to upgrade dumphostdb so that Nutch can use it to automatically
restrict the crawl by metadata field values, for example crawling only English
pages, or only pages that fall within a numeric threshold, for instance a score
for illegal or abusive content.
> HostDB for Nutch
> ----------------
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1325-1.6-1.patch,
> NUTCH-1325-removed-from-1.8.patch, NUTCH-1325-trunk-v3.patch,
> NUTCH-1325-trunk-v4.patch, NUTCH-1325-trunk-v5.patch, NUTCH-1325-v4-v5.patch,
> NUTCH-1325.trunk.v2.path, oi-hostdb.patch
>
>
> A HostDB for Nutch and associated tools to create and read a database
> containing information on hosts.