[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382495#comment-14382495
 ] 

Markus Jelsma commented on NUTCH-1325:
--------------------------------------

Hello Lewis - these are the new parameters:
+  public static final String HOSTDB_NUMERIC_FIELDS = "hostdb.numeric.fields";
+  public static final String HOSTDB_STRING_FIELDS = "hostdb.string.fields";

List the numeric CrawlDatum metadata fields you want statistics on in 
hostdb.numeric.fields, and the string fields in hostdb.string.fields. Run 
updatehostdb and you will get HostDatum metadata for the selected fields. A 
newer version also supports median statistics on numeric fields. I am going to 
use these within memex soon!
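
To make the idea concrete, here is a rough sketch of the kind of per-host 
aggregation this implies. It is not the code from the patch: the class 
HostFieldStats and its methods are made up for illustration, and it uses plain 
java.util collections instead of Nutch or Hadoop types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Nutch.
class HostFieldStats {

  // Arithmetic mean of the values seen for one numeric field on one host.
  // Assumes the list holds at least one value.
  static double mean(List<Double> values) {
    double sum = 0.0;
    for (double v : values) {
      sum += v;
    }
    return sum / values.size();
  }

  // Median of the values seen for one numeric field on one host.
  // Works on a sorted copy so the caller's list is left untouched.
  // Assumes the list holds at least one value.
  static double median(List<Double> values) {
    List<Double> sorted = new ArrayList<>(values);
    Collections.sort(sorted);
    int n = sorted.size();
    if (n % 2 == 1) {
      return sorted.get(n / 2);
    }
    return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
  }

  // Number of occurrences of each distinct value of one string field.
  static Map<String, Integer> countValues(List<String> values) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String v : values) {
      counts.merge(v, 1, Integer::sum);
    }
    return counts;
  }
}
```

Feeding it the values of one numeric field (say, a hypothetical score) and 
one string field (say, a hypothetical language code) for all pages of a host 
would give the per-host summary described above.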

I also plan an upgrade for dumphostdb so that Nutch can use it to 
automatically restrict the crawl by metadata field values, for example to 
English-only pages, or to pages that fall within a numeric threshold, for 
instance for illegal or abusive content.
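
As a sketch of what such a numeric restriction could look like (purely 
hypothetical, since the dumphostdb upgrade does not exist yet): the class 
HostSelector, its method, and the map layout below are assumptions for 
illustration, not anything in Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of Nutch or the attached patches.
class HostSelector {

  // Returns the hosts whose value for the given numeric field is present
  // and does not exceed maxValue, in the iteration order of the input map.
  // statsByHost maps host name -> (field name -> per-host numeric value).
  static List<String> withinThreshold(
      Map<String, Map<String, Double>> statsByHost,
      String field,
      double maxValue) {
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, Map<String, Double>> e : statsByHost.entrySet()) {
      Double value = e.getValue().get(field);
      if (value != null && value <= maxValue) {
        selected.add(e.getKey());
      }
    }
    return selected;
  }
}
```

Hosts that lack the field are left out here; whether they should instead be 
kept is a design choice the real tool would have to make.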

> HostDB for Nutch
> ----------------
>
>                 Key: NUTCH-1325
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1325
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: NUTCH-1325-1.6-1.patch, 
> NUTCH-1325-removed-from-1.8.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325-trunk-v5.patch, NUTCH-1325-v4-v5.patch, 
> NUTCH-1325.trunk.v2.path, oi-hostdb.patch
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
