[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110729#comment-15110729
 ] 

Markus Jelsma commented on NUTCH-1325:
--------------------------------------

Yes, they are very useful for finding websites that, for example, score 
positively overall on custom text or structure classifiers, e.g. give me all 
hosts that in general talk about music, politics or illicit topics. Also, the 
dumping can generate a wide variety of blacklists, e.g. for not crawling 
(generating) certain hosts, not indexing them, or removing them completely. Of 
course, if you erase hosts from your CrawlDB, you must keep the blacklist 
around, or they will come back at some point :)
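
A rough sketch of both uses with readhostdb's expression filter. The first 
command assumes a hypothetical aggregated classifier score stored as a CrawlDB 
metadata field (called music here, not something the stock HostDB ships with); 
the second uses the built-in dnsFailures counter with an arbitrary threshold:

{code}
# hosts whose (hypothetical) aggregated music-classifier score is high
bin/nutch readhostdb crawl/hostdb music_hosts -dumpHostnames -expr "(avg.music > 0.7)"

# blacklist candidates: hosts with repeated DNS lookup failures
bin/nutch readhostdb crawl/hostdb dns_blacklist -dumpHostnames -expr "(dnsFailures > 3)"
{code}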

> HostDB for Nutch
> ----------------
>
>                 Key: NUTCH-1325
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1325
>             Project: Nutch
>          Issue Type: New Feature
>          Components: hostdb
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-1325-1.6-1.patch, 
> NUTCH-1325-removed-from-1.8.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325-trunk-v5.patch, NUTCH-1325-v4-v5.patch, 
> NUTCH-1325.patch, NUTCH-1325.patch, NUTCH-1325.patch, NUTCH-1325.patch, 
> NUTCH-1325.trunk.v2.path, oi-hostdb.patch, oi-hostdb.patch, oi-hostdb.patch
>
>
> h1. HostDB for Apache Nutch 1.x
> * automatically generates a HostDB based on CrawlDB information
> * periodically performs DNS lookup for all hosts and keeps track of DNS 
> failures
> * discovers homepage if www.example.org/ is a redirect
> * keeps track of host statistics such as number of URLs, 404s, not-modifieds 
> and redirects
> * aggregates CrawlDB metadata fields into totals, sums, min, max, average and 
> configurable percentiles
> * can output lists of discovered homepage URLs for seed lists and a static 
> fetch interval
> * can output blacklists of hosts that have too many DNS failures, so they can 
> be filtered from the CrawlDB using the domainblacklist-urlfilter
> * supports JEXL expressions, just like the CrawlDB
> h4. Examples
> Generate the HostDB for the first time, or update an existing HostDB:
> {code}
> bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb
> {code}
> Optional filtering or normalizing:
> {code}
> bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb -filter -normalize
> {code}
> Dumping as CSV file:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory
> {code}
> Get only hostnames that have an average response time above 50 ms:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(avg._rs_ > 50)"
> {code}
> Get only hosts that have over 50% 404s:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(gone / numRecords > 0.5)"
> {code}
> For JEXL expressions, all host metadata fields are available. The following 
> count fields are also available:
> unfetched -- number of unfetched records
> fetched -- number of fetched records
> gone -- number of 404s
> redirTemp -- number of temporary redirects
> redirPerm -- number of permanent redirects
> redirs -- total number of redirects (redirTemp + redirPerm)
> notModified -- number of not-modified records
> ok -- number of usable pages (fetched + notModified)
> numRecords -- total number of records
> dnsFailures -- number of DNS failures
> Also see nutch-default.xml for the hostdb.* properties.
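> A further sketch, not one of the examples above, combining the count fields 
> in a single expression (the 30% threshold is arbitrary): get hosts where 
> redirects make up more than 30% of all records:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr "(redirs / numRecords > 0.3)"
> {code}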



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
