[
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110708#comment-15110708
]
Hudson commented on NUTCH-1325:
-------------------------------
SUCCESS: Integrated in Nutch-trunk #3339 (See
[https://builds.apache.org/job/Nutch-trunk/3339/])
NUTCH-1325 HostDB for Nutch (markus:
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1725952])
* trunk/CHANGES.txt
* trunk/conf/log4j.properties
* trunk/conf/nutch-default.xml
* trunk/ivy/ivy.xml
* trunk/src/bin/nutch
* trunk/src/java/org/apache/nutch/crawl/NutchWritable.java
* trunk/src/java/org/apache/nutch/hostdb
* trunk/src/java/org/apache/nutch/hostdb/HostDatum.java
* trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java
* trunk/src/java/org/apache/nutch/hostdb/ResolverThread.java
* trunk/src/java/org/apache/nutch/hostdb/UpdateHostDb.java
* trunk/src/java/org/apache/nutch/hostdb/UpdateHostDbMapper.java
* trunk/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
> HostDB for Nutch
> ----------------
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
> Issue Type: New Feature
> Components: hostdb
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-1325-1.6-1.patch,
> NUTCH-1325-removed-from-1.8.patch, NUTCH-1325-trunk-v3.patch,
> NUTCH-1325-trunk-v4.patch, NUTCH-1325-trunk-v5.patch, NUTCH-1325-v4-v5.patch,
> NUTCH-1325.patch, NUTCH-1325.patch, NUTCH-1325.patch, NUTCH-1325.patch,
> NUTCH-1325.trunk.v2.path, oi-hostdb.patch, oi-hostdb.patch, oi-hostdb.patch
>
>
> h1. HostDB for Apache Nutch 1.x
> * automatically generates a HostDB based on CrawlDB information
> * periodically performs DNS lookup for all hosts and keeps track of DNS
> failures
> * discovers homepage if www.example.org/ is a redirect
> * keeps track of host statistics such as the number of URLs, 404s,
> not-modified responses and redirects
> * aggregates CrawlDB metadata fields into totals, sums, min, max, average and
> configurable percentiles
> * can output lists of discovered homepage URLs for seed lists and static
> fetch interval
> * can output blacklists for hosts that have too many DNS failures to filter
> from the CrawlDB using domainblacklist-urlfilter
> * supports JEXL expressions, just like the CrawlDB
> h4. Examples
> Generate for the first time, or update an existing HostDB:
> {code}
> bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb
> {code}
> Optional filtering or normalizing:
> {code}
> bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb -filter
> -normalize
> {code}
> Dumping as CSV file:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory
> {code}
> Get only hostnames that have an average response time above 50 ms:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr
> "(avg._rs_ > 50)"
> {code}
> Get only hosts where over 50% of records are 404s:
> {code}
> bin/nutch readhostdb crawl/hostdb output_directory -dumpHostnames -expr
> "(gone / numRecords > 0.5)"
> {code}
> For JEXL expressions, all host metadata fields are available. The following
> built-in fields are also available:
> unfetched -- number of unfetched records
> fetched -- number of fetched records
> gone -- number of 404s
> redirTemp -- number of temporary redirects
> redirPerm -- number of permanent redirects
> redirs -- total number of redirects (redirTemp + redirPerm)
> notModified -- number of not modified records
> ok -- number of usable pages (fetched + notModified)
> numRecords -- total number of records
> dnsFailures -- number of DNS failures
> Also, see nutch-default.xml for hostdb.* properties.
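
The field definitions quoted above can be illustrated with a small shell sketch. It is not part of the patch: the counter values are hypothetical, and `over_half_gone` is an illustrative helper that mirrors the intent of the quoted `(gone / numRecords > 0.5)` expression with integer arithmetic. It makes no claim about how JEXL itself evaluates that expression.

```shell
#!/bin/sh
# Hypothetical per-host counters; the names mirror the fields in the
# issue description, the values are made up for illustration.
fetched=120
notModified=30
gone=220
redirTemp=10
redirPerm=15
numRecords=400

# Derived fields, as defined in the field list:
#   redirs = redirTemp + redirPerm
#   ok     = fetched + notModified
redirs=$((redirTemp + redirPerm))
ok=$((fetched + notModified))

# Illustrative helper (not a Nutch command): succeeds when more than half
# of the records are gone. gone / numRecords > 0.5 holds exactly when
# 2 * gone > numRecords, which avoids fractional arithmetic in the shell.
over_half_gone() {
  [ $(( $1 * 2 )) -gt "$2" ]
}

echo "redirs=$redirs ok=$ok"
if over_half_gone "$gone" "$numRecords"; then
  echo "host would match the 404 expression"
else
  echo "host would not match the 404 expression"
fi
```

A host with exactly half of its records gone does not match, because the quoted expression uses a strict comparison.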