GitHub user sebastian-nagel opened a pull request:
https://github.com/apache/nutch/pull/141
NUTCH-2300 Fetcher to optionally save robots.txt
If the property fetcher.store.robotstxt is set to true, Fetcher saves the
robots.txt response (URL and Content, including HTTP protocol status and
metadata) in the segment (subfolder content/). It does not add a fetch
datum, so the robots.txt URL cannot slip into CrawlDb or get indexed. The
robots.txt can then be retrieved from the segment, e.g.:
```
# inject http://nutch.apache.org/
# generate
# and fetch with
bin/nutch fetch -Dfetcher.store.robotstxt=true -Dfetcher.store.content=true ...path_to_segment
# dump segment (without -nocontent)
bin/nutch readseg -dump ...path_to_segment ...path_to_dump
cat ...path_to_dump/dump
...
URL:: http://nutch.apache.org/robots.txt
Content::
Version: -1
url: http://nutch.apache.org/robots.txt
base: http://nutch.apache.org/robots.txt
contentType: text/html
metadata: nutch.fetch.time=1471612087645 Server=Apache/2.4.7 (Ubuntu) Connection=close Content-Length=208 Date=Fri, 19 Aug 2016 13:08:07 GMT Content-Type=text/html; charset=iso-8859-1
Content:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /robots.txt was not found on this server.</p>
</body></html>
...
```
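As a rough sketch (not part of this patch), the robots.txt URLs stored in a
segment could be listed from the dump with standard tools. It assumes every
record in the dump contains a line starting with `URL:: `, as in the excerpt
above; the function name `list_robots_urls` is made up for illustration.
```shell
# List the distinct robots.txt URLs found in a readseg dump file.
# Assumption: each record has a line of the form "URL:: <url>".
list_robots_urls() {
  grep '^URL:: ' "$1" \
    | sed 's/^URL:: //' \
    | grep '/robots\.txt$' \
    | sort -u
}
```
It would be called as `list_robots_urls ...path_to_dump/dump`. URLs that do
not end in /robots.txt are skipped.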
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sebastian-nagel/nutch SaveRobotsTxt
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nutch/pull/141.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #141
----
commit 6c9cca5e55e43458cbc5e59b8591e4d27ac425a2
Author: Sebastian Nagel <[email protected]>
Date: 2016-05-25T12:24:11Z
Allow Fetcher to optionally store robots.txt content (if property
fetcher.store.robotstxt == true).
Improved RobotRulesParser command-line tool.
commit 264eea01a4d868578dcf641d6ce405444d276929
Author: Sebastian Nagel <[email protected]>
Date: 2016-08-19T13:06:14Z
Ignore robots.txt when parsing segment, refactored storing of robots.txt in
FetcherThread
commit 33cdca76ac91a63445d4e761081e8124a23413af
Author: Sebastian Nagel <[email protected]>
Date: 2016-08-19T13:32:34Z
add hint and log warning that fetcher.store.robotstxt works only in
combination with fetcher.store.content
----