(nutch-site) 01/04: Link to RFC 9309 which Nutch (relying on crawler-commons) is following as robots.txt standard since Nutch 1.19

snagel Sun, 20 Jul 2025 11:42:35 -0700

This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git


commit 607dc40083c741aa42086ae8e77d46b7f864b5cf
Author: Sebastian Nagel <[email protected]>
AuthorDate: Sun Jul 20 19:40:47 2025 +0200

    Link to RFC 9309 which Nutch (relying on crawler-commons)
    is following as robots.txt standard since Nutch 1.19
---
 content/community/bot.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/community/bot.md b/content/community/bot.md
index 0c33b84..fff55e5 100644
--- a/content/community/bot.md
+++ b/content/community/bot.md
@@ -13,7 +13,7 @@ If you're reading this, chances are you've seen a Nutch-based robot visiting you
 # Sysadmins/robots.txt
 We're a software project, not a service, so please understand that a misbehaving crawler appearing with our Agent string is not run by us. Our software may be run by anyone. However, we'd still like to hear about any bad behavior. If possible, please include the name of the domain and some representative log entries. We can be reached at `dev[at]nutch[dot]apache[dot]org`
 
-Our software obeys the <a href="http://www.robotstxt.org/robotstxt.html" target="_blank">robots.txt exclusion standard</a>. Different Nutch deployments may specify different agent names, but all should respond to the agent name "Nutch". Thus to ban all Nutch-based crawlers from your site, place the following in your robots.txt file:
+Our software obeys the <a href="https://en.wikipedia.org/wiki/Robots_exclusion_standard" target="_blank">robots.txt exclusion standard</a> as specified in <a href="https://datatracker.ietf.org/doc/html/rfc9309">RFC 9309</a>. Different Nutch deployments may specify different agent names, but all should respond to the agent name "Nutch". Thus to ban all Nutch-based crawlers from your site, place the following in your robots.txt file:
 
 ```
 User-agent: Nutch
@@ -21,4 +21,4 @@ Disallow: /
 ```
 
 # Webmasters/Robots META
- If you do not have permission to edit the `/robots.txt` file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the <a href="http://www.robotstxt.org/meta.html" target="_blank">robots META tag</a>.
\ No newline at end of file
+ If you do not have permission to edit the `/robots.txt` file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the <a href="https://www.robotstxt.org/meta.html" target="_blank">robots META tag</a>.
\ No newline at end of file
