This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nutch-site.git
commit 607dc40083c741aa42086ae8e77d46b7f864b5cf
Author: Sebastian Nagel <[email protected]>
AuthorDate: Sun Jul 20 19:40:47 2025 +0200

    Link to RFC 9309 which Nutch (relying on crawler-commons) is following as robots.txt standard since Nutch 1.19
---
 content/community/bot.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/community/bot.md b/content/community/bot.md
index 0c33b84..fff55e5 100644
--- a/content/community/bot.md
+++ b/content/community/bot.md
@@ -13,7 +13,7 @@ If you're reading this, chances are you've seen a Nutch-based robot visiting you
 # Sysadmins/robots.txt
 We're a software project, not a service, so please understand that a misbehaving crawler appearing with our Agent string is not run by us. Our software may be run by anyone. However, we'd still like to hear about any bad behavior. If possible, please include the name of the domain and some representative log entries. We can be reached at `dev[at]nutch[dot]apache[dot]org`
 
-Our software obeys the <a href="http://www.robotstxt.org/robotstxt.html" target="_blank">robots.txt exclusion standard</a>. Different Nutch deployments may specify different agent names, but all should respond to the agent name "Nutch". Thus to ban all Nutch-based crawlers from your site, place the following in your robots.txt file:
+Our software obeys the <a href="https://en.wikipedia.org/wiki/Robots_exclusion_standard" target="_blank">robots.txt exclusion standard</a> as specified in <a href="https://datatracker.ietf.org/doc/html/rfc9309">RFC 9309</a>. Different Nutch deployments may specify different agent names, but all should respond to the agent name "Nutch". Thus to ban all Nutch-based crawlers from your site, place the following in your robots.txt file:
 
 ```
 User-agent: Nutch
@@ -21,4 +21,4 @@ Disallow: /
 ```
 
 # Webmasters/Robots META
- If you do not have permission to edit the `/robots.txt` file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the <a href="http://www.robotstxt.org/meta.html" target="_blank">robots META tag</a>.
\ No newline at end of file
+ If you do not have permission to edit the `/robots.txt` file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the <a href="https://www.robotstxt.org/meta.html" target="_blank">robots META tag</a>.
\ No newline at end of file
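
The `User-agent: Nutch` / `Disallow: /` rule quoted in the diff can be sanity-checked locally before deploying it. The sketch below is only an illustration: it uses Python's standard-library `urllib.robotparser`, not the crawler-commons parser that Nutch relies on, so its matching may differ in detail from RFC 9309. The helper `is_allowed` and the `example.com` URL are hypothetical names introduced for this example.

```python
from urllib.robotparser import RobotFileParser

# The rule set from the documentation: ban every crawler that
# identifies itself with the agent name "Nutch".
ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper; it parses the rules from a string instead of
    fetching them over the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host, not taken from the commit.
    print(is_allowed(ROBOTS_TXT, "Nutch", "https://example.com/page"))     # False
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/page"))  # True
```

Agents with no matching group and no `*` group are treated as allowed by this parser, which is why the second call is permitted. How a specific Nutch deployment matches a custom agent name is decided by its own configuration and by crawler-commons, not by this sketch.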
