Hi,

You're not the only one who has suffered this from Google; Baidu is worse, and some others as well. Here are some telegram-style answers to your points...
Yes, I have also suffered a lot from crawlers, and I have spent many hours trying to adjust firewalls, robots.txt...

What version of Koha are you using? Modern ones have a koha-sitemap command (if I am not wrong).

Google Webmaster Tools warns you that changes do not take immediate effect; you should wait a little longer...

In summary, you have done all the expected work; now it is just time to adjust it and wait for the results.

With the combination of robots.txt, koha-sitemap and a firewall I have been happy for a long time... but you're never safe from this :( I am sorry.

2017-05-03 16:14 GMT+02:00 Michael Kuhn <m...@adminkuhn.ch>:

> Hi Mark and Hugo
>
> Many thanks for your hints! I have now done the following.
>
> 1. I created a file "/usr/share/koha/opac/htdocs/robots.txt" containing
> this:
>
> Sitemap: sitemapindex.xml
> User-agent: *
> Disallow: /cgi-bin/
>
> 2. I generated a Koha sitemap using the seemingly undocumented Perl script
> "sitemap.pl" (according to
> https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190), which
> created the file "/usr/share/koha/opac/htdocs/sitemapindex.xml" and the
> file "/usr/share/koha/opac/htdocs/sitemap0001.xml" containing the URLs.
>
> 3. Even after a complete reboot of the host the "opac-search.pl"
> processes were still there, appearing immediately after the reboot!
>
> 4. I went to Google Webmaster Tools, where I downloaded the HTML
> confirmation file "googleb56bd3db2af352b1.html" and placed it in
> "/usr/share/koha/opac/htdocs" as well. I also followed the steps given on
> the Webmaster Tools page, i.e. I called the URL and I confirmed the
> download.
>
> 5. Even after a complete reboot of the host the "opac-search.pl"
> processes were still there, appearing immediately after the reboot!
>
> 6. I then installed the Uncomplicated Firewall / UFW, where I applied the
> following rules and enabled it:
>
> # ufw status
> Status: active
>
> To                         Action      From
> --                         ------      ----
> 22/tcp                     ALLOW       Anywhere
> 80/tcp                     ALLOW       Anywhere
> 8080/tcp                   ALLOW       Anywhere
> Anywhere                   DENY        66.249.64.32
>
> However this is possible, Googlebot is still crawling and eating CPU!
> This can be seen in the log file "plack.log", where hundreds and thousands
> of lines like the following appear:
>
> 66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
>
> And I also found another bot:
>
> 62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET /opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-" "Linguee Bot (http://www.linguee.com/bot; b...@linguee.com)"
>
> Now what I don't understand is how Googlebot (66.249.64.32) can access
> the webserver even though it is blocked by UFW?!
>
> 7. Already quite desperate, I finally executed the following lines to drop
> all packets from 66.249.64.32 and 62.138.14.218:
>
> # iptables -I INPUT -s 66.249.64.32 -j DROP
> # iptables -I INPUT -s 62.138.14.218 -j DROP
>
> And yes - this actually stopped these harassing bots.
>
> But of course, next was this:
>
> 66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
>
> I also dropped this IP address and now - finally! - the OPAC search for
> the normal user works as fast as expected.
>
> In fact I can't believe I should be the only one experiencing this
> behavior (especially since the stuff about "sitemap.pl" is quite hidden
> and undocumented in the Koha manual).
>
> The other thing is that people usually say it's a good thing to be indexed
> by Google. Today, however, I don't agree.
> Maybe tomorrow I will try to delete the rule that drops the Google
> packets, and I really hope Google will then do what it is told to do in
> "robots.txt", using the Koha sitemap.
>
> So all this just for the record - maybe it will help someone in the future.
>
> Best wishes: Michael
> --
> Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
> Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
> T 0041 (0)61 261 55 61 · E m...@adminkuhn.ch · W www.adminkuhn.ch

--
Hugo Agud - Orex Digital
Director
Calle Sant Joaquin, 117, 2º-3ª · 08922 Santa Coloma de Gramanet
Tel: 933 856 138 · ha...@orex.es · http://www.orex.es/

_______________________________________________
Koha mailing list http://koha-community.org
Koha@lists.katipo.co.nz
https://lists.katipo.co.nz/mailman/listinfo/koha
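
For anyone finding this thread later: before dropping addresses one by one, it can help to see which clients are actually hammering the OPAC. Below is a minimal Python sketch that counts requests per IP address and user agent. It assumes the lines in "plack.log" look like the ones quoted above; the names LOG_LINE, count_clients and top_clients are made up for this illustration and are not part of Koha.

```python
import re
from collections import Counter

# Assumed line layout, taken from the log excerpts quoted in this thread:
# IP - - [timestamp] "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_clients(lines):
    """Count requests per (IP address, user agent) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:  # lines in any other format are skipped
            counts[(match["ip"], match["agent"])] += 1
    return counts


def top_clients(path, limit=10):
    """Return the busiest clients found in the log file at `path`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_clients(handle).most_common(limit)
```

Calling top_clients() with the full path of your own "plack.log" (the location depends on the installation) returns the busiest (IP, user agent) pairs with their request counts, which gives a better basis for firewall rules than picking single addresses out of the log by hand.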