Re: [Koha] Koha slowed down by Google indexing?!

Michael Kuhn Wed, 03 May 2017 07:16:22 -0700

Hi Mark and Hugo

Many thanks for your hints! I have now done the following.

1. I created a file "/usr/share/koha/opac/htdocs/robots.txt" containing this:


 Sitemap: sitemapindex.xml
 User-agent: *
 Disallow: /cgi-bin/

2. I generated a Koha sitemap using the seemingly undocumented Perl script "sitemap.pl" (see https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190), which created the files "/usr/share/koha/opac/htdocs/sitemapindex.xml" and "/usr/share/koha/opac/htdocs/sitemap0001.xml" containing the URLs.

3. Even after a complete reboot of the host the "opac-search.pl" processes were still there, appearing immediately after the reboot!

4. I went to Google Webmaster Tools, where I downloaded the HTML confirmation file "googleb56bd3db2af352b1.html" and placed it in "/usr/share/koha/opac/htdocs" as well. I also followed the steps given on the Webmaster Tools page, i.e. I called the URL and confirmed the download.

5. Even after a complete reboot of the host the "opac-search.pl" processes were still there, appearing immediately after the reboot!

6. I then installed the Uncomplicated Firewall (UFW), applied the following rules and enabled it:


 # ufw status
 Status: active

 To                         Action      From
 --                         ------      ----
 22/tcp                     ALLOW       Anywhere
 80/tcp                     ALLOW       Anywhere
 8080/tcp                   ALLOW       Anywhere
 Anywhere                   DENY        66.249.64.32

But, however this is possible, Googlebot is still crawling and eating CPU! This can be seen in the log file "plack.log", which contains hundreds or even thousands of lines like the following:

66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


And I also found another bot:

62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET /opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-" "Linguee Bot (http://www.linguee.com/bot; b...@linguee.com)"

Now what I don't understand is how Googlebot (66.249.64.32) can access the webserver even though it is blocked by UFW?!
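
One likely explanation, going by how UFW is documented to work: rules are evaluated in order and the first match wins, so the "80/tcp ALLOW Anywhere" rule accepts Googlebot's requests before the DENY rule at the bottom of the list is ever consulted. A minimal sketch of moving the deny rule to the top follows; the function name "deny_first" is made up for illustration, and this was not tested on the host described here.

```shell
# Move a per-address deny rule to position 1 so that it is checked
# before the port-based ALLOW rules. Needs to be run as root.
deny_first() {
    ip="$1"
    # Remove the existing rule at the end of the list (if there is one) ...
    ufw delete deny from "$ip"
    # ... and add it again as the very first rule.
    ufw insert 1 deny from "$ip"
}

# Usage:
#   deny_first 66.249.64.32
#   ufw status numbered    # lists the rules with their positions
```

Connections that were already established when the rule was added may still be served until they are closed.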

9. Already quite desperate, I finally executed the following lines to drop all packets from 66.249.64.32 and 62.138.14.218.


 # iptables -I INPUT -s 66.249.64.32 -j DROP
 # iptables -I INPUT -s 62.138.14.218 -j DROP

And yes - this actually stopped these harassing bots.

But of course, next was this:

66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

I also dropped this IP address and now - finally! - the OPAC search for the normal user works as fast as expected.
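
Since new crawler addresses keep showing up, a per-address tally of the log helps to see which clients cause most of the requests. This is only a rough sketch, assuming the client address is the first field of each "plack.log" line as in the excerpts above; the function name "top_clients" is made up for illustration.

```shell
# Print the busiest client addresses in an access log, most active first.
# $1: path to the log file, $2: number of addresses to show (default 10)
top_clients() {
    awk '{ print $1 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Usage:
#   top_clients plack.log 5
```

Each output line shows the number of requests followed by the address.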

In fact I can't believe I am the only one experiencing this behavior (especially since the information about generating a sitemap with "sitemap.pl" is quite hidden and apparently undocumented in the Koha manual).

The other thing is that people usually say it's a good thing to be indexed by Google. Today, however, I don't agree. Maybe tomorrow I will try deleting the rules that drop the Google packets, and I really hope Google will then do what it is told in "robots.txt", using the Koha sitemap.
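
One detail that may matter here: the sitemap protocol expects the "Sitemap:" line in "robots.txt" to carry a full URL rather than a bare file name, and crawlers usually cache "robots.txt" for a while, so changes may not take effect at once. A small sketch of writing such a file follows; the function name "write_robots" and the host name "opac.example.org" are placeholders, not the real OPAC address.

```shell
# Write a robots.txt whose Sitemap line is an absolute URL.
# $1: directory to write into, $2: public base URL of the OPAC (placeholder)
write_robots() {
    cat > "$1/robots.txt" <<EOF
Sitemap: $2/sitemapindex.xml
User-agent: *
Disallow: /cgi-bin/
EOF
}

# Usage (hypothetical host name):
#   write_robots /usr/share/koha/opac/htdocs https://opac.example.org
```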


So all this just for the record - maybe it will help someone in the future.

Best wishes: Michael
--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E m...@adminkuhn.ch · W www.adminkuhn.ch
_______________________________________________
Koha mailing list  http://koha-community.org
Koha@lists.katipo.co.nz
https://lists.katipo.co.nz/mailman/listinfo/koha
