Hi

You are not the only one who has suffered this from Google, but Baidu is
worse, and so are some others. Brief answers to your points:

Yes, I have also suffered a lot from crawlers, and I have spent many
hours trying to adjust firewalls, robots.txt and so on.
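
To see which crawlers are hitting the OPAC hardest, a rough sketch like the
following could tally requests per client. It assumes the log lines look like
the ones quoted further down in this thread; count_clients, LOG_LINE and the
sample list are made-up names for illustration, not part of Koha.

```python
import re
from collections import Counter

# Log format assumed from the plack.log lines quoted in this thread.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '   # client IP, ident, user, timestamp
    r'"[^"]*" \d{3} \S+ '                  # request line, status, size
    r'"[^"]*" "(?P<agent>[^"]*)"'          # referer, user agent
)

def count_clients(lines):
    """Count requests per (IP, user agent) pair; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts

GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Two of the lines quoted in this thread, plus one line that does not parse.
sample = [
    '66.249.64.32 - - [03/May/2017:15:48:28 +0200] '
    '"GET /opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" '
    f'200 17703 "-" "{GOOGLEBOT}"',
    '66.249.64.35 - - [03/May/2017:15:59:21 +0200] '
    '"GET /opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" '
    f'200 17661 "-" "{GOOGLEBOT}"',
    'not a log line',
]

counts = count_clients(sample)
for (ip, agent), hits in counts.most_common(10):
    print(hits, ip, agent)
```

For a real log, passing an open file works the same way, for example
count_clients(open(path)) with path pointing at your own plack.log.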

Which version of Koha are you using? Modern ones have a command, koha-sitemap
(if I am not wrong).

Google Webmaster Tools warns you that changes do not take effect immediately;
you should wait a little longer.


In summary, you have done all the expected work; now it is just time to
adjust it and wait for the results.
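
While waiting, it may be worth checking that the robots.txt rules say what
you expect. A minimal sketch using Python's standard urllib.robotparser,
with the two rules from your message; is_allowed is a made-up helper name,
and the /cgi-bin/koha/ path is my assumption about the public OPAC URL.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in this thread. The Sitemap line is left out because it
# does not affect allow/deny decisions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

def is_allowed(robots_txt, user_agent, path):
    """Return True if the given rules permit user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# Assumed public search URL: should be disallowed by the /cgi-bin/ rule.
print(is_allowed(ROBOTS_TXT, "Googlebot", "/cgi-bin/koha/opac-search.pl?q=test"))
# The sitemap index sits outside /cgi-bin/ and should stay fetchable.
print(is_allowed(ROBOTS_TXT, "Googlebot", "/sitemapindex.xml"))
```

This only tells you what a well-behaved crawler is supposed to do; it does
not prove that a given bot actually follows the rules.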

With the combination of robots.txt, koha-sitemap and a firewall I have been
happy for a long time, but you are never completely safe from this.
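
One reason it never ends is that blocking single addresses only lasts until
the bot comes back from a neighbouring one, as you saw with .32 and .35. A
small sketch of matching by range instead, using Python's standard ipaddress
module. The /27 below is chosen only because it covers both addresses from
your log; it is not an official Google range, and is_blocked is a made-up
name.

```python
import ipaddress

# Illustrative range only: picked because it contains 66.249.64.32 and
# 66.249.64.35 from this thread. It is NOT a published Googlebot range.
BLOCKED_NETWORKS = [ipaddress.ip_network("66.249.64.32/27")]
# Single host taken from the Linguee bot line in this thread.
BLOCKED_HOSTS = [ipaddress.ip_address("62.138.14.218")]

def is_blocked(address):
    """Return True if address is a blocked host or inside a blocked network."""
    ip = ipaddress.ip_address(address)
    return ip in BLOCKED_HOSTS or any(ip in net for net in BLOCKED_NETWORKS)

for candidate in ("66.249.64.32", "66.249.64.35", "62.138.14.218", "192.0.2.1"):
    print(candidate, is_blocked(candidate))
```

iptables accepts the same address/mask notation for -s, so a range that you
have verified yourself can replace a growing list of single-address rules.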

:( I am sorry..

2017-05-03 16:14 GMT+02:00 Michael Kuhn <m...@adminkuhn.ch>:

> Hi Mark and Hugo
>
> Many thanks for your hints! I have now done the following.
>
> 1. I created a file "/usr/share/koha/opac/htdocs/robots.txt" containing
> this:
>
>  Sitemap: sitemapindex.xml
>  User-agent: *
>  Disallow: /cgi-bin/
>
> 2. I generated a Koha sitemap using the seemingly undocumented Perl script
> "sitemap.pl" (according to
> https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190), which
> created the file "/usr/share/koha/opac/htdocs/sitemapindex.xml" and the file
> "/usr/share/koha/opac/htdocs/sitemap0001.xml" containing the URLs.
>
> 3. Even after a complete reboot of the host the "opac-search.pl"
> processes were still there, appearing immediately after the reboot!
>
> 4. I went to Google Webmaster Tools where I downloaded the HTML
> confirmation file "googleb56bd3db2af352b1.html" and placed it in
> "/usr/share/koha/opac/htdocs" as well. I also followed the steps given on
> the Webmaster Tools page, i.e. I called the URL and confirmed the
> download.
>
> 5. Even after a complete reboot of the host the "opac-search.pl"
> processes were still there, appearing immediately after the reboot!
>
> 6. I then installed the Uncomplicated Firewall / UFW where I applied the
> following rules and enabled it:
>
>  # ufw status
>  Status: active
>
>  To                         Action      From
>  --                         ------      ----
>  22/tcp                     ALLOW       Anywhere
>  80/tcp                     ALLOW       Anywhere
>  8080/tcp                   ALLOW       Anywhere
>  Anywhere                   DENY        66.249.64.32
>
> Yet somehow Googlebot is still crawling and eating CPU! This can be seen
> in the log file "plack.log", which contains thousands of lines like the
> following:
>
>  66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
>
> And I also found another bot:
>
>  62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET /opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-" "Linguee Bot (http://www.linguee.com/bot; b...@linguee.com)"
>
> Now what I don't understand is how Googlebot (66.249.64.32) can access
> the webserver even though it is blocked by UFW?!
>
> 7. Already quite desperate, I finally executed the following lines to drop
> all packets from 66.249.64.32 and 62.138.14.218:
>
>  # iptables -I INPUT -s 66.249.64.32 -j DROP
>  # iptables -I INPUT -s 62.138.14.218 -j DROP
>
> And yes - this actually stopped these harassing bots.
>
> But of course, next was this:
>
>  66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
>
> I also dropped this IP address and now - finally! - the OPAC search for
> the normal user works as fast as expected.
>
> In fact I can't believe I should be the only one experiencing this
> behavior (especially since the information about "sitemap.pl" is quite
> hidden and seemingly undocumented in the Koha manual).
>
> The other thing is people usually say it's a good thing to be indexed by
> Google. Today, however, I disagree. Maybe tomorrow I will try to
> delete the rule which drops the Google packets, and I really hope Google
> will then do what it is told to do in "robots.txt", using the Koha sitemap.
>
> So all this just for the record - maybe it will help someone in the future.
>
> Best wishes: Michael
> --
> Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
> Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
> T 0041 (0)61 261 55 61 · E m...@adminkuhn.ch · W www.adminkuhn.ch
>



-- 

*Hugo Agud - Orex Digital *

*www.orex.es <http://www.orex.es>*



Director

Calle Sant Joaquin,117, 2º-3ª · 08922 Santa Coloma de Gramanet - Tel: 933
856 138   ha...@orex.es · http://www.orex.es/



_______________________________________________
Koha mailing list  http://koha-community.org
Koha@lists.katipo.co.nz
https://lists.katipo.co.nz/mailman/listinfo/koha