https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=33317

M <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #27 from M <[email protected]> ---
Wait, so this simply adds the same robots meta tag to the entirety of the OPAC?
I was redirected here from Bug 35812 due to conflicts, and I think this bug
right here is questionable; I'm not sure it should be merged, at least as-is.

> Websites must have a robots meta tag

That is not true: the tag is entirely optional and meant for granular,
page-level steering of crawling bots. The way this preference is implemented,
the rules apply to ALL OPAC pages, and I'm not sure there is any reasonable
use case for that. The author gives the example "noindex,nofollow" to prevent
ALL OPAC pages from being indexed. I think that if a library wants that, it
would be better off using the more widely used and known robots.txt file,
which is more likely to be supported by various crawlers and will prevent them
from downloading the pages in the first place (instead of downloading the
pages they wanted and then discarding each one upon discovering its meta tag).

I think the better direction is to specify manually which pages should be
crawlable by default and which shouldn't, as in Bug 35812. I.e. search results
and other dynamic pages shouldn't be indexed, to decrease the amount of junk
(though they should still be crawled so links can be extracted from them),
while the main page, info subpages, user-created lists and biblio records
should probably be indexed, which is likely what most libraries would want by
default.

With that said, there's currently no "obvious"/"easy" way of specifying a
custom robots.txt file, apart from doing something like `Alias /robots.txt
/var/www/html/robots.txt` in the Apache config for the OPAC (which works well
enough, by the way).
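For reference, the workaround I mean is just a plain alias in the OPAC virtual
host; a minimal sketch, assuming the file lives at /var/www/html/robots.txt
(the path is illustrative, adjust it to your setup):

```apache
# Serve a static robots.txt for the OPAC instead of letting Koha handle the URL.
# The target path is an example; point it wherever your robots.txt actually lives.
Alias /robots.txt /var/www/html/robots.txt

<Directory /var/www/html>
    Require all granted
</Directory>
```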

So, in the spirit of what the OP originally wanted, I believe it would be
better to instead add a system preference with a textarea for the robots.txt
file contents, in place of the site-wide robots meta tag contents. This would
allow libraries to set more granular rules, while someone who wants to block
everything could still just do:

User-agent: *
Disallow: /
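As a sanity check (a minimal sketch using Python's stdlib robots.txt parser;
the bot name and URLs are made up for illustration), those two lines really do
deny every path to every user agent:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# parse() accepts an iterable of lines, so we can feed the rules in directly
rp.parse(["User-agent: *", "Disallow: /"])

# Every path is blocked for every bot under this policy
print(rp.can_fetch("SomeBot", "https://opac.example.org/"))  # False
print(rp.can_fetch("SomeBot",
                   "https://opac.example.org/cgi-bin/koha/opac-search.pl"))  # False
```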

Btw, this is already documented in the README.robots file in Koha's main git
directory (last edited 13 years ago; the last paragraph there is probably
outdated).

So I believe my idea above could resolve the conflict between our two patches:
robots.txt usage is more widely documented on the Internet, and it would
override any rules that Koha devs specify manually on a per-template basis via
robots meta tags on pages, as I did in Bug 35812...
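As a concrete sketch of the granular rules such a textarea would enable, a
library could keep search results out of crawlers' reach while leaving
everything else fetchable (opac-search.pl is a real Koha OPAC script name, but
the policy itself is only an illustration; note that `Allow` is a widely
supported extension rather than part of the original robots.txt convention):

```
User-agent: *
Disallow: /cgi-bin/koha/opac-search.pl
Allow: /
```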

-- 
You are receiving this mail because:
You are watching all bug changes.
_______________________________________________
Koha-bugs mailing list
[email protected]
https://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-bugs
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/
