So lately Google has been pinging the WMF about the lack of sitemaps on Commons. If you don't know what those are, sitemaps are a way of telling search engines about all the URLs that are hosted on your site, so they can find them more easily, or more quickly.[1]
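To make that concrete, here's a minimal sketch of what a sitemap actually is: an XML file in the http://www.sitemaps.org/schemas/sitemap/0.9 namespace listing URLs and their last-modified dates. The URL below is just a placeholder, not a real Commons page, and this is only a toy generator, not the actual maintenance script:

```python
# Minimal sitemap generator sketch. Builds the XML described by the
# Sitemap protocol (see sitemaps.org); the example URL is a placeholder.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap XML string for a list of (loc, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap([("http://example.org/wiki/File:Example.jpg", "2011-01-01")])
print(xml)
```

A real deployment also needs a sitemap index file once you have more than 50,000 URLs, which is why the maintenance script writes a number of sitemap files rather than one.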
I investigated this issue and found that we do have a sitemaps script in maintenance/, but it hasn't been enabled on the Wikipedias since 2007-12-27. In the meantime it was discovered that Google had written a custom crawling bot for Recent Changes, so it was never re-enabled for them.

As for Commons: we don't have a sitemap either, and from a cursory examination of Google Image Search I don't think they are crawling our Recent Changes. Even if they were, there's more to life than Google -- we also want to be in other search engines, tools like TinEye, etc. So it would be good to have this back again.

a) Any objections, volunteers, whatever, for re-enabling the sitemaps script on Commons? This probably just means adding it back into the daily cron.

b) Anyone want to work on making it more efficient and/or better? Google has introduced some nifty extensions to the Sitemap protocol, including geocoding and (especially dear to our hearts) licensing![2] However, we don't have such information easily available in the database, so using it would require parsing every File page, which would take several millennia. That will not work at all with the current sitemaps script, which scans the entire database every time and regenerates a number of sitemap files from scratch. What we need is something more iterative, that only scans recent stuff. (Or, using such extensions will have to wait until someone brings licensing into the database.)

[1] http://sitemaps.org/
[2] http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=178636

-- 
Neil Kandalgaonkar <[email protected]>

_______________________________________________
Commons-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/commons-l
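As a rough illustration of the "more iterative" idea: instead of scanning the whole page table each run, remember when we last ran and only look at pages touched since then. The table and field names (recentchanges, rc_title, rc_timestamp) follow the real MediaWiki schema, but everything else here -- the sqlite stand-in for the production database, the state dict, the function names -- is invented for the sketch:

```python
# Hypothetical incremental sitemap updater: query only pages changed since
# the last run, rather than regenerating everything from scratch.
# Uses an in-memory sqlite database as a stand-in for the real wiki DB.
import sqlite3
import time

def pages_changed_since(db, since_ts):
    """Return distinct titles of pages changed after the given MW timestamp."""
    cur = db.execute(
        "SELECT DISTINCT rc_title FROM recentchanges WHERE rc_timestamp > ?",
        (since_ts,))
    return [row[0] for row in cur]

def update_sitemap(db, state):
    """Collect newly changed pages and advance the last-run timestamp."""
    changed = pages_changed_since(db, state["last_run"])
    state["last_run"] = time.strftime("%Y%m%d%H%M%S", time.gmtime())
    return changed  # a real script would write these URLs into sitemap files

# tiny demonstration
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE recentchanges (rc_title TEXT, rc_timestamp TEXT)")
db.execute("INSERT INTO recentchanges VALUES ('File:New.jpg', '20110102000000')")
print(update_sitemap(db, {"last_run": "20110101000000"}))  # → ['File:New.jpg']
```

The catch, of course, is that recentchanges only covers a limited window, so an incremental script still needs a periodic full pass (or a persistent record of which sitemap chunk each page lives in) to stay consistent.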
