Neil Kandalgaonkar wrote:

> So lately Google has been pinging the WMF about the lack of sitemaps on
> Commons. If you don't know what those are, sitemaps are a way of telling
> search engines about all the URLs that are hosted on your site, so they
> can find them more easily, or more quickly.[1]
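For anyone following along, a sitemap is just an XML list of URLs. A
minimal file per the protocol at sitemaps.org looks roughly like this
(the URL and date here are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://commons.wikimedia.org/wiki/File:Example.jpg</loc>
        <lastmod>2010-04-15</lastmod>
      </url>
    </urlset>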
We have traditionally had problems with images, with description pages
being assumed to be the images themselves...

> I investigated this issue and found that we do have a sitemaps script in
> maintenance, but it hasn't been enabled on the Wikipedias since
> 2007-12-27. In the meantime it was discovered that Google wrote some
> custom crawling bot for Recent Changes, so it was never re-enabled for them.
>
> As for Commons: we don't have a sitemap either, but from a cursory
> examination of Google Image Search I don't think they are crawling our
> Recent Changes. Even if they were, there's more to life than Google --
> we also want to be in other search engines, tools like TinEye, etc. So
> it would be good to have this back again.
>
> a) any objections, volunteers, whatever, for re-enabling the sitemaps
> script on Commons? This means probably just adding it back into daily cron.

Have you tested it first? How long does it take?

> b) anyone want to work on making it more efficient and/or better?

Commons has 13M pages. Since the protocol caps each file at 50,000 URLs,
that means generating at least 260 sitemaps. You could do some tricks,
grouping pages into sitemaps by page_id and then updating the relevant
sitemap on each edit, but rewriting one URL among 10,000 others inside a
text file would leave lots of Apaches waiting on the file lock. That
could be overcome with some kind of journal applied to the sitemaps
later, but coming full circle, that's equivalent to updating the
sitemaps based on recentchanges data. (A rough sketch of the page_id
grouping is at the end of this mail.)

> Google has introduced some nifty extensions to the Sitemap protocol,
> including geocoding and (especially dear to our hearts) licensing![2]
> However we don't have such information easily available in the database,
> so this requires parsing through every File page, which will take
> several millennia.
>
> This will not work at all with the current sitemaps script as it scans
> the entire database every time and regenerates a number of sitemap
> files from scratch. So, what we need is something more iterative, that
> only scans recent stuff. (Or, using such extensions will have to wait
> until someone brings licensing into the database).

We can start using <image:image> <image:loc> now. The other extensions
will have to wait.
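To make that concrete, a <url> entry using the image extension would
look roughly like this (tag names per Google's image sitemap docs; both
URLs are placeholders, and xmlns:image has to be declared on the
enclosing <urlset>):

    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
      <url>
        <loc>http://commons.wikimedia.org/wiki/File:Example.jpg</loc>
        <image:image>
          <image:loc>http://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg</image:loc>
        </image:image>
      </url>
    </urlset>

The licensing and geocoding parts would be <image:license> and
<image:geo_location> inside the same <image:image> block, and that is
exactly the data we don't have in the database yet.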
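And since I mentioned grouping by page_id: a rough sketch of the idea,
in Python for brevity (the real script in maintenance/ is PHP, and the
pages_in_bucket/url_for helpers here are made up stand-ins for queries
against the page table):

    import gzip

    URLS_PER_FILE = 50000  # the sitemap protocol's per-file limit

    def bucket(page_id):
        # Fixed page_id -> file mapping: a page always lands in the same
        # sitemap, so one edit dirties one file instead of all ~260.
        return page_id // URLS_PER_FILE

    def rewrite_dirty(changed_page_ids, pages_in_bucket, url_for):
        # Rewrite only the sitemap files containing a changed page.
        for n in sorted({bucket(p) for p in changed_page_ids}):
            with gzip.open("sitemap-%03d.xml.gz" % n, "wt") as f:
                f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
                f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
                for page_id in pages_in_bucket(n):
                    f.write('  <url><loc>%s</loc></url>\n' % url_for(page_id))
                f.write('</urlset>\n')

The locking problem is that loop body: rewriting a 50,000-entry file on
every edit is what a journal, or a recentchanges-driven batch run, would
have to amortize.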
