On 2009-12-01, at 12:31, Yaar Schnitman wrote:

> The urls in sitemap.xml are not patterns - they are exact urls the search
> engine will retrieve.
They are exact URLs that the crawler will retrieve, but they in no way
restrict the set of URLs that the crawler can retrieve.

> So, you would blacklist most urls with blanket rules in robots.txt and
> whitelist explicit urls in sitemap.xml. e.g. in robots.txt, blacklist
> /changeset/*, and in sitemap.xml whitelist all
> http://trac.webkit.org/changeset/1 to http://trac.webkit.org/changeset/60000
> (It's going to be a big file alright).

This proposal relies on the sitemap being treated as a whitelist that is
consulted prior to processing the exclusions listed in robots.txt. As I have
mentioned several times, I cannot find any information stating that a sitemap
acts as a whitelist, nor what its precedence relative to robots.txt would be
if it were treated as one. Is there something I'm missing? (A rough sketch of
the sitemap-generating script Yaar describes is included after the quoted
thread below.)

- Mark

> On Tue, Dec 1, 2009 at 11:33 AM, Mark Rowe <mr...@apple.com> wrote:
>
> On 2009-12-01, at 11:04, Yaar Schnitman wrote:
>
>> Robots.txt can exclude most of the trac site, and then include the
>> sitemap.xml. This way you block most of the junk and only give permission
>> to the important files. All major search engines support sitemap.xml, and
>> those that don't will be blocked by robots.txt.
>>
>> A script could generate sitemap.xml from a local svn checkout of trunk. It
>> will produce one url for each source file (frequency=daily) and one url
>> for every revision (frequency=year). That will cover most of the search
>> requirements.
>
> Forgive me, but this doesn't seem to address the issues that I raised in my
> previous message.
>
> To reiterate: We need to allow only an explicit set of URLs to be crawled.
> Sitemaps do not provide this ability. They expose information about a set
> of URLs to a crawler; they do not limit the set of URLs that it can crawl.
> A robots.txt file does provide the ability to limit the set of URLs that
> can be crawled.
>
> However, the semantics of robots.txt seem to make it incredibly unwieldy to
> expose only the content of interest, if it is possible at all. For
> instance, consider exposing <http://trac.webkit.org/changeset/#{revision}>
> while preventing <http://trac.webkit.org/changeset/#{revision}/#{path}> or
> <http://trac.webkit.org/changeset/#{revision}?format=zip&new=#{revision}>
> from being crawled. Another example would be exposing
> <http://trac.webkit.org/browser/#{path}> while preventing
> <http://trac.webkit.org/browser/#{path}?rev=#{revision}> from being
> crawled.
>
> Is there something that I'm missing?
>
> - Mark
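For reference, a minimal sketch of the sitemap-generating script described in
the quoted message above, assuming a local svn checkout of trunk: one <url>
entry per source file (changefreq daily) and one per changeset (changefreq
yearly). The checkout path, the upper revision bound, the /browser/trunk/ URL
form, and the output file name are illustrative assumptions rather than actual
WebKit infrastructure; a sitemap of this size would also have to be split into
a sitemap index, since the protocol caps a single file at 50,000 URLs.

import os
from xml.sax.saxutils import escape

CHECKOUT = "/path/to/webkit-trunk"   # hypothetical local svn checkout of trunk
LATEST_REVISION = 60000              # hypothetical upper bound on changesets

def iter_urls():
    # One URL per source file in the checkout, refreshed daily.
    for dirpath, dirnames, filenames in os.walk(CHECKOUT):
        if ".svn" in dirnames:
            dirnames.remove(".svn")  # skip svn metadata directories
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), CHECKOUT)
            rel = rel.replace(os.sep, "/")
            yield "http://trac.webkit.org/browser/trunk/" + rel, "daily"
    # One URL per changeset; a committed revision rarely changes.
    for rev in range(1, LATEST_REVISION + 1):
        yield "http://trac.webkit.org/changeset/%d" % rev, "yearly"

with open("sitemap.xml", "w") as out:
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url, freq in iter_urls():
        out.write("  <url><loc>%s</loc><changefreq>%s</changefreq></url>\n"
                  % (escape(url), freq))
    out.write("</urlset>\n")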