On 2009-12-01, at 12:31, Yaar Schnitman wrote:

> The URLs in sitemap.xml are not patterns - they are exact URLs that the search 
> engine will retrieve.

They are exact URLs that the crawler will retrieve, but they in no way restrict 
the set of URLs that the crawler can retrieve.

> So, you would blacklist most URLs with blanket rules in robots.txt and 
> whitelist explicit URLs in sitemap.xml. For example, in robots.txt, blacklist 
> /changeset/*, and in sitemap.xml whitelist everything from 
> http://trac.webkit.org/changeset/1 to http://trac.webkit.org/changeset/60000 
> (it's going to be a big file, all right).
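
Concretely, I take the proposal to amount to something like the following pair
of files (my sketch only; the paths and revision range are the ones above, and
note that plain robots.txt Disallow rules are prefix matches, so the trailing
"*" isn't needed):

    # robots.txt: block everything under /changeset/ by default
    User-agent: *
    Disallow: /changeset/

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- sitemap.xml: one <url> entry per revision -->
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://trac.webkit.org/changeset/1</loc></url>
      <url><loc>http://trac.webkit.org/changeset/2</loc></url>
      <!-- ... and so on, up to /changeset/60000 ... -->
    </urlset>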

This proposal relies on the sitemap being treated as a whitelist that is 
consulted prior to processing the exclusions listed in robots.txt.  As I have 
mentioned several times, I cannot find any information that states that a 
sitemap acts as a whitelist, nor what its precedence relative to robots.txt 
would be if it were to be treated as a whitelist.  Is there something I'm 
missing?
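
For completeness, the closest robots.txt alone seems to get to a whitelist is
pairing Allow and Disallow rules, roughly as sketched below for the changeset
and browser examples from my earlier message (quoted further down). Bear in
mind that Allow and the "*" wildcard are extensions honoured by Google and
Bing rather than part of the original robots.txt specification, and that the
precedence between overlapping Allow and Disallow rules varies by crawler,
which is much of what makes this approach so unwieldy:

    User-agent: *
    # Block the whole site by default
    Disallow: /
    # Re-allow the bare changeset pages...
    Allow: /changeset/
    # ...then block their per-path and query-string variants again
    Disallow: /changeset/*/
    Disallow: /changeset/*?
    # Same dance for the source browser
    Allow: /browser/
    Disallow: /browser/*?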

- Mark

> On Tue, Dec 1, 2009 at 11:33 AM, Mark Rowe <mr...@apple.com> wrote:
> 
> On 2009-12-01, at 11:04, Yaar Schnitman wrote:
> 
>> Robots.txt can exclude most of the trac site, and then reference the 
>> sitemap.xml. This way you block most of the junk and only give permission to 
>> the important files. All major search engines support sitemap.xml, and those 
>> that don't will be blocked by robots.txt.
>> 
>> A script could generate sitemap.xml from a local svn checkout of trunk. It 
>> would produce one URL for each source file (frequency=daily) and one URL for 
>> every revision (frequency=yearly). That would cover most of the search 
>> requirements.
> 
> Forgive me, but this doesn't seem to address the issues that I raised in my 
> previous message.
> 
> To reiterate: We need to allow only an explicit set of URLs to be crawled.  
> Sitemaps do not provide this ability.  They expose information about a set of 
> URLs to a crawler; they do not limit the set of URLs that it can crawl.  A 
> robots.txt file does provide the ability to limit the set of URLs that can be 
> crawled.
> 
> However, the semantics of robots.txt seem to make it incredibly unwieldy to 
> expose only the content of interest, if it is possible at all.  For instance, 
> we would need to expose <http://trac.webkit.org/changeset/#{revision}> while 
> preventing <http://trac.webkit.org/changeset/#{revision}/#{path}> or 
> <http://trac.webkit.org/changeset/#{revision}?format=zip&new=#{revision}> 
> from being crawled.  Another example would be exposing 
> <http://trac.webkit.org/browser/#{path}> while preventing 
> <http://trac.webkit.org/browser/#{path}?rev=#{revision}> from being crawled.
> 
> Is there something that I'm missing?
> 
> - Mark
> 
> 

