Hi Jason,

I've been spending some time on an improved robots.txt parser, as part of my Bixo project.

One aspect is support for Google wildcard extensions.

I think this will be part of the proposed "crawler-commons" project, where we'll put components that can/should be shared among Nutch, Bixo, Heritrix and Droids.
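
To make the semantics concrete, here's a rough sketch of what that wildcard matching could look like (a hypothetical WildcardRuleMatcher, not the actual Bixo/crawler-commons code): '*' matches any run of characters, a trailing '$' anchors the rule at the end of the path, and everything else stays a plain prefix match.

import java.util.regex.Pattern;

public class WildcardRuleMatcher {

    // Translates a robots.txt rule pattern into a regex and tests a path.
    // '*' matches any run of characters; a trailing '$' anchors the match
    // at the end of the path; all other characters are treated literally.
    public static boolean matches(String rulePattern, String path) {
        boolean anchored = rulePattern.endsWith("$");
        String pattern = anchored
            ? rulePattern.substring(0, rulePattern.length() - 1)
            : rulePattern;

        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        if (!anchored) {
            regex.append(".*");  // un-anchored rules are prefix matches
        }
        return Pattern.matches(regex.toString(), path);
    }
}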

One thing that would be useful is to collect examples of "advanced" robots.txt files, in addition to broken ones.

It would be great if you could open a Jira issue and attach specific examples of the above that you know about.

Thanks!

-- Ken


On Nov 19, 2009, at 11:31am, J.G.Konrad wrote:

I'm using nutch-1.0 and have noticed, after running some tests, that the
robot rules parser does not support wildcards (a.k.a. globbing) in
rules. This means such a rule will not work the way the person who
wrote the robots.txt file expected it to.  For example:

User-Agent: *
Disallow: /somepath/*/someotherpath

Even Yahoo has one such rule ( http://m.www.yahoo.com/robots.txt ):
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?
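
With a matcher like the sketch above (the hypothetical WildcardRuleMatcher), these rules would behave like this:

WildcardRuleMatcher.matches("/somepath/*/someotherpath",
    "/somepath/foo/someotherpath");                     // true - blocked
WildcardRuleMatcher.matches("/*?", "/search?p=nutch");  // true - any URL with a query string
WildcardRuleMatcher.matches("/*?", "/index.html");      // false - no query string, allowed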

With the popularity of the wildcard (*) in robots.txt files these days,
what are the plans/thoughts on adding support for it in Nutch?

Thanks,
 Jason

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



