Hi Jason,

I've been spending some time on an improved robots.txt parser, as part of my Bixo project.

One aspect is support for Google wildcard extensions.

I think this will be part of the proposed "crawler-commons" project, where we'll put components that can/should be shared among Nutch, Bixo, Heritrix and Droids.
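
To make the semantics concrete, here's a rough sketch of what that wildcard matching could look like (a hypothetical WildcardRuleMatcher, not the actual Bixo/crawler-commons code): '*' matches any run of characters, a trailing '$' anchors the rule at the end of the path, and everything else stays a plain prefix match.

import java.util.regex.Pattern;

public class WildcardRuleMatcher {

    // Translates a robots.txt rule pattern into a regex and tests a path.
    // '*' matches any run of characters; a trailing '$' anchors the match
    // at the end of the path; all other characters are treated literally.
    public static boolean matches(String rulePattern, String path) {
        boolean anchored = rulePattern.endsWith("$");
        String pattern = anchored
            ? rulePattern.substring(0, rulePattern.length() - 1)
            : rulePattern;

        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        if (!anchored) {
            regex.append(".*");  // un-anchored rules are prefix matches
        }
        return Pattern.matches(regex.toString(), path);
    }
}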

One thing that would be useful is to collect examples of "advanced" robots.txt files, in addition to broken ones.

It would be great if you could open a Jira issue and attach specific examples of the above that you know about.

Thanks!

-- Ken


On Nov 19, 2009, at 11:31am, J.G.Konrad wrote:

I'm using nutch-1.0 and have noticed, after running some tests, that the
robot rules parser does not support wildcards (a.k.a. globbing) in
rules. This means such a rule will not work the way the person who
wrote the robots.txt file expected it to.  For example:

User-Agent: *
Disallow: /somepath/*/someotherpath

Even Yahoo has one such rule ( http://m.www.yahoo.com/robots.txt ):
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?
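
With a matcher like the sketch above (the hypothetical WildcardRuleMatcher), these rules would behave like this:

WildcardRuleMatcher.matches("/somepath/*/someotherpath",
    "/somepath/foo/someotherpath");                     // true - blocked
WildcardRuleMatcher.matches("/*?", "/search?p=nutch");  // true - any URL with a query string
WildcardRuleMatcher.matches("/*?", "/index.html");      // false - no query string, allowed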

With the popularity of the wildcard (*) in robots.txt files these days,
what are the plans/thoughts on adding support for it in Nutch?

Thanks,
 Jason

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



