Re: NoRobotClient bug? Seems like it doesn't check the to-be-crawled URI against robots.txt properly

Mingfai Ma Sat, 04 Apr 2009 18:36:03 -0700

On 5 Apr 2009, at 7:44 AM, Robin Howlett <[email protected]> wrote:

I was just looking through NoRobotClient and have a concern about whether
Droids will actually respect robots.txt when force allow is false in most
scenarios; consider the following robots.txt:

User-agent: *
Disallow: /foo/

and the starting URI: http://www.example.com/foo/bar.html

In the code I see - in NoRobotClient.isUrlAllowed() - the following:

String path = uri.getPath();
String basepath = baseURI.getPath();
if (path.startsWith(basepath)) {
  path = path.substring(basepath.length());
  if (!path.startsWith("/")) {
    path = "/" + path;
  }
}
...
Boolean allowed = this.rules != null ? this.rules.isAllowed(path) : null;
if (allowed == null) {
  allowed = this.wildcardRules != null ? this.wildcardRules.isAllowed(path) : null;
}
if (allowed == null) {
  allowed = Boolean.TRUE;
}

The path will always be converted to /bar.html and is checked against the
Rules in rules and wildcardRules, but won't be found. However, basepath (which
will now be /foo) is never checked against the Rules, therefore giving an
incorrect true result for the isUrlAllowed method, no?

robin
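
To make the scenario above concrete, here is a minimal standalone sketch. It is not the Droids source: the class name RobotsPathSketch and the prefix-matching isAllowed helper are hypothetical stand-ins for the real rule lookup, and the base URI http://www.example.com/foo is assumed from the description above. Only the path-stripping step is copied from the quoted snippet.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

class RobotsPathSketch {

    // Same stripping logic as the quoted snippet: drop the base path prefix,
    // then make sure the remainder still starts with "/".
    public static String strippedPath(URI uri, URI baseURI) {
        String path = uri.getPath();
        String basepath = baseURI.getPath();
        if (path.startsWith(basepath)) {
            path = path.substring(basepath.length());
            if (!path.startsWith("/")) {
                path = "/" + path;
            }
        }
        return path;
    }

    // Hypothetical stand-in for the rule lookup: a path is disallowed when it
    // starts with any Disallow prefix. The real rules class may behave differently.
    public static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed base URI, chosen so that basepath is /foo as described above.
        URI base = URI.create("http://www.example.com/foo");
        URI target = URI.create("http://www.example.com/foo/bar.html");
        List<String> disallowed = Arrays.asList("/foo/");

        // Check the stripped path, as the quoted code does.
        String stripped = strippedPath(target, base);
        System.out.println(stripped + " allowed: " + isAllowed(stripped, disallowed));

        // Check the full path of the target URI instead.
        String full = target.getPath();
        System.out.println(full + " allowed: " + isAllowed(full, disallowed));
    }
}
```

Under these assumptions the stripped path /bar.html passes the check while the full path /foo/bar.html is rejected, which is the mismatch described above. Whether the real client ends up with a non-empty basepath depends on how baseURI is set, which this sketch does not model.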

I believe the NoRobotClient has a problem, too. My crawling job got stuck when accessing the 2nd link of a domain. I have to work around the problem with the force allow flag.


Regards
Mingfai
