Hello,

I'm trying to detect redirected URLs and URLs blocked by robots.txt,
but it appears the ProtocolStatus values for those cases are not set.

This is how I'm trying to check whether the URL was redirected
(_currentOutput is a FetcherOutput):

    public boolean isFetchRedirect() {
        int code = _currentOutput.getProtocolStatus().getCode();
        return (code == ProtocolStatus.MOVED)
            || (code == ProtocolStatus.TEMP_MOVED)
            || (code == ProtocolStatus.REDIR_EXCEEDED);
    }

This is how I'm trying to check whether the URL was blocked by robots.txt:

    public boolean isFetchBlockedForRobots() {
        return _currentOutput.getProtocolStatus().getCode()
            == ProtocolStatus.ROBOTS_DENIED;
    }


Is this the right way to do it?

After a bit of debugging I found that the latter case (URLs blocked
by robots.txt) is reported inconsistently.  For example:

http://del.icio.us/merlinmann
STATUS: notfound(14), lastModified=0

http://del.icio.us/biketrouble
STATUS: exception(16), lastModified=0:
org.apache.nutch.protocol.ResourceGone: Blocked by robots.txt


Each STATUS line above is the output of ProtocolStatus.toString().
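
In the meantime, a workaround I'm considering is to also treat an
exception(16) status carrying the "Blocked by robots.txt" message as a
robots block.  Below is a minimal sketch that uses a stand-in class
instead of the real ProtocolStatus.  RobotsCheck and isBlockedByRobots
are hypothetical names, the codes 14 and 16 are taken from the STATUS
lines above, and the value 18 for ROBOTS_DENIED is an assumption on my
part.

```java
// Stand-in for the relevant ProtocolStatus codes (not the Nutch class).
// NOTFOUND and EXCEPTION come from the STATUS lines quoted above;
// the numeric value of ROBOTS_DENIED is an assumption.
final class RobotsCheck {
    static final int NOTFOUND = 14;
    static final int EXCEPTION = 16;
    static final int ROBOTS_DENIED = 18; // assumed value

    // Message text seen in the exception(16) STATUS line.
    static final String ROBOTS_MARKER = "Blocked by robots.txt";

    private RobotsCheck() {}

    /**
     * Returns true if the status looks like a robots.txt block: either the
     * dedicated code, or a generic exception whose text carries the marker.
     * statusText is expected to be the ProtocolStatus.toString() output.
     */
    static boolean isBlockedByRobots(int code, String statusText) {
        if (code == ROBOTS_DENIED) {
            return true;
        }
        return code == EXCEPTION
            && statusText != null
            && statusText.contains(ROBOTS_MARKER);
    }
}
```

This would catch the del.icio.us/biketrouble case, but not the
notfound(14) case, since that status carries nothing that points to
robots.txt.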

I'm using a fresh Nutch 0.7-dev build from SVN, checked out and built
this morning.

Thanks,
Otis

____________________________________________________________________
Simpy -- simpy.com -- tags, social bookmarks, personal search engine

