[ 
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12440221 ] 
            
Uros Gruber commented on NUTCH-353:
-----------------------------------

I don't think there is a 100% solution, mostly because not every site respects
the standards. For example, www.imb.com uses the 302 status code, which the RFC
defines as: "The requested resource resides temporarily under a different URI.
Since the redirection might be altered on occasion, the client SHOULD continue
to use the Request-URI for future requests. This response is only cacheable if
indicated by a Cache-Control or Expires header field." This case is clear: we
should use the original URL.

But then there is also the permanent redirect (301), in which case we SHOULD
replace the old URL and also update all links pointing to the old URL with the
new one.

I have also seen some examples of wrong redirections. One of them was my own
fault. I use an Alias definition on the Apache server to accept connections
without the www subdomain. Then, on one page, I left a link to the main page
pointing to index.php instead of just /. After a while domain.si/index.php
became more important than www.domain.si (both point to the same site).

So as I see it, this job is not simple at all. Maybe we need a schema or some
sort of flow diagram that indicates what to do in each particular situation.
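
As a rough starting point for such a schema, the two status-code rules above
could be sketched like this. This is only an illustration, not Nutch code: the
class RedirectPolicy, the Action enum and both methods are hypothetical names,
and every status other than 301 and 302 is deliberately left undecided.

```java
public class RedirectPolicy {

    /** What to do with the URL we originally requested. */
    public enum Action { KEEP_ORIGINAL, REPLACE_WITH_TARGET, UNDECIDED }

    /** Maps an HTTP status code to an action (hypothetical helper). */
    public static Action decide(int httpStatus) {
        switch (httpStatus) {
            case 301: // permanent redirect: the new URL replaces the old one
                return Action.REPLACE_WITH_TARGET;
            case 302: // temporary redirect: keep using the Request-URI
                return Action.KEEP_ORIGINAL;
            default:  // assumption: anything else needs its own rule
                return Action.UNDECIDED;
        }
    }

    /** Returns the URL that should end up in the index. */
    public static String urlToIndex(int httpStatus, String requestedUrl,
                                    String redirectTarget) {
        boolean replace = decide(httpStatus) == Action.REPLACE_WITH_TARGET
                && redirectTarget != null;
        return replace ? redirectTarget : requestedUrl;
    }

    public static void main(String[] args) {
        // Example URLs are placeholders.
        System.out.println(urlToIndex(302, "http://example.com/",
                "http://example.com/tmp"));
        System.out.println(urlToIndex(301, "http://example.com/old",
                "http://example.com/new"));
    }
}
```

Sites that misuse the status codes, like the examples in this comment, would
still need extra rules on top of this.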

I hope my notes help a bit, because at the moment we really have a lot of
unwanted URLs in our index.


> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>
>                 Key: NUTCH-353
>                 URL: http://issues.apache.org/jira/browse/NUTCH-353
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>         Assigned To: Andrzej Bialecki 
>            Priority: Blocker
>             Fix For: 0.9.0
>
>         Attachments: doNotRefecthForwarderPagesV1.patch
>
>
> Pages that do a serverside forward are not written with a status change back 
> into the crawlDb. Also the nextFetchTime is not changed. 
> This causes a refetch of the same page again and again. The result is that 
> nutch is not polite and refetches the forwarding and target page in each 
> segment iteration. It also affects the scoring, since the forward page 
> contributes its score to all outlinks.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
