[ 
http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12413602 ] 

Lukas Vlcek commented on NUTCH-273:
-----------------------------------

Maybe I am wrong, but handling redirects can be a very complex topic, and I am not 
sure a general solution can be easily found.

Right now I am facing the following issue: we have a legacy document 
repository on the corporate intranet (accessed via HTTP), and over the years 
people created a lot of links to it but never updated the old HTML files 
containing those links. As a result, we have tons of links to documents that 
are already gone. When such a document is requested, the document repository 
simply redirects the request to a default page (the main page in this case).

For example, the URLs http://some.repo/executive_success.pdf and 
http://some.repo/individual_failure.doc are both redirected to the same default 
main page with unrelated content (it could be a contact list, for example). Does 
that mean that executive_success and individual_failure are both related to the 
contact list?
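
To make the problem concrete, here is a minimal sketch of how such catch-all 
redirect targets might be spotted: count how many distinct source URLs end up 
at the same target and flag targets above a threshold. The class 
RedirectTargetCounter, its methods, and the threshold are hypothetical names 
for illustration only and are not part of Nutch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not a Nutch class): remembers which source URLs
// were redirected to each target URL.
class RedirectTargetCounter {
    private final Map<String, Set<String>> sourcesByTarget = new HashMap<>();
    private final int threshold; // assumed cut-off, to be tuned per site

    RedirectTargetCounter(int threshold) {
        this.threshold = threshold;
    }

    // Record one observed redirect from sourceUrl to targetUrl.
    void record(String sourceUrl, String targetUrl) {
        sourcesByTarget
            .computeIfAbsent(targetUrl, k -> new HashSet<>())
            .add(sourceUrl);
    }

    // True when at least `threshold` distinct sources redirect to targetUrl,
    // which suggests the target is a default page rather than a real new home.
    boolean looksLikeCatchAll(String targetUrl) {
        Set<String> sources = sourcesByTarget.get(targetUrl);
        return sources != null && sources.size() >= threshold;
    }
}
```

With a threshold of 2, recording both example URLs above against the same 
main page would flag that page as a likely catch-all target. A real 
repository would probably need a much higher threshold.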

I am not sure how much work Nutch plugins could do for us here, but to me it 
seems that redirect handling should be very flexible. Would it help if 
redirect handling were extracted out of nutch-core into the plugin system?
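
As a rough sketch of what I have in mind, a pluggable redirect policy could 
look something like the following. The RedirectPolicy interface, its Action 
values, and DefaultPagePolicy are all made up for this example; Nutch does not 
define such an extension point today.

```java
import java.util.Set;

// Hypothetical extension point (not an existing Nutch interface): decides
// what the fetcher should do with a redirect.
interface RedirectPolicy {
    enum Action { FOLLOW, MARK_GONE }

    Action decide(String sourceUrl, String targetUrl);
}

// Example policy: a redirect to a known default page means the original
// document is gone, so the target's content is not attributed to the source.
class DefaultPagePolicy implements RedirectPolicy {
    private final Set<String> defaultPages; // assumed to be configured per site

    DefaultPagePolicy(Set<String> defaultPages) {
        this.defaultPages = defaultPages;
    }

    @Override
    public Action decide(String sourceUrl, String targetUrl) {
        return defaultPages.contains(targetUrl) ? Action.MARK_GONE : Action.FOLLOW;
    }
}
```

Each site could then ship its own policy without any change to the core 
fetcher code.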

> When a page is redirected, the original url is NOT updated.
> -----------------------------------------------------------
>
>          Key: NUTCH-273
>          URL: http://issues.apache.org/jira/browse/NUTCH-273
>      Project: Nutch
>         Type: Bug

>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: n/a
>     Reporter: Lukas Vlcek

>
> [Excerpt from maillist, sender: Andrzej Bialecki]
> When a page is redirected, the original URL is NOT updated - so, CrawlDB will 
> never know that a redirect occurred, it won't even know that a fetch 
> occurred... This looks like a bug.
> In 0.7 this was recorded in the segment, and then it would affect the Page 
> status during updatedb. It should do so in 0.8, too...



