[ http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12413602 ]
Lukas Vlcek commented on NUTCH-273:
-----------------------------------

Maybe I am wrong, but handling redirects can be a very complex topic, and I am not sure a general solution can easily be found. Right now I am facing the following issue: we have a legacy document repository on the corporate intranet (accessed via HTTP). People created a lot of links to it over the years but never updated the old HTML files that contain those links, so we now have tons of links to documents that are already gone. When such a document is requested, the repository simply redirects the request to a default page (the main page in this case).

For example, the URLs http://some.repo/executive_success.pdf and http://some.repo/individual_failure.doc are both redirected to the same default main page with unrelated content (it could be a contact list, for example). Does that mean that executive_success and individual_failure are both related to the contact list?

I am not sure how much work Nutch plugins could do for us here, but to me it seems that redirect handling should be very flexible. Would it help if redirect handling were extracted out of nutch-core into the plugin system?

> When a page is redirected, the original url is NOT updated.
> -----------------------------------------------------------
>
>          Key: NUTCH-273
>          URL: http://issues.apache.org/jira/browse/NUTCH-273
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: n/a
>     Reporter: Lukas Vlcek
>
> [Excerpt from maillist, sender: Andrzej Bialecki]
> When a page is redirected, the original url is NOT updated - so, CrawlDB will
> never know that a redirect occurred, it won't even know that a fetch
> occurred... This looks like a bug.
> In 0.7 this was recorded in the segment, and then it would affect the Page
> status during updatedb. It should do so in 0.8, too...

--
This message is automatically generated by JIRA.
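To make the redirect scenario from the comment concrete, here is a minimal Java sketch of what a pluggable redirect policy might look like. Everything in it is an assumption made for illustration: RedirectPolicy, CatchAllPolicy and the maxSources threshold are hypothetical names, not part of Nutch or its plugin API. The idea is only that a policy could refuse to attribute the target's content to a source URL once many distinct sources redirect to the same target.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RedirectSketch {

    /** Hypothetical extension point; NOT an existing Nutch interface. */
    interface RedirectPolicy {
        /** Returns true if content fetched from target may be attributed to source. */
        boolean accept(String source, String target);
    }

    /** Rejects redirects into a target that more than maxSources distinct URLs share. */
    static class CatchAllPolicy implements RedirectPolicy {
        private final int maxSources; // hypothetical threshold, chosen by the operator
        private final Map<String, Set<String>> sourcesByTarget = new HashMap<>();

        CatchAllPolicy(int maxSources) {
            this.maxSources = maxSources;
        }

        @Override
        public boolean accept(String source, String target) {
            // Remember every distinct source URL seen for this redirect target.
            Set<String> sources =
                sourcesByTarget.computeIfAbsent(target, t -> new HashSet<>());
            sources.add(source);
            return sources.size() <= maxSources;
        }
    }

    public static void main(String[] args) {
        RedirectPolicy policy = new CatchAllPolicy(1);
        String mainPage = "http://some.repo/";
        System.out.println(
            policy.accept("http://some.repo/executive_success.pdf", mainPage));
        System.out.println(
            policy.accept("http://some.repo/individual_failure.doc", mainPage));
    }
}
```

This sketch decides one redirect at a time, so redirects seen before the threshold is reached are still accepted. A real policy would probably need to look at the whole crawl, which is one more argument for keeping this logic replaceable.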
- If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
