[ 
https://issues.apache.org/jira/browse/NUTCH-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943382#comment-14943382
 ] 

Yogendra Kumar Soni commented on NUTCH-2124:
--------------------------------------------

Hello Sebastian,
I applied the patch, but the problem is still there. I have not investigated 
it yet; I will get back once I have found the cause.
There is a further issue: some sites use redirects to obtain a session ID 
(cookies). The request may be redirected to a domain that we don't know in 
advance and then redirected back to the original URL with the session cookies 
set. If, when following redirects is enabled, we followed redirects until 
HTTP status 200 and only then applied the URL filters, these kinds of sites 
could be crawled.
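
To make the proposal concrete, here is a rough sketch of the idea: follow 
redirects until a 200 response (or until the redirect limit is hit) and apply 
the URL filter only to the final URL. This is not Nutch code and not a patch. 
RedirectSketch, Response, resolve, the fetcher function and the example.com / 
sso.example.net URLs are all made up for illustration, and the "session" is 
just a flag standing in for a cookie.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative only: none of these names exist in Nutch.
class RedirectSketch {

    /** Hypothetical fetch result: HTTP status and redirect target (null if none). */
    static final class Response {
        final int status;
        final String location;

        Response(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /**
     * Follows redirects from startUrl until a 200 response is seen, allowing
     * at most maxRedirects redirects. Intermediate URLs are NOT filtered;
     * the URL filter is applied only to the URL that finally returned 200.
     * Returns empty if the limit is exceeded, a non-redirect error status is
     * seen, or the final URL is rejected by the filter.
     */
    static Optional<String> resolve(String startUrl,
                                    Function<String, Response> fetcher,
                                    Predicate<String> urlFilter,
                                    int maxRedirects) {
        String url = startUrl;
        for (int fetches = 0; fetches <= maxRedirects; fetches++) {
            Response r = fetcher.apply(url);
            if (r.status == 200) {
                // Filter only now, on the final URL.
                return urlFilter.test(url) ? Optional.of(url) : Optional.empty();
            }
            boolean isRedirect = r.status >= 300 && r.status < 400 && r.location != null;
            if (!isRedirect) {
                return Optional.empty();
            }
            url = r.location;
        }
        return Optional.empty(); // redirect count exceeded
    }

    public static void main(String[] args) {
        // Simulated site: the page redirects to an unknown login host, which
        // sets a "session" and redirects back to the original URL.
        boolean[] hasSession = {false};
        Function<String, Response> site = url -> {
            if (url.equals("http://example.com/page")) {
                return hasSession[0]
                        ? new Response(200, null)
                        : new Response(302, "http://sso.example.net/login");
            }
            if (url.equals("http://sso.example.net/login")) {
                hasSession[0] = true;
                return new Response(302, "http://example.com/page");
            }
            return new Response(404, null);
        };
        // Filter that only accepts the crawled host.
        Predicate<String> filter = u -> u.startsWith("http://example.com/");
        System.out.println(resolve("http://example.com/page", site, filter, 5));
    }
}
```

In this sketch the intermediate hop to sso.example.net would be rejected if 
the filter were applied to every redirect target, but the final URL passes 
because the filter is only consulted after the 200. A redirect that never 
reaches a 200 still ends once maxRedirects is exhausted.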

> redirect following same link again and again , max redirect exceed and went 
> db_gone
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-2124
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2124
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.11
>            Reporter: Yogendra Kumar Soni
>            Priority: Blocker
>              Labels: db_gone, fetcher, redirect
>             Fix For: 1.11
>
>         Attachments: NUTCH-2124.patch
>
>
> Hello, followredirect is not working in trunk. Please see the log below.
> Fetcher: throughput threshold retries: 5
> fetcher.maxNum.threads can't be < than 50 : using 50 instead
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=1
> {color:red}
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
>  - redirect count exceeded http://www.wikipedia.com/wiki/URL_redirection
> {color}
> Thread FetcherThread has no more work available
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> -activeThreads=0
> Fetcher: finished at 2015-09-28 19:32:05, elapsed: 00:00:09
> Parsing : 20150928193153



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
