[ https://issues.apache.org/jira/browse/NUTCH-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494720#comment-16494720 ]

Usama Tahir commented on NUTCH-2588:
------------------------------------

[~wastl-nagel] you saved my day!

I have tested it and confirmed that "All URLs found in the last cycle will be 
fetched in a later cycle."

I want to know: is there any seed URL limit in Nutch? According to you, "By 
default bin/crawl only spends 3 hours fetching the URLs of one cycle". Now, if 
I have a seed list of more than 20 thousand sites, then given this 3-hour 
limit it seems impossible for Nutch to fetch all of the seed-list URLs in the 
first cycle. I want to know what happens to the remaining seed-list URLs: are 
they dropped for good, or are they retried in the next cycle?

 

Also, if Nutch has a default limit of 3 hours for each cycle, doesn't that 
create a large backlog of URLs that never get fetched?
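For reference, I believe the 3-hour limit mentioned above corresponds to the 
fetcher.timelimit.mins property, which the bin/crawl script sets through its 
timeLimitFetch variable (180 minutes by default). A sketch of overriding it in 
conf/nutch-site.xml, assuming that property is indeed the bottleneck (the 
value 360 is only an illustrative example, not a recommendation):

```xml
<!-- conf/nutch-site.xml: raise the per-fetch-cycle time limit.
     Example value only; set to -1 to disable the limit entirely. -->
<property>
  <name>fetcher.timelimit.mins</name>
  <value>360</value>
  <description>Maximum number of minutes a fetch cycle may run before
  the fetcher stops and the remaining URLs wait for a later cycle.
  </description>
</property>
```

Note that bin/crawl may pass its own -D fetcher.timelimit.mins on the command 
line, which would override nutch-site.xml, so the script's timeLimitFetch 
variable may need editing as well.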

> Getting status code x01 (unfetched) on more than 80% crawled urls
> -----------------------------------------------------------------
>
>                 Key: NUTCH-2588
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2588
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, fetcher
>    Affects Versions: 2.3.1
>         Environment: I am using Apache Nutch 2.3.1 with Hadoop 2.7.6 and 
> HBase 0.98.8-hadoop2.
> Operating System: Ubuntu 16.04
>            Reporter: Usama Tahir
>            Priority: Major
>
> When I run Nutch with external links enabled, a seed of 10 URLs, and 5 
> rounds using the command 
> bin/crawl <seed_path> <db>  [<solr url>] <number of rounds>
> with the default topN value of 50000,
> the process completes in 11 to 12 hours and generates about 280000 URL rows.
> When we analyze the HBase table and check the status codes of all URLs, we 
> get roughly 242000 URLs with a status code of x01 [unfetched].
> That means 242000 of the 280000 URLs Nutch extracted remain unfetched.
> After some debugging of Nutch and analyzing its logs, I found that the URLs 
> with status code x01 were never even attempted.
> Is this a bug in Nutch or a configuration issue?
> Kindly resolve my issue as soon as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
