[jira] [Created] (NUTCH-2588) Getting status code x01 (unfetched) on more than 80% crawled urls

Usama Tahir (JIRA) Mon, 28 May 2018 23:29:35 -0700

Usama Tahir created NUTCH-2588:
----------------------------------

             Summary: Getting status code x01 (unfetched) on more than 80% 
crawled urls
                 Key: NUTCH-2588
                 URL: https://issues.apache.org/jira/browse/NUTCH-2588
             Project: Nutch
          Issue Type: Bug
          Components: crawldb, fetcher
    Affects Versions: 2.3.1
         Environment: I am using Apache Nutch 2.3.1 with Hadoop 2.7.6 and HBase 
0.98.8-hadoop2.


Operating System: Ubuntu 16.04
            Reporter: Usama Tahir


When I run Nutch with external links enabled, a seed list of 10 URLs, and 
5 rounds, using the command

bin/crawl <seed_path> <db>  [<solr url>] <number of rounds>

I use the default topN value, which is 50000.

The process finishes in 11 to 12 hours and generates about 2.8 lac 
(280,000) URL rows.

When we analyze the HBase table and check the status codes of all URLs, we 
find about 2.4 lac (240,000) URLs with status code x01 (unfetched).

That means about 2.4 lac of the 2.8 lac URLs that Nutch extracted remain 
unfetched.
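
For reference, the share quoted in the summary can be checked with a small 
sketch like the one below. It uses only the rounded counts given above 
(1 lac = 100,000); the function name is made up for illustration and is not 
part of Nutch.

```python
LAKH = 100_000  # 1 lac (lakh) = 100,000


def unfetched_share(unfetched_rows: int, total_rows: int) -> float:
    """Return the fraction of rows that are still in the unfetched state."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return unfetched_rows / total_rows


# Rounded counts from this report, not exact table counts.
total = round(2.8 * LAKH)      # about 280,000 URL rows in the table
unfetched = round(2.4 * LAKH)  # about 240,000 rows with status x01

print(f"{unfetched_share(unfetched, total):.1%}")  # prints "85.7%"
```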

After some debugging of Nutch and analysis of its logs, I found that no 
fetch was even attempted for the URLs with status code x01.

Is this a bug in Nutch or a configuration issue?
Kindly help resolve this as soon as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
