[ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896921#action_12896921
 ] 

Julien Nioche commented on NUTCH-864:
-------------------------------------

Thanks for the explanations.

{quote}
Redirect metadatum is used to mark it as a redirect. The idea is you can have 
long redirect chains but you want ONE representative URL for all of them. For 
example, yahoo.com may redirect to www.yahoo.com which may redirect to 
www.yahoo.com/index.html. In this case (IIRC), we designate www.yahoo.com to 
represent all the redirections.
{quote}

Yep, that's what URLUtil.chooseRepr() is used for. As far as I can see, the 
behaviour is mostly the same as in 1.x, i.e. we determine which of the source 
or target looks nicer and store it in the target only. 
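
To make the idea concrete, here is a minimal Python sketch of picking one representative URL for a redirect chain. It is not the actual URLUtil.chooseRepr() logic, which is Java and has its own rules. The function names choose_repr and representative, and the heuristic itself (prefer a root URL, then a www. host, then the shorter URL), are assumptions for illustration only.

```python
from urllib.parse import urlsplit


def choose_repr(src: str, dst: str) -> str:
    """Return the 'nicer' of a redirect source and target (toy heuristic)."""
    s, d = urlsplit(src), urlsplit(dst)

    # Assumption: a root URL (no path, no query) is nicer than a deeper page.
    s_root = s.path in ("", "/") and not s.query
    d_root = d.path in ("", "/") and not d.query
    if s_root != d_root:
        return src if s_root else dst

    # Assumption: between otherwise similar URLs, prefer the www. host.
    s_www = (s.hostname or "").startswith("www.")
    d_www = (d.hostname or "").startswith("www.")
    if s_www != d_www:
        return src if s_www else dst

    # Fall back to the shorter URL; on a tie keep the redirect target.
    return src if len(src) < len(dst) else dst


def representative(chain: list[str]) -> str:
    """Fold a redirect chain down to a single representative URL."""
    rep = chain[0]
    for url in chain[1:]:
        rep = choose_repr(rep, url)
    return rep


chain = [
    "http://yahoo.com/",
    "http://www.yahoo.com/",
    "http://www.yahoo.com/index.html",
]
print(representative(chain))
```

With the yahoo.com chain from the quote above, this toy heuristic settles on http://www.yahoo.com/, which matches the example given there.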

{quote}
The reason we do not mark them as unfetched is they may already be fetched. 
Continuing from the above example, www.yahoo.com may already be FETCHED. During 
update step, these URLs should be recognized and then injected if necessary. I 
can see how it may be a bit unintuitive that until updatedb they are 
essentially status-less. Julien, any suggestions?
{quote}

We could use an explicit status code instead of relying on the default. In 
theory there should be no entries with status 0 left after an update, so 
creating a status just for that may be overkill. 
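
If we keep relying on the default, a cheap sanity check is to count the rows still at status 0. Below is a minimal Python sketch of that check, not Nutch code. The name rows_with_default_status is hypothetical, and the sample data is simply the status counts from the -stats output in this issue, which was taken after the fetch and before any update.

```python
from collections import Counter

# Shown as "status 0 (null)" by WebTableReader in this report.
DEFAULT_STATUS = 0


def rows_with_default_status(statuses) -> int:
    """Count rows whose status was never set explicitly."""
    return Counter(statuses)[DEFAULT_STATUS]


# Status code -> row count, copied from the -stats output in this issue.
reported = {0: 649, 2: 1177, 3: 112, 34: 93, 4: 138, 5: 521}

# Expand the counts into one status value per row.
statuses = [code for code, n in reported.items() for _ in range(n)]

leftover = rows_with_default_status(statuses)
print(len(statuses), leftover)
```

On the reported numbers this prints 2690 and 649: the per-status counts add up to the TOTAL urls line, and 649 rows are still at the default. After an update the second number should be 0 if the assumption above holds.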

 


> Fetcher generates entries with status 0
> ---------------------------------------
>
>                 Key: NUTCH-864
>                 URL: https://issues.apache.org/jira/browse/NUTCH-864
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>         Environment: Gora with SQLBackend
>                      URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
>                      Last Changed Rev: 980748
>                      Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
>            Reporter: Julien Nioche
>            Assignee: Doğacan Güney
>             Fix For: 2.0
>
>
> After a round of fetching which got the following protocol statuses:
> 10/07/30 15:11:39 INFO mapred.JobClient:     ACCESS_DENIED=2
> 10/07/30 15:11:39 INFO mapred.JobClient:     SUCCESS=1177
> 10/07/30 15:11:39 INFO mapred.JobClient:     GONE=3
> 10/07/30 15:11:39 INFO mapred.JobClient:     TEMP_MOVED=138
> 10/07/30 15:11:39 INFO mapred.JobClient:     EXCEPTION=93
> 10/07/30 15:11:39 INFO mapred.JobClient:     MOVED=521
> 10/07/30 15:11:39 INFO mapred.JobClient:     NOTFOUND=62
> I ran: ./nutch org.apache.nutch.crawl.WebTableReader -stats
> 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 
> 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls:      2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: min score:       0.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score:       0.7587361
> 10/07/30 15:12:37 INFO crawl.WebTableReader: max score:       1.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched): 1177 (SUCCESS=1177)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone): 112
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry): 93 (EXCEPTION=93)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp): 138 (TEMP_MOVED=138)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm): 521 (MOVED=521)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
> There should not be any entries with status 0 (null).
> I will investigate a bit more...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.