[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896921#action_12896921 ]

Julien Nioche commented on NUTCH-864:
-------------------------------------

Thanks for the explanations.

{quote}
Redirect metadatum is used to mark it as a redirect. The idea is you can have long redirect chains but you want ONE representative URL for all of them. For example, yahoo.com may redirect to www.yahoo.com which may redirect to www.yahoo.com/index.html. In this case (IIRC), we designate www.yahoo.com to represent all the redirections.
{quote}

Yep, that's what URLUtil.chooseRepr() is used for. As far as I can see, the behaviour is mostly the same as in 1.x, i.e. we determine which of the source or target looks nicer and store it in the target only.

{quote}
The reason we do not mark them as unfetched is they may already be fetched. Continuing from the above example, www.yahoo.com may already be FETCHED. During update step, these URLs should be recognized and then injected if necessary.

I can see how it may be a bit unintuitive that until updatedb they are essentially status-less. Julien, any suggestions?
{quote}

We could use an explicit status code instead of relying on the default. In theory there should be no status 0 entries left after an update, so creating a status just for that may be overkill.
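
To make the explicit-status idea concrete, here is a minimal sketch. It is not Nutch code: the class, the in-memory map standing in for the WebTable, the method names, and the STATUS_REDIR_TARGET value are all hypothetical. Only the codes 0 and 2 are taken from the stats output quoted below; the value 1 for "unfetched" is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an in-memory map plays the role of the WebTable.
final class RedirectStatusSketch {
    // 0 and 2 match the stats output quoted in this issue; the other two
    // values are assumptions made only for this sketch.
    static final int STATUS_DEFAULT = 0;        // reported as "status 0 (null)"
    static final int STATUS_UNFETCHED = 1;      // assumed value
    static final int STATUS_FETCHED = 2;        // reported as "status_fetched"
    static final int STATUS_REDIR_TARGET = 100; // hypothetical explicit marker

    /** Fetch step: record a redirect target without overwriting an existing row. */
    static void recordRedirectTarget(Map<String, Integer> table, String target) {
        // A target that is already FETCHED (or anything else) keeps its status.
        table.putIfAbsent(target, STATUS_REDIR_TARGET);
    }

    /** Update step: promote pending redirect targets to unfetched; returns the count. */
    static int resolveRedirectTargets(Map<String, Integer> table) {
        int promoted = 0;
        for (Map.Entry<String, Integer> e : table.entrySet()) {
            if (e.getValue() == STATUS_REDIR_TARGET) {
                e.setValue(STATUS_UNFETCHED);
                promoted++;
            }
        }
        return promoted;
    }

    /** Sanity check: counts rows that still carry the default status. */
    static long countDefaultStatus(Map<String, Integer> table) {
        return table.values().stream().filter(s -> s == STATUS_DEFAULT).count();
    }

    public static void main(String[] args) {
        Map<String, Integer> table = new HashMap<>();
        table.put("http://www.yahoo.com/", STATUS_FETCHED);             // already fetched
        recordRedirectTarget(table, "http://www.yahoo.com/");           // left untouched
        recordRedirectTarget(table, "http://www.yahoo.com/index.html"); // new row
        System.out.println("promoted: " + resolveRedirectTargets(table));
        System.out.println("status 0 rows: " + countDefaultStatus(table));
    }
}
```

With a marker like this, rows created for redirect targets would show up in the stats under their own status between fetch and update, rather than as status 0, and any status 0 row remaining after an update would point to a real bug.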

> Fetcher generates entries with status 0
> ---------------------------------------
>
>                 Key: NUTCH-864
>                 URL: https://issues.apache.org/jira/browse/NUTCH-864
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>         Environment: Gora with SQLBackend
>                      URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
>                      Last Changed Rev: 980748
>                      Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
>            Reporter: Julien Nioche
>            Assignee: Doğacan Güney
>             Fix For: 2.0
>
>
> After a round of fetching which got the following protocol status :
> 10/07/30 15:11:39 INFO mapred.JobClient:     ACCESS_DENIED=2
> 10/07/30 15:11:39 INFO mapred.JobClient:     SUCCESS=1177
> 10/07/30 15:11:39 INFO mapred.JobClient:     GONE=3
> 10/07/30 15:11:39 INFO mapred.JobClient:     TEMP_MOVED=138
> 10/07/30 15:11:39 INFO mapred.JobClient:     EXCEPTION=93
> 10/07/30 15:11:39 INFO mapred.JobClient:     MOVED=521
> 10/07/30 15:11:39 INFO mapred.JobClient:     NOTFOUND=62
> I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats
> 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable:
> 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls: 2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: min score: 0.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score: 0.7587361
> 10/07/30 15:12:37 INFO crawl.WebTableReader: max score: 1.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched): 1177 (SUCCESS=1177)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone): 112
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry): 93 (EXCEPTION=93)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp): 138 (TEMP_MOVED=138)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm): 521 (MOVED=521)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
> There should not be any entries with status 0 (null)
> I will investigate a bit more...

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.