[ 
https://issues.apache.org/jira/browse/NUTCH-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Newcomb closed NUTCH-1658.
--------------------------------

    Resolution: Not A Problem

> Nutch mangles seed URLs and then reports on the mangled ones
> ------------------------------------------------------------
>
>                 Key: NUTCH-1658
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1658
>             Project: Nutch
>          Issue Type: Bug
>         Environment: Ubuntu 12.04
>            Reporter: Steve Newcomb
>              Labels: newbie
>             Fix For: 1.7
>
>
> Note: I'm using Nutch to verify that each of a long list of URIs is good, so 
> I use them all as seeds in single-iteration crawls.
> Some seed URIs are mangled by Nutch, and Nutch then reports on the mangled 
> versions (which are no good) instead of the original ones (which are good).  
> Two patterns have emerged from my tests:
> (1) If the query portion of the URI contains '//', it becomes '/', rendering 
> the resource unfetchable.  Example:
> https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html%3FshowingDetails=true&showingAll=false&sortProperty=agencyFormName&totalResults=1&keyword=apma&ascending=true&pageOffset=0
> (2) If the URI has a trailing '.', it disappears, apparently rendering the 
> resource unfetchable.  Example:
> http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.
> Both of the above are known good URIs.  When they are used as seeds, Nutch 
> 1.7 doesn't report on them, but instead reports on URIs that have been 
> mangled as described above.  In the '//' -> '/' case, Nutch reports that 
> robot access is denied, which is probably true.  In the trailing '.' case, 
> Nutch says there's no such resource, which is true, but it's not the question 
> I was trying to get Nutch to answer.
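
For anyone screening a seed list for the two patterns reported above, here is a minimal sketch in Python. It does not reproduce Nutch's own normalization; it only flags seeds that match the patterns as the reporter describes them ('//' inside the query portion, and a trailing '.'). The function name and the label strings are hypothetical, chosen for illustration.

```python
from urllib.parse import urlsplit


def risky_seed_patterns(url):
    """Return labels for the reported patterns that this seed URL matches.

    Hypothetical helper: it only inspects the URL text and makes no claim
    about how Nutch actually rewrites it.
    """
    hits = []
    # Pattern (1): the query portion contains '//'.
    if "//" in urlsplit(url).query:
        hits.append("double-slash-in-query")
    # Pattern (2): the URI ends with a '.'.
    if url.endswith("."):
        hits.append("trailing-dot")
    return hits


if __name__ == "__main__":
    seeds = [
        "http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.",
        "http://www.irs.gov/Individuals/",
    ]
    for seed in seeds:
        print(seed, risky_seed_patterns(seed))
```

Seeds that come back with a non-empty list are the ones to compare against what Nutch reports after the crawl.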



--
This message was sent by Atlassian JIRA
(v6.1#6144)
