[
https://issues.apache.org/jira/browse/NUTCH-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Newcomb closed NUTCH-1658.
--------------------------------
Resolution: Not A Problem
> Nutch mangles seed URLs and then reports on the mangled ones
> ------------------------------------------------------------
>
> Key: NUTCH-1658
> URL: https://issues.apache.org/jira/browse/NUTCH-1658
> Project: Nutch
> Issue Type: Bug
> Environment: Ubuntu 12.04
> Reporter: Steve Newcomb
> Labels: newbie
> Fix For: 1.7
>
>
> Note: I'm using Nutch to verify that each of a long list of URIs is good, so
> I use them all as seeds in a single-iteration crawl.
> Some seed URIs are mangled by Nutch, and Nutch then reports on the mangled
> versions (which are no good) instead of the original ones (which are good).
> Two patterns have emerged from my tests:
> (1) If the query portion of the URI contains '//', it becomes '/', rendering
> the resource unfetchable. Example:
> https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html%3FshowingDetails=true&showingAll=false&sortProperty=agencyFormName&totalResults=1&keyword=apma&ascending=true&pageOffset=0
> (2) If the URI has a trailing '.', it disappears, apparently rendering the
> resource unfetchable. Example:
> http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.
> Both of the above are known good URIs. When they are used as seeds, Nutch
> 1.7 doesn't report about them, but instead reports about URIs that have been
> mangled as described above. In the '//' -> '/' case, Nutch reports that
> robot access is denied, which is probably true. In the trailing '.' case,
> Nutch says there's no such resource, which is true, but it's not the question
> I was trying to get Nutch to answer.
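The two patterns in the report can be reproduced outside Nutch with a short sketch. The regular expressions below are assumptions chosen only to mimic the described behaviour ('//' collapsing to '/', and a trailing '.' being dropped); they are not copied from Nutch's configuration or source, and the helper names `mangle` and `changed_seeds` are hypothetical. The point is to show how a seed list could be checked in advance for URIs that such rewriting would alter, so that a report on a rewritten URI can be traced back to the original seed.

```python
import re

# Hypothetical stand-ins for the two rewrites described in the report.
# They are NOT taken from Nutch; they only imitate the observed effect.
COLLAPSE_SLASHES = re.compile(r"(?<!:)/{2,}")  # runs of '/' not directly after ':'
TRAILING_DOT = re.compile(r"\.$")              # a single '.' at the very end


def mangle(url):
    """Apply both assumed rewrites and return the resulting URL."""
    url = COLLAPSE_SLASHES.sub("/", url)
    return TRAILING_DOT.sub("", url)


def changed_seeds(seeds):
    """Return (original, rewritten) pairs for seeds the rewrites would alter."""
    return [(s, mangle(s)) for s in seeds if mangle(s) != s]


# The two seed URIs quoted in the report.
SEEDS = [
    "https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287"
    "&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/"
    "keywordSearchForms.html%3FshowingDetails=true&showingAll=false"
    "&sortProperty=agencyFormName&totalResults=1&keyword=apma"
    "&ascending=true&pageOffset=0",
    "http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.",
]

if __name__ == "__main__":
    for original, rewritten in changed_seeds(SEEDS):
        print("seed:     ", original)
        print("rewritten:", rewritten)
```

Under these assumed rules, the scheme separator in `https://` is left alone because the lookbehind skips slashes that directly follow a colon, while the `//` after `%3A` in the query string is collapsed.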