Steve Newcomb created NUTCH-1658:
------------------------------------
Summary: Nutch mangles seed URLs and then reports on the mangled ones
Key: NUTCH-1658
URL: https://issues.apache.org/jira/browse/NUTCH-1658
Project: Nutch
Issue Type: Bug
Environment: Ubuntu 12.04
Reporter: Steve Newcomb
Fix For: 1.7
Note: I'm using Nutch to verify that each URI in a long list is good, so I
use them all as seeds in a single-iteration crawl.
Some seed URIs are mangled by Nutch, and Nutch then reports on the mangled
versions (which are no good) instead of the original ones (which are good).
Two patterns have emerged from my tests:
(1) If the query portion of the URI contains '//', the '//' is collapsed to
'/', rendering the resource unfetchable. Example:
https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html%3FshowingDetails=true&showingAll=false&sortProperty=agencyFormName&totalResults=1&keyword=apma&ascending=true&pageOffset=0
(2) If the URI has a trailing '.', the '.' is dropped, apparently rendering
the resource unfetchable. Example:
http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.
Both of the above are known good URIs. When they are used as seeds, Nutch 1.7
doesn't report on them, but instead reports on URIs that have been mangled as
described above. In the '//' -> '/' case, Nutch reports that robot access is
denied, which is probably true of the mangled URI. In the trailing '.' case,
Nutch says there's no such resource, which is true of the mangled URI, but
that's not the question I was trying to get Nutch to answer.
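
The two patterns above can be modelled in a few lines in order to flag which
seeds in a list would be affected. This is only a sketch under stated
assumptions: it is not Nutch code, it does not claim to reproduce whatever
Nutch does internally, and the names mangle_like_report and affected_seeds are
hypothetical. It simply applies the two observed rewrites ('//' -> '/' in the
query, and removal of one trailing '.' from the path) and lists the seeds
whose text changes.

```python
from urllib.parse import urlsplit, urlunsplit

# The two known-good seed URIs quoted in this report.
SEEDS = [
    "https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287"
    "&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/"
    "keywordSearchForms.html%3FshowingDetails=true&showingAll=false"
    "&sortProperty=agencyFormName&totalResults=1&keyword=apma"
    "&ascending=true&pageOffset=0",
    "http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.",
]


def mangle_like_report(url: str) -> str:
    """Apply the two rewrites observed in this report.

    Hypothetical model of the symptoms only, not Nutch's implementation.
    """
    parts = urlsplit(url)
    # Pattern (1): '//' inside the query portion becomes '/'.
    query = parts.query.replace("//", "/")
    # Pattern (2): a single trailing '.' on the path disappears.
    path = parts.path[:-1] if parts.path.endswith(".") else parts.path
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


def affected_seeds(seeds):
    """Return (original, mangled) pairs for seeds the model would change."""
    pairs = []
    for seed in seeds:
        mangled = mangle_like_report(seed)
        if mangled != seed:
            pairs.append((seed, mangled))
    return pairs


if __name__ == "__main__":
    for original, mangled in affected_seeds(SEEDS):
        print("seed:   ", original)
        print("mangled:", mangled)
```

Running a seed list through affected_seeds before a crawl shows which entries
to compare by hand against the URIs that appear in Nutch's report.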