Steve Newcomb created NUTCH-1658:
------------------------------------

             Summary: Nutch mangles seed URLs and then reports on the mangled ones
                 Key: NUTCH-1658
                 URL: https://issues.apache.org/jira/browse/NUTCH-1658
             Project: Nutch
          Issue Type: Bug
         Environment: Ubuntu 12.04
            Reporter: Steve Newcomb
             Fix For: 1.7


Note: I'm using Nutch to verify that each of a long list of URIs is good, so I 
use them all as seeds in a single-iteration crawl.

Some seed URIs are mangled by Nutch, and Nutch then reports on the mangled 
versions (which are no good) instead of the original ones (which are good).  
Two patterns have emerged from my tests:

(1) If the query portion of the URI contains '//', it becomes '/', rendering 
the resource unfetchable.  Example:

https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html%3FshowingDetails=true&showingAll=false&sortProperty=agencyFormName&totalResults=1&keyword=apma&ascending=true&pageOffset=0

(2) If the URI has a trailing '.', it disappears, apparently rendering the 
resource unfetchable.  Example:

http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.

Both of the above are known good URIs.  When they are used as seeds, Nutch 1.7 
does not report on them, but instead reports on the URIs it has mangled as 
described above.  (In the '//' -> '/' case, Nutch reports that robot access is 
denied, which is probably true.  In the trailing '.' case, Nutch says there's 
no such resource, which is true, but it's not the question I was trying to get 
Nutch to answer.)
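
To find out which seeds in a list would be hit by the two patterns described above, something like the following sketch may help. It is only an imitation of the observed rewrites, not Nutch's actual normalization code; the function names `mangle_like_report` and `affected_seeds` and the example.org URLs are hypothetical.

```python
import re


def mangle_like_report(url: str) -> str:
    """Imitate the two rewrites observed in this report (not Nutch's code)."""
    # Pattern (1): collapse a run of slashes to one, unless the run directly
    # follows ':' (so the '//' in 'http://' survives, but '%3A//' does not).
    rewritten = re.sub(r"(?<!:)/{2,}", "/", url)
    # Pattern (2): drop a single trailing '.'.
    if rewritten.endswith("."):
        rewritten = rewritten[:-1]
    return rewritten


def affected_seeds(seeds):
    """Return (original, rewritten) pairs for seeds that would change."""
    pairs = []
    for seed in seeds:
        rewritten = mangle_like_report(seed)
        if rewritten != seed:
            pairs.append((seed, rewritten))
    return pairs


if __name__ == "__main__":
    seeds = [
        "https://example.org/form?next=https%3A//example.org/search",
        "http://example.org/Policy-Summary-for-2013.",
        "http://example.org/unaffected",
    ]
    for original, rewritten in affected_seeds(seeds):
        print(original, "->", rewritten)
```

Keeping the original next to the rewritten form makes it possible to map a crawl report entry back to the seed it came from.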





--
This message was sent by Atlassian JIRA
(v6.1#6144)
