Sebastian Nagel created NUTCH-2935:
--------------------------------------

             Summary: DeduplicationJob: failure on URLs with invalid percent encoding
                 Key: NUTCH-2935
                 URL: https://issues.apache.org/jira/browse/NUTCH-2935
             Project: Nutch
          Issue Type: Bug
          Components: crawldb
    Affects Versions: 1.18
            Reporter: Sebastian Nagel
            Assignee: Sebastian Nagel
             Fix For: 1.19


The DeduplicationJob may fail with an IllegalArgumentException on invalid 
percent encodings in URLs:
{noformat}
2021-11-25 04:36:41,747 INFO mapreduce.Job: Task Id : attempt_1637669672674_0018_r_000193_0, Status : FAILED
Error: java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: "YR"
        at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
        at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
        at org.apache.nutch.crawl.DeduplicationJob$DedupReducer.getDuplicate(DeduplicationJob.java:211)
...
Exception in thread "main" java.lang.RuntimeException: Crawl job did not succeed, job status:FAILED, reason: Task failed task_1637669672674_0018_r_000193
Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 killedReduces: 0
{noformat}

The IllegalArgumentException should be caught and logged. If only one of the 
two URLs with duplicated content is invalid, the invalid URL should be flagged 
as the duplicate so that the valid URL "survives".
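
The proposed handling could look roughly like the sketch below. It is not the actual Nutch patch: the class DedupUrlDecodeSketch and the methods decodeOrNull and duplicateByValidity are hypothetical names chosen for illustration, and logging goes to stderr instead of the job's logger. It only covers the case where URL validity decides which entry is the duplicate. When both URLs are valid or both are invalid, it returns null, and the job's other comparison criteria would have to apply.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper class for illustration, not part of Nutch.
class DedupUrlDecodeSketch {

  /** Decodes a URL; logs and returns null if its percent-encoding is invalid. */
  static String decodeOrNull(String url) {
    try {
      return URLDecoder.decode(url, "UTF-8");
    } catch (IllegalArgumentException | UnsupportedEncodingException e) {
      // A real job would use its logger; stderr keeps the sketch dependency-free.
      System.err.println("Cannot decode URL " + url + ": " + e.getMessage());
      return null;
    }
  }

  /**
   * Returns the URL to flag as duplicate when exactly one of the two URLs
   * cannot be decoded, or null when validity alone does not decide.
   */
  static String duplicateByValidity(String urlA, String urlB) {
    boolean validA = decodeOrNull(urlA) != null;
    boolean validB = decodeOrNull(urlB) != null;
    if (validA == validB) {
      return null; // both valid or both invalid: fall back to other criteria
    }
    return validA ? urlB : urlA; // the undecodable URL becomes the duplicate
  }

  public static void main(String[] args) {
    String valid = "https://example.org/a%20b";
    String invalid = "https://example.org/a%YRb"; // "%YR" is not a valid hex escape
    System.out.println("Flag as duplicate: " + duplicateByValidity(valid, invalid));
  }
}
```

Catching the exception inside the decoding helper keeps a single malformed URL from failing the whole reduce task, which is the failure shown in the log above.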



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
