[
https://issues.apache.org/jira/browse/NUTCH-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2935.
------------------------------------
Resolution: Fixed
> DeduplicationJob: failure on URLs with invalid percent encoding
> ---------------------------------------------------------------
>
> Key: NUTCH-2935
> URL: https://issues.apache.org/jira/browse/NUTCH-2935
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.18
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.19
>
>
> The DeduplicationJob may fail with an IllegalArgumentException on invalid
> percent encodings in URLs:
> {noformat}
> 2021-11-25 04:36:41,747 INFO mapreduce.Job: Task Id : attempt_1637669672674_0018_r_000193_0, Status : FAILED
> Error: java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: "YR"
> at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
> at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
> at org.apache.nutch.crawl.DeduplicationJob$DedupReducer.getDuplicate(DeduplicationJob.java:211)
> ...
> Exception in thread "main" java.lang.RuntimeException: Crawl job did not succeed, job status:FAILED, reason: Task failed task_1637669672674_0018_r_000193
> Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 killedReduces: 0
> {noformat}
> The IllegalArgumentException should be caught and logged. If only one of the
> two URLs with duplicate content has an invalid percent encoding, that URL
> should be flagged as the duplicate so that the valid URL "survives".
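
The behaviour proposed above could be sketched roughly as follows. This is not the code committed for NUTCH-2935, only a minimal standalone illustration: the class DedupDecodeSketch and the methods decodeOrNull and pickDuplicate are hypothetical names, the "longer decoded URL is the duplicate" rule is a stand-in for whatever comparison the job is configured to use, and the handling of two invalid URLs is an assumption, since the issue does not specify it. It uses URLDecoder.decode(String, Charset), which requires Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the actual DeduplicationJob code.
class DedupDecodeSketch {

  /** Returns the decoded URL, or null if its percent encoding is invalid. */
  static String decodeOrNull(String url) {
    try {
      return URLDecoder.decode(url, StandardCharsets.UTF_8);
    } catch (IllegalArgumentException e) {
      // Catch and log instead of letting the reduce task fail.
      System.err.println("Invalid percent encoding in " + url + ": " + e.getMessage());
      return null;
    }
  }

  /** Returns whichever of the two URLs should be flagged as the duplicate. */
  static String pickDuplicate(String urlA, String urlB) {
    String a = decodeOrNull(urlA);
    String b = decodeOrNull(urlB);
    // Exactly one URL is invalid: flag it so that the valid URL survives.
    if (a == null && b != null) {
      return urlA;
    }
    if (b == null && a != null) {
      return urlB;
    }
    // Assumption: if both are invalid, compare the raw, undecoded strings.
    if (a == null) {
      a = urlA;
      b = urlB;
    }
    // Stand-in comparison: the longer URL is the duplicate; on a tie, urlB.
    return a.length() > b.length() ? urlA : urlB;
  }

  public static void main(String[] args) {
    String invalid = "https://example.org/a%YRb";
    String valid = "https://example.org/a-longer-but-valid-path";
    System.out.println("duplicate: " + pickDuplicate(invalid, valid));
  }
}
```

Returning null from the decode helper keeps the "is this URL valid" decision in one place, so the comparison logic never has to handle the exception itself.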