Sebastian Nagel created NUTCH-2935:
--------------------------------------
Summary: DeduplicationJob: failure on URLs with invalid percent encoding
Key: NUTCH-2935
URL: https://issues.apache.org/jira/browse/NUTCH-2935
Project: Nutch
Issue Type: Bug
Components: crawldb
Affects Versions: 1.18
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.19
The DeduplicationJob may fail with an IllegalArgumentException on invalid
percent encodings in URLs:
{noformat}
2021-11-25 04:36:41,747 INFO mapreduce.Job: Task Id : attempt_1637669672674_0018_r_000193_0, Status : FAILED
Error: java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: "YR"
    at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
    at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
    at org.apache.nutch.crawl.DeduplicationJob$DedupReducer.getDuplicate(DeduplicationJob.java:211)
    ...
Exception in thread "main" java.lang.RuntimeException: Crawl job did not succeed, job status:FAILED, reason: Task failed task_1637669672674_0018_r_000193
Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 killedReduces: 0
{noformat}
The IllegalArgumentException should be caught and logged. If only one of the
two URLs with duplicated content is invalid, it should be flagged as the
duplicate so that the valid URL "survives".
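
A minimal sketch of the proposed behavior, not the actual patch: the class
PercentDecodeSketch and its methods decodeOrNull and duplicateByValidity are
hypothetical names and not part of Nutch. It assumes Java 10 or later (for
URLDecoder.decode(String, Charset)) and uses System.err as a stand-in for the
job's logger.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Nutch: tolerate invalid percent encodings. */
class PercentDecodeSketch {

  /** Returns the decoded URL, or null if its percent encoding is invalid. */
  static String decodeOrNull(String url) {
    try {
      // URLDecoder.decode(String, Charset) requires Java 10 or later.
      return URLDecoder.decode(url, StandardCharsets.UTF_8);
    } catch (IllegalArgumentException e) {
      // Log and continue instead of letting the reduce task fail.
      System.err.println("Invalid percent encoding in URL " + url + ": " + e.getMessage());
      return null;
    }
  }

  /**
   * Returns the URL to flag as duplicate when exactly one of the two URLs
   * is invalid, or null if validity alone does not decide (both valid or
   * both invalid), so the caller can fall back to its usual comparison.
   */
  static String duplicateByValidity(String urlA, String urlB) {
    boolean validA = decodeOrNull(urlA) != null;
    boolean validB = decodeOrNull(urlB) != null;
    if (validA && !validB) {
      return urlB; // urlA survives
    }
    if (!validA && validB) {
      return urlA; // urlB survives
    }
    return null;
  }
}
```

For example, duplicateByValidity("http://example.com/a%20b",
"http://example.com/a%YRb") returns the second URL, because "%YR" is not a
valid escape sequence. How to break the tie when both URLs are invalid is left
open here.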