Markus Jelsma created NUTCH-3109:
------------------------------------

             Summary: Unable to update CrawlDB due to URL normalization
                 Key: NUTCH-3109
                 URL: https://issues.apache.org/jira/browse/NUTCH-3109
             Project: Nutch
          Issue Type: Bug
            Reporter: Markus Jelsma

I added new normalization rules to a custom normalizer plugin, as I routinely do; nothing out of the ordinary. Updating the CrawlDB with -normalize then failed with this:

{code:java}
2025-03-24 08:01:23,166 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.AbstractMethodError: Receiver class java.lang.ArrayIndexOutOfBoundsException does not define or inherit an implementation of the resolved method 'java.lang.String toString()' of class java.lang.Object.
    at java.base/java.lang.String.valueOf(String.java:2951)
    at java.base/java.lang.StringBuilder.append(StringBuilder.java:172)
    at org.apache.nutch.crawl.CrawlDbFilter.map(CrawlDbFilter.java:101)
    at org.apache.nutch.crawl.CrawlDbFilter.map(CrawlDbFilter.java:37)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:800)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:348)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)
{code}

The offending line of code is the LOG.warn below. The string concatenation with e goes through String.valueOf(e), which calls e.toString(), and that is where the error is thrown:

{code:java}
if (url != null && urlNormalizers) {
  try {
    url = normalizers.normalize(url, scope); // normalize the url
  } catch (Exception e) {
    LOG.warn("Skipping " + url + ":" + e);
    url = null;
  }
}
{code}
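
As a possible stopgap only, the log statement could avoid trusting the exception's toString(). The following is a minimal sketch, not a fix for the root cause, which this report does not identify. The class SafeThrowableDescription, the helper describe and the test class BrokenToString are hypothetical names used for illustration and are not part of Nutch or Hadoop.

```java
// Hypothetical illustration; none of these names exist in Nutch.
class SafeThrowableDescription {

    /**
     * Describes a throwable for logging without assuming that its
     * toString() can be invoked. AbstractMethodError is a LinkageError,
     * so it is not caught by "catch (Exception e)".
     */
    public static String describe(Throwable t) {
        if (t == null) {
            return "null";
        }
        try {
            return String.valueOf(t); // same call the string concatenation makes
        } catch (LinkageError | RuntimeException err) {
            // Fall back to the class name, which does not call toString() on t.
            return t.getClass().getName()
                + " (toString() failed: " + err.getClass().getName() + ")";
        }
    }

    /** Simulates the failure in the report: toString() throws AbstractMethodError. */
    public static class BrokenToString extends RuntimeException {
        @Override
        public String toString() {
            throw new AbstractMethodError("simulated");
        }
    }

    public static void main(String[] args) {
        String url = "http://example.org/"; // placeholder URL for the demo
        // A well-behaved exception is described as usual.
        System.out.println("Skipping " + url + ":" + describe(new IllegalStateException("boom")));
        // A broken one no longer kills the caller; the fallback text is logged instead.
        System.out.println("Skipping " + url + ":" + describe(new BrokenToString()));
    }
}
```

In CrawlDbFilter this would amount to replacing `":" + e` with `":" + describe(e)`, assuming such a helper were added. It would keep the map task alive and make the skipped URL visible, but it would not explain why ArrayIndexOutOfBoundsException fails to resolve toString() in the first place.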