[
https://issues.apache.org/jira/browse/NUTCH-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951775#comment-17951775
]
Markus Jelsma commented on NUTCH-3109:
--------------------------------------
Yes, I am also truly puzzled by it. This is on Java 11 and Hadoop 3.3.5.
{code:java}
openjdk version "11.0.20" 2023-07-18
OpenJDK Runtime Environment (build 11.0.20+8-post-Debian-1deb11u1)
OpenJDK 64-Bit Server VM (build 11.0.20+8-post-Debian-1deb11u1, mixed mode, sharing)
{code}
Between the first report and now, the affected CrawlDB has been written
thousands of times and used in even more jobs. The problem is still there.
> Unable to update CrawlDB due to URL normalization
> -------------------------------------------------
>
> Key: NUTCH-3109
> URL: https://issues.apache.org/jira/browse/NUTCH-3109
> Project: Nutch
> Issue Type: Bug
> Reporter: Markus Jelsma
> Priority: Major
>
> I routinely added new normalization rules in a custom normalizer plugin,
> nothing out of the ordinary. Updating the CrawlDB with -normalize just got me
> this:
> {code:java}
> 2025-03-24 08:01:23,166 ERROR [main] org.apache.hadoop.mapred.YarnChild:
> Error running child : java.lang.AbstractMethodError: Receiver class
> java.lang.ArrayIndexOutOfBoundsException does not define or inherit an
> implementation of the resolved method 'java.lang.String toString()' of class
> java.lang.Object.
> at java.base/java.lang.String.valueOf(String.java:2951)
> at java.base/java.lang.StringBuilder.append(StringBuilder.java:172)
> at org.apache.nutch.crawl.CrawlDbFilter.map(CrawlDbFilter.java:101)
> at org.apache.nutch.crawl.CrawlDbFilter.map(CrawlDbFilter.java:37)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:800)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:348)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)
> {code}
>
> The offending line of code is the LOG.warn below:
>
> {code:java}
> if (url != null && urlNormalizers) {
>   try {
>     url = normalizers.normalize(url, scope); // normalize the url
>   } catch (Exception e) {
>     LOG.warn("Skipping " + url + ":" + e);
>     url = null;
>   }
> }
> {code}
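
Per the stack trace above, the AbstractMethodError is raised while the LOG.warn message is being built (String.valueOf(e) inside StringBuilder.append), that is, inside the catch block itself. Because it is an Error and not an Exception, nothing catches it and the map task dies. The following is only a minimal sketch of a defensive workaround. It does not explain why the JVM raises the error in the first place. ThrowableDescriber and describe() are hypothetical names and not part of Nutch.

```java
// Hypothetical helper, NOT part of Nutch: turns a Throwable into a log-safe
// string without letting a linkage failure in toString() escape.
final class ThrowableDescriber {

  private ThrowableDescriber() {
    // static helper, no instances
  }

  static String describe(Throwable t) {
    if (t == null) {
      return "null";
    }
    try {
      // Normal case: same text that string concatenation would produce.
      return t.toString();
    } catch (LinkageError err) {
      // AbstractMethodError is a subclass of LinkageError. Fall back to the
      // class name, which does not require calling toString() on t.
      return t.getClass().getName()
          + " (toString() failed: " + err.getClass().getSimpleName() + ")";
    }
  }
}
```

With such a helper, the warning could be written as LOG.warn("Skipping " + url + ":" + ThrowableDescriber.describe(e)), so that a failing toString() costs a less informative log line and not the whole task. Whether this actually avoids the error seen on this cluster is untested.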