[jira] [Commented] (NUTCH-3109) Unable to update CrawlDB due to URL normalization

Sebastian Nagel (Jira) Thu, 27 Mar 2025 10:45:59 -0700


    [ https://issues.apache.org/jira/browse/NUTCH-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939029#comment-17939029 ]


Sebastian Nagel commented on NUTCH-3109:
----------------------------------------

Hi [~markus],

Could you share more context, e.g. the Java version, the Hadoop version, etc.?
The exception looks very odd.

I'm unable to reproduce the issue:
- [ArrayIndexOutOfBoundsException|https://docs.oracle.com/javase/8/docs/api/java/lang/ArrayIndexOutOfBoundsException.html] inherits toString() from Throwable
- tried the following with Java 8, 11, 17, 21:
{code}
int[] a = { };
try {
  System.out.println("" + a[0]);
} catch (Exception e) {
  System.err.println( "Got: " + e );
}
{code}
- it just works: the exception is caught and printed, no AbstractMethodError
- CrawlDbFilter should use parameterized logging
({{LOG.warn("Skipping {}: ", url, e);}}), but that's not the solution: it
would fail the same way
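
For reference, a self-contained variant of the snippet above, as a minimal
sketch only: the class name ToStringRepro, the method describe() and the
example URL are made up for illustration and are not part of Nutch. It uses the
same string concatenation pattern as the LOG.warn call in CrawlDbFilter, so the
exception's toString() is invoked implicitly, and it prints the Java version so
results from different runtimes can be compared.

```java
// Hypothetical stand-alone reproduction attempt; class and method names are
// made up for illustration and are not part of Nutch.
class ToStringRepro {

  // Same pattern as the LOG.warn call in CrawlDbFilter: concatenating "+ e"
  // converts the exception to a string by calling e.toString().
  static String describe(String url, int[] a) {
    try {
      return "Value: " + a[0];
    } catch (Exception e) {
      return "Skipping " + url + ":" + e;
    }
  }

  public static void main(String[] args) {
    // Accessing index 0 of an empty array throws ArrayIndexOutOfBoundsException.
    System.err.println(describe("http://example.org/", new int[0]));
    // Report the runtime, since the behavior in question may depend on it.
    System.err.println("Java version: " + System.getProperty("java.version"));
  }
}
```

On a working JVM the first line starts with "Skipping http://example.org/:"
followed by the exception's class name and message. The exact message text
differs between Java versions.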

> Unable to update CrawlDB due to URL normalization
> -------------------------------------------------
>
>                 Key: NUTCH-3109
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3109
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Markus Jelsma
>            Priority: Major
>
> I routinely added new normalization rules in a custom normalizer plugin, 
> nothing out of the ordinary. Updating the CrawlDB with -normalize just got me 
> this:
> {code:java}
> 2025-03-24 08:01:23,166 ERROR [main] org.apache.hadoop.mapred.YarnChild: 
> Error running child : java.lang.AbstractMethodError: Receiver class 
> java.lang.ArrayIndexOutOfBoundsException does not define or inherit an 
> implementation of the resolved method 'java.lang.String toString()' of class 
> java.lang.Object.
>       at java.base/java.lang.String.valueOf(String.java:2951)
>       at java.base/java.lang.StringBuilder.append(StringBuilder.java:172)
>       at org.apache.nutch.crawl.CrawlDbFilter.map(CrawlDbFilter.java:101)
>       at org.apache.nutch.crawl.CrawlDbFilter.map(CrawlDbFilter.java:37)
>       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:800)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:348)
>       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
>       at java.base/java.security.AccessController.doPrivileged(Native Method)
>       at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172) {code}
>  
> The offending line of code is the LOG.warn below:
>  
> {code:java}
>     if (url != null && urlNormalizers) {
>       try {
>         url = normalizers.normalize(url, scope); // normalize the url
>       } catch (Exception e) {
>         LOG.warn("Skipping " + url + ":" + e);
>         url = null;
>       } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
