[
https://issues.apache.org/jira/browse/NUTCH-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2315:
-----------------------------------
Attachment: NUTCH-2315-2.3.1-1.patch
Normally, invalid URLs should be filtered out, or fixed by URL normalizers,
during parsing or in the Fetcher when following redirects. However, if URL
filters/normalizers are missing, incomplete, or misconfigured, invalid outlinks
may survive. The attached patch catches the MalformedURLException and prevents
the DbUpdaterJob from failing.
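
For illustration only, the general idea of catching the exception per outlink
and skipping the bad URL could look roughly like the sketch below. This is not
the attached patch: the class OutlinkGuard and the method reverseValidOutlinks
are hypothetical names, and reverseUrl here is a simplified stand-in for
org.apache.nutch.util.TableUtil.reverseUrl, assuming only that it builds a
java.net.URL from the string (as the stack trace below suggests).

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch.
class OutlinkGuard {

  // Simplified stand-in for TableUtil.reverseUrl (assumption): reverses the
  // host components and appends the protocol and the file part.
  static String reverseUrl(String urlString) throws MalformedURLException {
    // Throws MalformedURLException ("no protocol: ...") for strings such as
    // a percent-encoded URL that contains no literal ':'.
    URL url = new URL(urlString);
    String[] parts = url.getHost().split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      sb.append(parts[i]);
      if (i > 0) {
        sb.append('.');
      }
    }
    sb.append(':').append(url.getProtocol());
    sb.append(url.getFile());
    return sb.toString();
  }

  // Reverses every outlink that parses as a URL. A malformed outlink is
  // logged and skipped instead of letting the exception abort the caller.
  static List<String> reverseValidOutlinks(List<String> outlinks) {
    List<String> keys = new ArrayList<>();
    for (String outlink : outlinks) {
      try {
        keys.add(reverseUrl(outlink));
      } catch (MalformedURLException e) {
        System.err.println("Skipping malformed outlink: " + e.getMessage());
      }
    }
    return keys;
  }
}
```

With this shape, one invalid outlink costs a log line rather than a failed map
task. Properly configured URL filters/normalizers remain the first line of
defense.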
> UpdateDb jobs fails everytime (Nutch 2.3.1)
> -------------------------------------------
>
> Key: NUTCH-2315
> URL: https://issues.apache.org/jira/browse/NUTCH-2315
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.3.1
> Environment: I am using it with Hadoop 2.7.1 + Mongo DB + Yarn + Gora 0.61
> Reporter: Shubham Gupta
> Labels: newbie
> Fix For: 2.4
>
> Attachments: NUTCH-2315-2.3.1-1.patch
>
>
> Hey,
> Whenever I run the update job, the following error occurs:
> INFO mapreduce.Job: Task Id : attempt_1473832356852_0107_m_000000_2, Status : FAILED
> Error: java.net.MalformedURLException: no protocol:
> http%3A%2F%2Fwww.smh.com.au%2Fact-news%2Fcanberra-weather-warm-april-expected-after-record-breaking-march-temperatures-20160401-gnw2pg.html&title=Canberra+weather%3A+warm+April+expected+after+record+breaking+March+temperatures&source=The+Sydney+Morning+Herald&summary=Canberra+can+expect+warmer+than+average+temperatures+to+continue+for+April+after+enjoying+its+equal+second+warmest+March+on+record
> at java.net.URL.<init>(URL.java:586)
> at java.net.URL.<init>(URL.java:483)
> at java.net.URL.<init>(URL.java:432)
> at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
> at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
> at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> 16/09/15 12:44:35 INFO mapreduce.Job: map 100% reduce 100%
> 16/09/15 12:44:36 INFO mapreduce.Job: Job job_1473832356852_0107 failed with state FAILED due to: Task failed task_1473832356852_0107_m_000000
> Job failed as tasks failed. failedMaps:1 failedReduces:0
> 16/09/15 12:44:36 INFO mapreduce.Job: Counters: 8
> Job Counters
> Failed map tasks=4
> Launched map tasks=4
> Other local map tasks=4
> Total time spent by all maps in occupied slots (ms)=388304
> Total time spent by all reduces in occupied slots (ms)=0
> Total time spent by all map tasks (ms)=55472
> Total vcore-seconds taken by all map tasks=55472
> Total megabyte-seconds taken by all map tasks=198145984
> Exception in thread "main" java.lang.RuntimeException: job failed: name=[rss]update-table, jobid=job_1473832356852_0107
> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
> at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
> at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
> at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)