[
https://issues.apache.org/jira/browse/NUTCH-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447865#comment-16447865
]
ASF GitHub Bot commented on NUTCH-2526:
---------------------------------------
sebastian-nagel opened a new pull request #324: NUTCH-2526 NPE in scoring-opic
when indexing document without CrawlDb datum
URL: https://github.com/apache/nutch/pull/324
- fix scoring-opic and scoring-link
- check whether the CrawlDb datum is null before reading its score
  (after NUTCH-2456, which allows indexing pages/URLs not contained
  in the CrawlDb); see the sketch after this list
- complete the Javadoc
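
The gist of the guard, as a minimal sketch (not the literal patch: the
signature follows the ScoringFilter interface, but falling back to
initScore when the CrawlDb datum is missing is an assumption, and
scorePower stands for the plugin's existing setting read from
indexer.score.power):

  @Override
  public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
      CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
      throws ScoringFilterException {
    if (dbDatum == null) {
      // page/URL not contained in the CrawlDb (possible since
      // NUTCH-2456): nothing to read, keep the score passed in
      return initScore;
    }
    return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
  }
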
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> NPE in scoring-opic when indexing document without CrawlDb datum
> ----------------------------------------------------------------
>
> Key: NUTCH-2526
> URL: https://issues.apache.org/jira/browse/NUTCH-2526
> Project: Nutch
> Issue Type: Improvement
> Components: parser, scoring
> Affects Versions: 1.14
> Reporter: Yash Thenuan
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.15
>
>
> I was trying to write a parse filter plugin that indexes the internal
> links of a page as separate documents. What I did, basically, is break
> the page into multiple parse results, each one holding the ParseText
> and ParseData corresponding to one internal link. I was able to parse
> them separately, but at scoring time an error occurred.
> I am attaching the indexing logs.
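>
> For illustration, a minimal sketch of such a filter (class name and
> details are illustrative, not my actual plugin; restricting the loop
> to internal links is omitted for brevity):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.nutch.parse.*;
> import org.apache.nutch.protocol.Content;
> import org.w3c.dom.DocumentFragment;
>
> public class InternalLinkParseFilter implements HtmlParseFilter {
>   private Configuration conf;
>
>   @Override
>   public ParseResult filter(Content content, ParseResult parseResult,
>       HTMLMetaTags metaTags, DocumentFragment doc) {
>     Parse parse = parseResult.get(content.getUrl());
>     for (Outlink link : parse.getData().getOutlinks()) {
>       // one ParseText/ParseData pair per link; these URLs were never
>       // fetched, so they have no CrawlDb datum at indexing time
>       ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS,
>           link.getAnchor(), new Outlink[0], content.getMetadata());
>       parseResult.put(new Text(link.getToUrl()),
>           new ParseText(link.getAnchor()), data);
>     }
>     return parseResult;
>   }
>
>   @Override
>   public void setConf(Configuration conf) { this.conf = conf; }
>
>   @Override
>   public Configuration getConf() { return conf; }
> }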
>
> 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
> 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
> 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20180307130959
> 2018-03-07 15:41:53,677 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2018-03-07 15:41:54,861 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> 2018-03-07 15:41:55,168 INFO client.AbstractJestClient - Setting server pool to a list of 1 servers: [http://localhost:9200]
> 2018-03-07 15:41:55,170 INFO client.JestClientFactory - Using multi thread/connection supporting pooling connection manager
> 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Using default GSON instance
> 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Node Discovery disabled...
> 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Idle connection reaping disabled...
> 2018-03-07 15:41:55,282 INFO elasticrest.ElasticRestIndexWriter - Processing remaining requests [docs = 1, length = 210402, total docs = 1]
> 2018-03-07 15:41:55,361 INFO elasticrest.ElasticRestIndexWriter - Processing to finalize last execute
> 2018-03-07 15:41:55,458 INFO elasticrest.ElasticRestIndexWriter - Previous took in ms 175, including wait 97
> 2018-03-07 15:41:55,468 WARN mapred.LocalJobRunner - job_local1561152089_0001
> java.lang.Exception: java.lang.NullPointerException
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.NullPointerException
>         at org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
>         at org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
>         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
>         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
>         at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)