[
https://issues.apache.org/jira/browse/NUTCH-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447865#comment-16447865
]
ASF GitHub Bot commented on NUTCH-2526:
---------------------------------------
sebastian-nagel opened a new pull request #324: NUTCH-2526 NPE in scoring-opic
when indexing document without CrawlDb datum
URL: https://github.com/apache/nutch/pull/324
- fix scoring-opic and scoring-link
- check whether the CrawlDb datum is null before reading its score
  (after NUTCH-2456, which allows indexing pages/URLs not contained
  in the CrawlDb); see the sketch after this list
- complete the Javadoc
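
The gist of the guard, as a minimal sketch (not the literal patch: the
signature follows the ScoringFilter interface, but falling back to
initScore when the CrawlDb datum is missing is an assumption, and
scorePower stands for the plugin's existing setting read from
indexer.score.power):

  @Override
  public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
      CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
      throws ScoringFilterException {
    if (dbDatum == null) {
      // page/URL not contained in the CrawlDb (possible since
      // NUTCH-2456): nothing to read, keep the score passed in
      return initScore;
    }
    return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
  }
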
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> NPE in scoring-opic when indexing document without CrawlDb datum
> ----------------------------------------------------------------
>
> Key: NUTCH-2526
> URL: https://issues.apache.org/jira/browse/NUTCH-2526
> Project: Nutch
> Issue Type: Improvement
> Components: parser, scoring
> Affects Versions: 1.14
> Reporter: Yash Thenuan
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.15
>
>
> I was trying to write a parse filter plugin that indexes the internal
> links of a page as separate documents. What I did, basically, is break
> the page into multiple parse results, each one holding the ParseText
> and ParseData corresponding to one internal link. I was able to parse
> them separately, but at scoring time an error occurred.
> I am attaching the indexing logs.
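>
> For illustration, a minimal sketch of such a filter (class name and
> details are illustrative, not my actual plugin; restricting the loop
> to internal links is omitted for brevity):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.nutch.parse.*;
> import org.apache.nutch.protocol.Content;
> import org.w3c.dom.DocumentFragment;
>
> public class InternalLinkParseFilter implements HtmlParseFilter {
>   private Configuration conf;
>
>   @Override
>   public ParseResult filter(Content content, ParseResult parseResult,
>       HTMLMetaTags metaTags, DocumentFragment doc) {
>     Parse parse = parseResult.get(content.getUrl());
>     for (Outlink link : parse.getData().getOutlinks()) {
>       // one ParseText/ParseData pair per link; these URLs were never
>       // fetched, so they have no CrawlDb datum at indexing time
>       ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS,
>           link.getAnchor(), new Outlink[0], content.getMetadata());
>       parseResult.put(new Text(link.getToUrl()),
>           new ParseText(link.getAnchor()), data);
>     }
>     return parseResult;
>   }
>
>   @Override
>   public void setConf(Configuration conf) { this.conf = conf; }
>
>   @Override
>   public Configuration getConf() { return conf; }
> }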
>
> 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
> 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
> 2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20180307130959
> 2018-03-07 15:41:53,677 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2018-03-07 15:41:54,861 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> 2018-03-07 15:41:55,168 INFO client.AbstractJestClient - Setting server pool to a list of 1 servers: [http://localhost:9200]
> 2018-03-07 15:41:55,170 INFO client.JestClientFactory - Using multi thread/connection supporting pooling connection manager
> 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Using default GSON instance
> 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Node Discovery disabled...
> 2018-03-07 15:41:55,238 INFO client.JestClientFactory - Idle connection reaping disabled...
> 2018-03-07 15:41:55,282 INFO elasticrest.ElasticRestIndexWriter - Processing remaining requests [docs = 1, length = 210402, total docs = 1]
> 2018-03-07 15:41:55,361 INFO elasticrest.ElasticRestIndexWriter - Processing to finalize last execute
> 2018-03-07 15:41:55,458 INFO elasticrest.ElasticRestIndexWriter - Previous took in ms 175, including wait 97
> 2018-03-07 15:41:55,468 WARN mapred.LocalJobRunner - job_local1561152089_0001
> java.lang.Exception: java.lang.NullPointerException
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.NullPointerException
>         at org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
>         at org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
>         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
>         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
>         at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)