Yash Thenuan created NUTCH-2526:
-----------------------------------
Summary: scoring-opic causes errors while indexing some
documents that were generated at parse time.
Key: NUTCH-2526
URL: https://issues.apache.org/jira/browse/NUTCH-2526
Project: Nutch
Issue Type: Improvement
Components: parser, scoring
Affects Versions: 1.14
Reporter: Yash Thenuan
Fix For: 1.15
I was trying to write a parse filter plugin that parses internal links as
separate documents. Basically, I split the page into multiple parse results,
each with the ParseText and ParseData corresponding to one internal link.
I was able to parse them separately, but an error occurred during scoring
at indexing time.
I am attaching the indexing logs.
2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20180307130959
2018-03-07 15:41:53,677 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2018-03-07 15:41:54,861 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
2018-03-07 15:41:55,168 INFO client.AbstractJestClient - Setting server pool to a list of 1 servers: [http://localhost:9200]
2018-03-07 15:41:55,170 INFO client.JestClientFactory - Using multi thread/connection supporting pooling connection manager
2018-03-07 15:41:55,238 INFO client.JestClientFactory - Using default GSON instance
2018-03-07 15:41:55,238 INFO client.JestClientFactory - Node Discovery disabled...
2018-03-07 15:41:55,238 INFO client.JestClientFactory - Idle connection reaping disabled...
2018-03-07 15:41:55,282 INFO elasticrest.ElasticRestIndexWriter - Processing remaining requests [docs = 1, length = 210402, total docs = 1]
2018-03-07 15:41:55,361 INFO elasticrest.ElasticRestIndexWriter - Processing to finalize last execute
2018-03-07 15:41:55,458 INFO elasticrest.ElasticRestIndexWriter - Previous took in ms 175, including wait 97
2018-03-07 15:41:55,468 WARN mapred.LocalJobRunner - job_local1561152089_0001
java.lang.Exception: java.lang.NullPointerException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.NullPointerException
at org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
at org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
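The trace above shows the NullPointerException being thrown inside OPICScoringFilter.indexerScore. One possible reading, which the logs do not confirm, is that the documents generated at parse time have no crawldb entry, so the datum handed to the scoring filter is null. The sketch below only illustrates what a null guard for that case could look like. It is not the Nutch code: DatumSketch, OpicNullGuardSketch, and the score formula are hypothetical stand-ins chosen for illustration.

```java
public class OpicNullGuardSketch {

    // Hypothetical stand-in for a crawldb entry; NOT the Nutch CrawlDatum class.
    static final class DatumSketch {
        private final float score;

        DatumSketch(float score) {
            this.score = score;
        }

        float getScore() {
            return score;
        }
    }

    // Assumed shape of an indexer-score computation: score^power * initScore.
    // The formula is an assumption made for this sketch.
    static float indexerScore(DatumSketch dbDatum, float initScore, float scorePower) {
        if (dbDatum == null) {
            // No crawldb entry for this document (for example, a sub-document
            // created by a parse filter): fall back to the initial score
            // instead of dereferencing null.
            return initScore;
        }
        return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
    }

    public static void main(String[] args) {
        // Document without a crawldb entry: the guard returns initScore.
        System.out.println(indexerScore(null, 1.0f, 0.5f));
        // Document with a crawldb entry: the score is computed normally.
        System.out.println(indexerScore(new DatumSketch(4.0f), 1.0f, 0.5f));
    }
}
```

Falling back to the initial score is only one option; skipping such documents, or giving them an entry with a default score before indexing, would be alternatives to weigh.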
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)