[ 
https://issues.apache.org/jira/browse/NUTCH-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1596:
-----------------------------------

    Attachment: NUTCH-1596-v1.patch

Hi [~markus17], concurrent access is possible: with multiple fetcher threads, 
two of them can happen to run HeadingsParseFilter.filter() at the same time. 
The plugin isn't thread-safe because the DocumentFragment is "passed" from 
filter() to getElement() through a shared member variable, so one thread can 
replace the fragment while another thread is still walking it.
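
To illustrate the point, here is a minimal sketch of the safer pattern, in which 
the fragment is handed to the helper as a method argument instead of being stored 
in a field. It is not the content of the attached patch: the class name 
HeadingsSketch, the helper fragmentWithHeading() and the method signatures are 
assumptions made for this example, and a plain recursive walk stands in for 
Nutch's NodeWalker.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for a parse filter; not the real HeadingsParseFilter.
final class HeadingsSketch {

    // Unsafe pattern described in the comment: a field such as
    //   private DocumentFragment doc;
    // assigned in filter() and read in getElement() is shared by all threads
    // that use the same filter instance.

    // Safer pattern: the fragment only exists as a parameter / local variable,
    // so each call walks the tree it was given.
    String filter(DocumentFragment doc, String headingTag) {
        return getElement(doc, headingTag);
    }

    // Depth-first search for the first element with the given tag name.
    // Returns its trimmed text content, or null if no such element exists.
    private static String getElement(Node node, String tag) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && tag.equalsIgnoreCase(node.getNodeName())) {
            return node.getTextContent().trim();
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = getElement(children.item(i), tag);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Hypothetical helper that builds a one-element fragment for the demo.
    static DocumentFragment fragmentWithHeading(String tag, String text) {
        try {
            Document d = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            DocumentFragment fragment = d.createDocumentFragment();
            Node heading = d.createElement(tag);
            heading.setTextContent(text);
            fragment.appendChild(heading);
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HeadingsSketch sketch = new HeadingsSketch();
        DocumentFragment fragment = fragmentWithHeading("h1", "First");
        System.out.println(sketch.filter(fragment, "h1"));
    }
}
```

Because the sketch keeps no per-document state in fields, a single instance can 
be called from several threads as long as each thread passes its own fragment.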
                
> NodeWalker NPE on next node
> ---------------------------
>
>                 Key: NUTCH-1596
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1596
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.8
>
>         Attachments: NUTCH-1596-v1.patch
>
>
> The NodeWalker used by the HeadingsParseFilter sometimes reports a 
> NullPointerException.
> {code}
> 2013-07-02 11:02:09,428 WARN  parse.ParseUtil - Error parsing .... with org.apache.nutch.parse.tika.TikaParser@2c8b586a
> java.util.concurrent.ExecutionException: java.lang.NullPointerException
>         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:119)
>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:162)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:963)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:722)
> Caused by: java.lang.NullPointerException
>         at org.apache.xerces.dom.ParentNode.nodeListItem(Unknown Source)
>         at org.apache.xerces.dom.ParentNode.item(Unknown Source)
>         at org.apache.nutch.util.NodeWalker.nextNode(NodeWalker.java:75)
>         at org.apache.nutch.parse.headings.HeadingsParseFilter.getElement(HeadingsParseFilter.java:84)
>         at org.apache.nutch.parse.headings.HeadingsParseFilter.filter(HeadingsParseFilter.java:47)
>         at org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:98)
>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:210)
>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:722)
> {code}
> This is strange: it fails only rarely, the nextNode() method checks 
> hasNext(), and, if I'm correct, there is no concurrent access.
