[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110373#comment-15110373
 ] 

Markus Jelsma commented on NUTCH-961:
-------------------------------------

Hello - that doesn't seem related to this issue, as it doesn't interfere with 
how it's loaded. Also, we cannot reproduce it either locally or in Hadoop mode. 
However, a thread on the mailing list a couple of days ago also mentioned a 
problem like the one you describe. 

> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, 
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.



