[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987575#action_12987575
 ] 

Markus Jelsma commented on NUTCH-961:
-------------------------------------

Boilerpipe comes with several algorithms for stripping away boilerplate 
content. Although the ArticleExtractor is recommended, it fails for many 
types of pages. Pages such as news overviews with blocks and lists are 
extracted much better with the CanolaExtractor. This poses a problem: we 
cannot rely on a single configuration directive telling the parser which 
extractor to use for a whole crawl.

Some thoughts on how to deal with it:
- use Boilerpipe's estimator to automatically determine which extractor to use
- provide a way to override the estimator's wrong choices and hardcode which 
extractor to use for groups of URLs (not unlike the subcollection plugin)
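
To make the two ideas above concrete, here is a minimal sketch of how they 
could combine: ordered URL patterns act as hardcoded overrides, and the 
estimator's choice is used only when no pattern matches. Everything in it is 
an assumption for illustration: the class ExtractorSelector, its methods and 
the example URL pattern are hypothetical, extractors are referred to by name 
only, and the estimator's result is passed in as a plain string because the 
way Boilerpipe's estimator would be wired into Nutch is not settled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: chooses an extractor name for a URL. */
final class ExtractorSelector {

    /** One override rule: URLs matching the pattern use the named extractor. */
    private static final class Rule {
        final Pattern pattern;
        final String extractorName;

        Rule(Pattern pattern, String extractorName) {
            this.pattern = pattern;
            this.extractorName = extractorName;
        }
    }

    // Rules are checked in insertion order; the first match wins.
    private final List<Rule> overrides = new ArrayList<>();
    private final String fallback;

    ExtractorSelector(String fallback) {
        this.fallback = fallback;
    }

    /** Registers a hardcoded extractor for every URL containing a regex match. */
    ExtractorSelector addOverride(String regex, String extractorName) {
        overrides.add(new Rule(Pattern.compile(regex), extractorName));
        return this;
    }

    /**
     * Returns the override for the URL if one matches, else the estimator's
     * choice, else the fallback. 'estimated' may be null when no estimate
     * is available.
     */
    String select(String url, String estimated) {
        for (Rule rule : overrides) {
            if (rule.pattern.matcher(url).find()) {
                return rule.extractorName;
            }
        }
        return estimated != null ? estimated : fallback;
    }

    public static void main(String[] args) {
        ExtractorSelector selector = new ExtractorSelector("ArticleExtractor")
                .addOverride("^https?://news\\.example\\.org/overview/", "CanolaExtractor");

        // The override wins even though the (assumed) estimate says otherwise.
        System.out.println(selector.select(
                "http://news.example.org/overview/today", "ArticleExtractor"));
        // No override and no estimate: the fallback is used.
        System.out.println(selector.select(
                "http://news.example.org/article/123", null));
    }
}
```

Checking overrides before the estimate keeps the manual rules authoritative, 
which is the point of having them; whether such rules would live in a 
configuration file like the subcollection plugin's is left open.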


> Expose Tika's boilerpipe support
> --------------------------------
>
>                 Key: NUTCH-961
>                 URL: https://issues.apache.org/jira/browse/NUTCH-961
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>             Fix For: 2.0
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to 
> strip boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
