[
https://issues.apache.org/jira/browse/TIKA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877047#action_12877047
]
Christian Kohlschütter commented on TIKA-420:
---------------------------------------------
Lately I have been busy with other things, unfortunately.
Here is a short update, given the recent presentation of Safari Reader (based
upon arc90's readability bookmarklet), which provides functionality similar to
boilerpipe.
Using the L3S-GN1 test document collection (622 news articles, crawled via
GoogleNews; http://www.l3s.de/~kohlschuetter/boilerplate/ ) I found out that
Safari Reader in many cases it essentially fails to produce any content (there
is no "Reader" button available for 238 of the 622 pages), yielding a very low
average F1 score and a significantly lower median score than boilerpipe's
DefaultExtractor or ArticleExtractor.
I am currently reviewing the results from arc90's Readability code to see
whether there are any fundamental differences between Apple's implementation
and theirs.
To summarize, I think boilerpipe is a very efficient, effective and especially
stable tool (read: consistent over a broad variety of sources) for removing
boilerplate / clutter, and ahead of the competition :)
> [PATCH] Integration of boilerpipe: Boilerplate Removal and Fulltext
> Extraction from HTML pages
> ----------------------------------------------------------------------------------------------
>
> Key: TIKA-420
> URL: https://issues.apache.org/jira/browse/TIKA-420
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Christian Kohlschütter
> Assignee: Ken Krugler
> Attachments: tika-app.patch, tika-parsers.patch
>
>
> Hi all,
> while Tika already provides a parser for HTML that extracts the plain text
> from it, the output generally contains all text portions, including the
> surplus "clutter" such as navigation menus, links to related pages etc.
> around the actual main content. This "boilerplate text" typically is not
> related to the main content and may deteriorate search precision.
> I think Tika should be able to automatically detect and remove the
> boilerplate text. I propose to use "boilerpipe" for this purpose, an Apache
> 2.0 licensed Java library written by me. Boilerpipe provides both generic and
> specific strategies for common tasks (for example: news article extraction)
> and may also be easily extended for individual problem settings.
> Extracting content is very fast (milliseconds), just needs the input document
> (no global or site-level information required) and is usually quite accurate.
> In fact, it outperformed the state-of-the-art approaches for several test
> collections.
> The algorithms used by the library are based on (and extending) some concepts
> of my paper "Boilerplate Detection using Shallow Text Features", presented at
> WSDM 2010 -- The Third ACM International Conference on Web Search and Data
> Mining New York City, NY USA. (online at
> http://www.l3s.de/~kohlschuetter/boilerplate/ )
> To use boilerpipe with Tika, I have developed a custom ContentHandler
> (BoilerpipeContentHandler; provided as a patch to tika-parsers) that can
> simply be passed to HtmlParser#parse. The BoilerpipeContentHandler can be
> configured in several ways, particularly which extraction strategy should be
> used and where the extracted content should go -- into Metadata or to a
> Writer).
> I also provide a patch to TikaCLI, such that you can use boilerpipe via Tika
> from the command line (use a capital "-T" flag instead of "-t" to extract the
> main content only).
> I must note that boilerplate removal is considered a research problem:
> While one can always find clever rules to extract the main content from
> particular web pages with 100% accuracy, applying it to random, previously
> unseen pages on the web is non-trivial.
> In my paper, I have shown that on the Web (i.e. independent of a particular
> site owner, page layout etc.), textual content can apparently be grouped into
> two classes, long text (i.e., a lot of subsequent words without markup --
> most likely the actual content) and short text (i.e., a few words between two
> HTML tags, most likely navigational boilerplate text) respectively. Removing
> the words from the short text class alone already is a good strategy for
> cleaning boilerplate and using a combination of multiple shallow text
> features achieves an almost perfect accuracy. To a large extent the detection
> of boilerplate text does not require any inter-document knowledge (frequency
> of text blocks, common page layout etc.) nor any training at token level. The
> costs for detecting boilerplates are negligible, as it comes down simply to
> counting words.
> The algorithms provided in my paper seem to generally work well and
> especially for news article-like pages (for a Zipf-representative collection
> of English news pages crawled via Google News: 90-95% F1 on average, 95-98%
> F1 median), well ahead of the competition (78-89% avg. F1, 87-95% median F1).
> Patches are attached, questions welcome.
> Best,
> Christian
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.