Announcement: Boilerplate removal library

Christian Kohlschütter Fri, 04 Dec 2009 12:34:23 -0800

Dear all,

I am happy to announce the release of Boilerpipe 1.0.


Boilerpipe is a Java library for boilerplate removal and fulltext extraction 
from HTML pages.
It is based on my paper "Boilerplate Detection using Shallow Text Features",
to be presented at WSDM 2010 -- The Third ACM International Conference on Web
Search and Data Mining, 3-6 February 2010, New York City, NY, USA.

The Boilerpipe library provides algorithms to detect and remove the surplus
"clutter" (boilerplate, templates) around the main textual content of a
web page. It already provides specific strategies for common tasks (for
example, news article extraction) and can easily be extended for individual
problem settings. Extraction is very fast (milliseconds per document), needs
only the input document (no global or site-level information is required),
and is usually quite accurate.
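
To make the idea of per-document extraction from shallow text features more
concrete, here is a toy sketch in Java. It is NOT Boilerpipe's API or its
actual classifier. It assumes, purely for illustration, that two shallow
features (word count and link density per text block) are enough, and the
class name, method names and thresholds are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: not part of Boilerpipe, and not its algorithm.
class ShallowFeatureSketch {
    // Hypothetical thresholds, picked by hand for this example.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    // Block-level tags are treated as block boundaries.
    private static final Pattern BLOCK_BREAK = Pattern.compile(
        "(?i)</?(p|div|li|ul|ol|td|tr|table|h[1-6]|br)\\b[^>]*>");
    private static final Pattern ANCHOR =
        Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Number of whitespace-separated tokens in plain text.
    static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Remove all remaining tags and collapse whitespace.
    static String stripTags(String html) {
        return ANY_TAG.matcher(html).replaceAll(" ")
                      .replaceAll("\\s+", " ").trim();
    }

    // Keep blocks that are long enough and not dominated by link text.
    // Only the document itself is consulted, no site-level information.
    static List<String> extract(String html) {
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(html)) {
            String text = stripTags(block);
            int words = countWords(text);
            if (words == 0) {
                continue;
            }
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkWords += countWords(stripTags(m.group(1)));
            }
            double linkDensity = (double) linkWords / words;
            if (words >= MIN_WORDS && linkDensity <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html =
            "<div><a href=\"/\">Home</a> <a href=\"/news\">News</a></div>"
            + "<p>Boilerpipe is a Java library for boilerplate removal and "
            + "fulltext extraction from HTML pages.</p>"
            + "<div>Copyright 2009</div>";
        for (String block : extract(html)) {
            System.out.println(block);
        }
    }
}
```

With the sample input in main, only the paragraph survives: the navigation
block consists entirely of link text and the footer is too short. A regex
based splitter like this is fragile on real-world HTML and is used here only
to keep the sketch self-contained.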

You can find Boilerpipe at http://code.google.com/p/boilerpipe/

The code is released under the Apache 2.0 license and you are very welcome to
use Boilerpipe for whatever you like. Please let me know if it helps you, or if
you have questions about it, difficulties with it, or ideas on how to improve it.

Cheers,
Christian

PS: The website already provides version 1.0.1, which now includes the
dependency jars in the binary tarball.
-- 
Christian Kohlschütter
[email protected]

Forschungszentrum L3S
Leibniz Universität Hannover

http://www.L3S.de/~kohlschuetter/
