Dear all,

I am happy to announce the release of Boilerpipe 1.0.

Boilerpipe is a Java library for boilerplate removal and full-text extraction from HTML pages. It is based on my paper "Boilerplate Detection using Shallow Text Features", to be presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, 3-6 February 2010, New York City, NY, USA.

The library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. It already provides specific strategies for common tasks (for example, news article extraction) and can easily be extended for individual problem settings.

Extraction is very fast (milliseconds), needs only the input document (no global or site-level information is required) and is usually quite accurate.

You can find Boilerpipe at http://code.google.com/p/boilerpipe/

The code is released under the Apache 2.0 license, and you are very welcome to use Boilerpipe for whatever you like. Please let me know if it helps you, or if you have questions about it, difficulties with it, or ideas on how to improve it.

Cheers,
Christian

PS: The website already provides version 1.0.1, which now includes the dependency jars in the binary tarball.

--
Christian Kohlschütter
[email protected]
Forschungszentrum L3S
Leibniz Universität Hannover
http://www.L3S.de/~kohlschuetter/
