Nice paper.  I haven't looked at the software yet, but I would expect it to
have similar qualities.

Have you considered how boilerpipe might be integrated into a Lucene
analyzer?

2009/12/4 Christian Kohlschütter <[email protected]>

> Dear all,
>
> I am happy to announce the release of Boilerpipe 1.0.
>
> Boilerpipe is a Java library for boilerplate removal and fulltext
> extraction from HTML pages.
> It is based on my paper "Boilerplate Detection using Shallow Text Features",
> to be presented at WSDM 2010 -- The Third ACM International Conference on
> Web Search and Data Mining, 3-6 February 2010, New York City, NY USA.
>
> The boilerpipe library provides algorithms to detect and remove the surplus
> "clutter" (boilerplate, templates) around the main textual content of a
> website. It already provides specific strategies for common tasks (for
> example: news article extraction) and may also be easily extended for
> individual problem settings. Extraction is very fast (milliseconds),
> needs only the input document (no global or site-level information required),
> and is usually quite accurate.
>
> You can find Boilerpipe at http://code.google.com/p/boilerpipe/
>
> The code is released under the Apache 2.0 license, and you are very welcome
> to use Boilerpipe for whatever you like. Please let me know if it helps
> you, or if you have questions about it, difficulties with it, or ideas on how
> to improve it.
>
> Cheers,
> Christian
>
> PS: The website already provides version 1.0.1 (which now includes the
> dependency jars in the binary tarball).
> --
> Christian Kohlschütter
> [email protected]
>
> Forschungszentrum L3S
> Leibniz Universität Hannover
>
> http://www.L3S.de/~kohlschuetter/
>
>


-- 
Ted Dunning, CTO
DeepDyve
