Nice paper. I haven't read the software yet, but I would expect it to have similar qualities.
Have you considered how boilerpipe might be integrated into a Lucene analyzer?

2009/12/4 Christian Kohlschütter <[email protected]>

> Dear all,
>
> I am happy to announce the release of Boilerpipe 1.0.
>
> Boilerpipe is a Java library for boilerplate removal and fulltext
> extraction from HTML pages.
> It is based on my paper "Boilerplate Detection using Shallow Text Features"
> to be presented at WSDM 2010 -- The Third ACM International Conference on
> Web Search and Data Mining, 3-6 February 2010, New York City, NY USA.
>
> The boilerpipe library provides algorithms to detect and remove the surplus
> "clutter" (boilerplate, templates) around the main textual content of a
> website. It already provides specific strategies for common tasks (for
> example: news article extraction) and may also be easily extended for
> individual problem settings. Extracting content is very fast (milliseconds),
> just needs the input document (no global or site-level information required)
> and is usually quite accurate.
>
> You can find Boilerpipe at http://code.google.com/p/boilerpipe/
>
> The code is released under the Apache 2.0 license and you are very welcomed
> to use Boilerpipe for whatever you like to. Please let me know if it helps
> you, if you have questions about it, difficulties with it or ideas how to
> improve it.
>
> Cheers,
> Christian
>
> PS: The website already provides version 1.0.1 (now includes the dependency
> jars in the binary tarball)
> --
> Christian Kohlschütter
> [email protected]
>
> Forschungszentrum L3S
> Leibniz Universität Hannover
>
> http://www.L3S.de/~kohlschuetter/ <http://www.L3S.de/%7Ekohlschuetter/>
>
>

--
Ted Dunning, CTO
DeepDyve
