Hi Ted,

Thanks for your email, and sorry for the late reply; I had overlooked your 
posting.

Adding boilerpipe to Lucene is definitely a good idea (I have been working with 
such a setup for a long time now).
Integrating it into an Analyzer should be fairly simple, as Boilerpipe can 
return a string, which in turn can be parsed just like any other text.
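
A minimal sketch of that "extract first, then analyze" flow, under stated 
assumptions: the class ExtractThenTokenize and its tokens method are 
hypothetical names, the extractor is passed in as a plain function so the 
sketch compiles without Boilerpipe, and the simple split/lowercase loop is 
only a stand-in for a real Lucene Analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical glue code, not part of Boilerpipe or Lucene.
final class ExtractThenTokenize {

    private ExtractThenTokenize() {}

    // extractor: maps raw HTML to main-content text (assumed to be backed
    // by Boilerpipe in a real setup; injected here to stay self-contained).
    static List<String> tokens(String html, Function<String, String> extractor) {
        // Step 1: boilerplate removal yields an ordinary string.
        String text = extractor.apply(html);

        // Step 2: stand-in for an Analyzer: split on runs of characters
        // that are neither letters nor digits, then lowercase each token.
        List<String> out = new ArrayList<>();
        for (String tok : text.split("[^\\p{L}\\p{N}]+")) {
            if (!tok.isEmpty()) {
                out.add(tok.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }
}
```

In a real setup the extracted string would simply be handed to whatever 
Analyzer the index already uses, instead of the hand-written loop above.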

However, to increase recall it would also be great to store the non-content 
text as well and apply some kind of static boost to content blocks over 
non-content blocks. I am not sure whether this would work right now using an 
Analyzer. What you could do, though, is store the text in separate fields 
("content"/"boilerplate") and add field-specific boosts at query time.
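
A rough sketch of the query-time side, assuming the two field names above and 
Lucene's classic query parser syntax (field:(terms)^boost). The helper class 
BoostedFieldQuery and the boost values are made up for illustration, and the 
terms are assumed to be already escaped for the query parser.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene or Boilerpipe: builds a query
// string that searches both fields and weights "content" matches higher.
final class BoostedFieldQuery {

    private BoostedFieldQuery() {}

    // terms: whitespace-separated query terms, assumed already escaped.
    static String build(String terms, float contentBoost, float boilerplateBoost) {
        String t = terms.trim();
        if (t.isEmpty()) {
            throw new IllegalArgumentException("terms must not be empty");
        }
        // Two clauses, one per field, each with its own boost factor.
        return clause("content", t, contentBoost)
                + " " + clause("boilerplate", t, boilerplateBoost);
    }

    private static String clause(String field, String terms, float boost) {
        // Locale.ROOT keeps the decimal separator a dot on every platform.
        return String.format(Locale.ROOT, "%s:(%s)^%.1f", field, terms, boost);
    }
}
```

For example, BoostedFieldQuery.build("boilerplate detection", 4.0f, 1.0f) 
yields 
content:(boilerplate detection)^4.0 boilerplate:(boilerplate detection)^1.0, 
so a document that only matches in the boilerplate field is still found but 
should rank below one that matches in the content field.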

Cheers,
Christian

On 04.12.2009 at 22:59, Ted Dunning wrote:

> Nice paper.  I haven't read the software yet, but I would expect it to have
> similar qualities.
> 
> Have you considered how boilerpipe might be integrated into a Lucene
> analyzer?
> 
> 2009/12/4 Christian Kohlschütter <[email protected]>
> 
>> Dear all,
>> 
>> I am happy to announce the release of Boilerpipe 1.0.
>> 
>> Boilerpipe is a Java library for boilerplate removal and fulltext
>> extraction from HTML pages.
>> It is based on my paper "Boilerplate Detection using Shallow Text Features"
>> to be presented at WSDM 2010 -- The Third ACM International Conference on
>> Web Search and Data Mining, 3-6 February 2010, New York City, NY USA.
>> 
>> The boilerpipe library provides algorithms to detect and remove the surplus
>> "clutter" (boilerplate, templates) around the main textual content of a
>> website. It already provides specific strategies for common tasks (for
>> example: news article extraction) and may also be easily extended for
>> individual problem settings. Extracting content is very fast (milliseconds),
>> just needs the input document (no global or site-level information required)
>> and is usually quite accurate.
>> 
>> You can find Boilerpipe at http://code.google.com/p/boilerpipe/
>> 
>> The code is released under the Apache 2.0 license and you are very welcome
>> to use Boilerpipe for whatever you like. Please let me know if it helps
>> you, if you have questions about it, difficulties with it or ideas on how
>> to improve it.
>> 
>> Cheers,
>> Christian
>> 
>> PS: The website already provides version 1.0.1 (now includes the dependency
>> jars in the binary tarball)
>> --
>> Christian Kohlschütter
>> [email protected]
>> 
>> Forschungszentrum L3S
>> Leibniz Universität Hannover
>> 
>> http://www.L3S.de/~kohlschuetter/
>> 
>> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve

-- 
Christian Kohlschütter
[email protected]

L3S Research Center
Forschungszentrum L3S / Leibniz Universität Hannover

http://www.L3S.de/~kohlschuetter


