Hi Ted,
thanks for your email, and sorry for replying so late; I had overlooked your
post.
Adding boilerpipe to Lucene is definitely a good idea (I have been working with
such a setup for a long time now).
Integrating it into an Analyzer should be fairly simple, as Boilerpipe can
return a string, which in turn can be parsed just like any other text.
However, to increase recall it would be great to store the non-content text as
well and apply some kind of static boost to content blocks over non-content
blocks. I am not sure whether this can be done right now with an Analyzer
alone. What you could do, though, is store the text in separate fields
("content"/"boilerplate") and add field-specific boosts at query time.
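
To illustrate the two-field idea, here is a rough sketch in plain Java. It
deliberately makes no Boilerpipe or Lucene calls: the Block class, its
isContent flag (standing in for whatever the extractor decides per block) and
the boost values 2.0/0.5 are assumptions made up for this example. Only the
field names "content" and "boilerplate" come from the suggestion above. The
query string follows the field:(terms)^boost form of Lucene's classic query
parser syntax, and the terms are not escaped here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FieldBoostSketch {

    /** Hypothetical text block with a flag saying whether it is main content. */
    static final class Block {
        final String text;
        final boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }
    }

    /** Joins the text of all blocks whose flag equals wantContent, one block per line. */
    static String fieldText(List<Block> blocks, boolean wantContent) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (b.isContent == wantContent) {
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(b.text);
            }
        }
        return sb.toString();
    }

    /** Builds a query string that searches both fields with different boosts. */
    static String boostedQuery(String terms, float contentBoost, float boilerplateBoost) {
        // Locale.ROOT keeps the decimal separator a dot regardless of system locale.
        return String.format(Locale.ROOT,
                "content:(%s)^%.1f boilerplate:(%s)^%.1f",
                terms, contentBoost, terms, boilerplateBoost);
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("Home | About | Contact", false));
        blocks.add(new Block("Boilerpipe 1.0 released", true));
        blocks.add(new Block("It removes boilerplate.", true));
        blocks.add(new Block("Copyright 2009", false));

        // These two strings would be indexed as separate fields of one document.
        System.out.println("content field:     " + fieldText(blocks, true));
        System.out.println("boilerplate field: " + fieldText(blocks, false));

        // At query time, matches in "content" count more than in "boilerplate".
        System.out.println(boostedQuery("boilerpipe release", 2.0f, 0.5f));
    }
}
```

Since both fields are indexed, a document that mentions the query terms only
in its boilerplate can still be found (recall), while the boosts rank content
matches higher.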
Cheers,
Christian
On 04.12.2009 at 22:59, Ted Dunning wrote:
> Nice paper. I haven't read the software yet, but I would expect it to have
> similar qualities.
>
> Have you considered how boilerpipe might be integrated into a Lucene
> analyzer?
>
> 2009/12/4 Christian Kohlschütter <[email protected]>
>
>> Dear all,
>>
>> I am happy to announce the release of Boilerpipe 1.0.
>>
>> Boilerpipe is a Java library for boilerplate removal and fulltext
>> extraction from HTML pages.
>> It is based on my paper "Boilerplate Detection using Shallow Text Features"
>> to be presented at WSDM 2010 -- The Third ACM International Conference on
>> Web Search and Data Mining, 3-6 February 2010, New York City, NY USA.
>>
>> The boilerpipe library provides algorithms to detect and remove the surplus
>> "clutter" (boilerplate, templates) around the main textual content of a
>> website. It already provides specific strategies for common tasks (for
>> example: news article extraction) and may also be easily extended for
>> individual problem settings. Extracting content is very fast (milliseconds),
>> just needs the input document (no global or site-level information required)
>> and is usually quite accurate.
>>
>> You can find Boilerpipe at http://code.google.com/p/boilerpipe/
>>
>> The code is released under the Apache 2.0 license and you are very welcome
>> to use Boilerpipe for whatever you like. Please let me know if it helps
>> you, or if you have questions about it, difficulties with it, or ideas for
>> how to improve it.
>>
>> Cheers,
>> Christian
>>
>> PS: The website already provides version 1.0.1 (now includes the dependency
>> jars in the binary tarball)
>> --
>> Christian Kohlschütter
>> [email protected]
>>
>> Forschungszentrum L3S
>> Leibniz Universität Hannover
>>
>> http://www.L3S.de/~kohlschuetter/
>>
>>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
--
Christian Kohlschütter
[email protected]
L3S Research Center
Forschungszentrum L3S / Leibniz Universität Hannover
http://www.L3S.de/~kohlschuetter