Boris, you might wanna look at http://code.google.com/p/boilerpipe/

simon

On Mon, Jun 28, 2010 at 10:48 PM, Boris Aleksandrovsky
<balek...@gmail.com> wrote:
> Thanks, Sashi, I am asking more about a general library which will remove
> those HTML element which are unwanted/useless for indexing. For instance, we
> are using a general method to remove headers by comparing the structure of
> HTML on the top-level document from the site (e.g. www.nytimes.com) and the
> page being crawled (which happens to be further down in the hierarchy).
> Generally the difference will be the header or the footer. Is there a
> library out there which contains a collection of hacks like that?
>
> On Mon, Jun 28, 2010 at 1:31 PM, Shashi Kant <sk...@sloan.mit.edu> wrote:
>
>> I have used TagSoup to parse the HTML and get the elements of interest.
>> http://ccil.org/~cowan/XML/tagsoup/
>>
>>
>>
>> On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky
>> <balek...@gmail.com> wrote:
>> > I was wondering if any of you know of any open-source solutions for
>> general
>> > issues which arise in web crawling - how do you remove
>> > headers/footers/javascript and generally cleanup html of a web-page
>> before
>> > indexing? We have a first-pass solution implemented using custom code,
>> but
>> > this must be a problem which a lot of people face, so I am asking here.
>> >
>> > Thanks,
>> > Boris
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to