header/footer identification and general scraping tools

Boris Aleksandrovsky Mon, 28 Jun 2010 13:08:16 -0700

I was wondering whether any of you know of open-source solutions for the
general problems that arise in web crawling: how do you remove headers,
footers, and JavaScript, and generally clean up the HTML of a web page
before indexing? We have a first-pass solution implemented in custom code,
but this must be a problem that many people face, so I am asking here.
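
To make the kind of cleanup concrete, here is a minimal sketch. It is an
illustration only, not the custom code mentioned above. It assumes Python's
standard-library html.parser, and it assumes the boilerplate sits inside
script, style, header, footer, or nav elements. The names SKIP_TAGS,
TextExtractor, and clean_html are hypothetical. Pages that mark headers and
footers with generic div elements would need a different heuristic.

```python
from html.parser import HTMLParser

# Assumption: boilerplate lives inside these elements. Pages that use
# generic <div> wrappers for headers/footers are not handled by this sketch.
SKIP_TAGS = {"script", "style", "header", "footer", "nav"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested in SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []     # text fragments to keep

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when outside every skipped element.
        if self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def clean_html(html):
    """Return the page text with skipped elements removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><head><script>var x = 1;</script></head><body>"
        "<header>Site menu</header><p>Main article text.</p>"
        "<footer>Copyright</footer></body></html>"
    )
    print(clean_html(page))
```

A tag-based filter like this is only a starting point. It does not identify
headers or footers that repeat across pages without dedicated markup.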


Thanks,
Boris
