Does anyone know of open-source solutions for a common web-crawling problem: how do you remove headers, footers, and JavaScript, and generally clean up the HTML of a web page before indexing it? We have a first-pass solution implemented in custom code, but many people must face this problem, so I am asking here.
Thanks, Boris