Does anyone know of open-source solutions for a common web-crawling
problem: how do you remove headers, footers, and JavaScript, and generally
clean up a web page's HTML before indexing? We have a first-pass solution
implemented in custom code, but many people must face this problem, so I
am asking here.

Thanks,
Boris
