I have used TagSoup to parse the HTML and get the elements of interest. http://ccil.org/~cowan/XML/tagsoup/
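
A minimal sketch of that approach, not a definitive implementation. It relies on the fact that TagSoup's parser class, org.ccil.cowan.tagsoup.Parser, is a SAX XMLReader, so ordinary SAX handler code can walk the cleaned-up document. The class name HtmlTextExtractor, the extract helper, and the choice of which elements to skip are assumptions made for illustration. Page headers and footers are usually site-specific markup, so this sketch does not try to detect them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects visible text and drops script/style content.
class HtmlTextExtractor extends DefaultHandler {
    // Elements whose content is discarded (an assumption; extend as needed).
    private static final Set<String> SKIPPED = Set.of("script", "style", "noscript");

    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside a skipped element

    // Namespace-aware readers fill localName; others only fill qName.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0 || SKIPPED.contains(name(localName, qName))) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        } else {
            // Separate text of adjacent elements. This also splits words that
            // span inline tags, which is a trade-off of this simple sketch.
            text.append(' ');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // Parses html with the given reader and returns whitespace-normalized text.
    // For real-world HTML, pass: new org.ccil.cowan.tagsoup.Parser()
    static String extract(XMLReader reader, String html) throws IOException, SAXException {
        HtmlTextExtractor handler = new HtmlTextExtractor();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(html)));
        return handler.text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

With the TagSoup jar on the classpath, HtmlTextExtractor.extract(new org.ccil.cowan.tagsoup.Parser(), html) should return the page text ready to hand to the indexer. Taking the XMLReader as a parameter keeps the handler independent of the parser, so it can also be exercised with the JDK's own SAX parser on well-formed input.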

On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky <balek...@gmail.com> wrote:
> I was wondering if any of you know of any open-source solutions for general
> issues which arise in web crawling - how do you remove
> headers/footers/javascript and generally cleanup html of a web-page before
> indexing? We have a first-pass solution implemented using custom code, but
> this must be a problem which a lot of people face, so I am asking here.
>
> Thanks,
> Boris