Boris, you might wanna look at http://code.google.com/p/boilerpipe/
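
In case it helps while you evaluate it, here is a rough sketch of the comparison trick you describe in your message below: take the site's top-level page as a reference and drop whatever the crawled page shares with it at the top and at the bottom. This is only a sketch under my own assumptions. It is not how boilerpipe works and not your code. It compares trimmed lines rather than real HTML structure, and the class name SharedBlockStripper and its methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: strips leading/trailing lines that a crawled page
// shares with a reference page (e.g. the site's top-level document).
class SharedBlockStripper {

    // Splits a page into trimmed, non-empty lines ("blocks").
    static List<String> blocks(String html) {
        List<String> out = new ArrayList<>();
        for (String line : html.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    // Returns the blocks of `page` minus the common prefix and common suffix
    // it shares with `reference`.
    static List<String> stripShared(List<String> reference, List<String> page) {
        int max = Math.min(reference.size(), page.size());

        // Length of the shared prefix (assumed to be the header).
        int head = 0;
        while (head < max && reference.get(head).equals(page.get(head))) {
            head++;
        }

        // Length of the shared suffix (assumed to be the footer),
        // bounded so it never overlaps the prefix.
        int tail = 0;
        while (tail < max - head
                && reference.get(reference.size() - 1 - tail)
                            .equals(page.get(page.size() - 1 - tail))) {
            tail++;
        }

        return new ArrayList<>(page.subList(head, page.size() - tail));
    }

    public static void main(String[] args) {
        String home = "<div id=\"nav\">Home | World</div>\n"
                + "<p>Front page teaser</p>\n"
                + "<div id=\"footer\">Copyright</div>";
        String article = "<div id=\"nav\">Home | World</div>\n"
                + "<h1>Story title</h1>\n"
                + "<p>Story body</p>\n"
                + "<div id=\"footer\">Copyright</div>";
        // Prints the article blocks without the shared nav and footer lines.
        System.out.println(stripShared(blocks(home), blocks(article)));
    }
}
```

Real pages rarely match line for line, so in practice you would probably run the same comparison over a parsed tree (TagSoup output, for example) instead of raw lines.
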

simon

On Mon, Jun 28, 2010 at 10:48 PM, Boris Aleksandrovsky <balek...@gmail.com> wrote:

> Thanks, Shashi. I am asking more about a general library which will remove
> those HTML elements which are unwanted/useless for indexing. For instance, we
> are using a general method to remove headers by comparing the structure of
> the HTML of the top-level document from the site (e.g. www.nytimes.com) and
> the page being crawled (which happens to be further down in the hierarchy).
> Generally the difference will be the header or the footer. Is there a
> library out there which contains a collection of hacks like that?
>
> On Mon, Jun 28, 2010 at 1:31 PM, Shashi Kant <sk...@sloan.mit.edu> wrote:
>
>> I have used TagSoup to parse the HTML and get the elements of interest.
>> http://ccil.org/~cowan/XML/tagsoup/
>>
>> On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky
>> <balek...@gmail.com> wrote:
>> > I was wondering if any of you know of any open-source solutions for
>> > general issues which arise in web crawling: how do you remove
>> > headers/footers/JavaScript and generally clean up the HTML of a web
>> > page before indexing? We have a first-pass solution implemented using
>> > custom code, but this must be a problem which a lot of people face, so
>> > I am asking here.
>> >
>> > Thanks,
>> > Boris
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org