Hi Henri,

Sorry for the extra email. I am also particularly interested in the differences between the Page, Fetchers, Parsers, and Scrapers. I'm just trying to work out who does what. My main confusion is between HttpFetcher and HtmlScraper. What exactly is the difference, and in what situations would I use one over the other?
BTW, apart from that minor confusion, I feel that this is what I've been looking for. I primarily want to use it to log in to my financial accounts and read my monthly statements online. I definitely appreciate you telling me about your components. Your help is greatly appreciated!

-Brant

-----Original Message-----
From: Henri Yandell [mailto:[EMAIL PROTECTED]
Sent: Saturday, November 20, 2004 7:06 PM
To: Jakarta Commons Users List
Subject: Re: [HttpClient] Screen Scraping Components?

Couple of components that might be of interest.

http://www.osjava.org/genjava/multiproject/gj-scrape/

Firstly, a library for scraping a web page. It's a wrapper around simple string manipulation, aimed at letting you specify what you want from the page without parsing it into an XML tree or trying to use regex. The problem with an XML tree is that your scraper touches too much of the page, which makes it less stable. Scraping is about minimising the surface area you touch to as little as possible, hopefully just the data itself. Regexes are useful for grabbing the data once you get close enough, but are not the right tool for walking through the tags. Gj-Scrape is a basic API for walking through a page.

Secondly, an engine for scraping:

http://www.osjava.org/scraping-engine/

A lot of time with scrapers is wasted writing the surrounding code: getting the page, setting up the config in some cron'd way, putting the results in a db, etc. Scraping-engine is everything except the actual parsing of the page, which you write yourself using gj-scrape and plug in. It uses HttpClient for its page-grabbing, and isn't tied to scraping; I've written a link-checker using it as the framework. Grabbing the cartoon-scraping example is the best way to understand it.
Hen

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
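
To illustrate the string-walking idea described in the quoted message (touch as little of the page as possible, with no XML tree and no regex for navigating tags), here is a minimal sketch in Java. It is not the gj-scrape API: the class name PageWalker, its methods moveTo and between, and the sample HTML are all hypothetical, made up for illustration only.

```java
// Hypothetical helper, NOT part of gj-scrape: walks a page using plain
// string searches instead of building an XML tree or matching regexes.
class PageWalker {
    private final String page;
    private int pos = 0; // current cursor position within the page

    PageWalker(String page) {
        this.page = page;
    }

    // Advance the cursor to just past the next occurrence of marker.
    // Returns false and leaves the cursor unchanged if marker is absent.
    boolean moveTo(String marker) {
        int idx = page.indexOf(marker, pos);
        if (idx < 0) {
            return false;
        }
        pos = idx + marker.length();
        return true;
    }

    // Return the text between the next 'start' and the following 'end',
    // moving the cursor past 'end'. Returns null if either is missing.
    String between(String start, String end) {
        int s = page.indexOf(start, pos);
        if (s < 0) {
            return null;
        }
        s += start.length();
        int e = page.indexOf(end, s);
        if (e < 0) {
            return null;
        }
        pos = e + end.length();
        return page.substring(s, e);
    }

    public static void main(String[] args) {
        // Made-up sample page; a real statement page would be fetched first.
        String html = "<html><body><h1>Statement</h1>"
                + "<table><tr><td class=\"label\">Balance</td>"
                + "<td class=\"amount\">1,234.56</td></tr></table></body></html>";

        PageWalker walker = new PageWalker(html);
        // Jump close to the data, then grab only the value itself.
        if (walker.moveTo(">Balance<")) {
            System.out.println(walker.between("<td class=\"amount\">", "</td>"));
        }
    }
}
```

The sketch depends only on the label text and the one cell that follows it, so changes elsewhere on the page would not break it. That is the "small surface area" point made in the message.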
