A couple of components that might be of interest.

Firstly, a library for scraping a web page:
http://www.osjava.org/genjava/multiproject/gj-scrape/

It's a wrapper around simple string manipulation, aimed at letting you specify what you want from the page without parsing it into an XML tree or trying to use regex. The problem with the XML tree is that your scraper depends on too much of the page and is more fragile. Scraping is about minimising the surface area you touch, hopefully to just the data itself. Regexes are useful for grabbing the data once you get close enough, but are not the right thing to use to walk through the tags. Gj-Scrape is a basic API for walking through a page.

Secondly, an engine for scraping:
http://www.osjava.org/scraping-engine/

A lot of time with scrapers is wasted writing the surrounding code: getting the page, setting up the config in some cron'd way, putting it in a db, etc. Scraping-engine is everything except the actual parsing of the page, which you write yourself using gj-scrape and plug in. It uses HttpClient for its page-grabbing, and isn't tied to scraping; I've written a link-checker using it as the framework. Grabbing the cartoon-scraping example is the best way to understand it.

Hen
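
P.S. To give a feel for the "walk through the page with plain string operations" idea, here is a rough sketch. It is not the gj-scrape API: the PageCursor class, its method names and the sample HTML are all made up for illustration, and it uses nothing but String.indexOf. The point is only that the scraper touches one marker and one attribute and ignores the rest of the page.

```java
// Hypothetical sketch only: PageCursor is NOT part of gj-scrape.
// It illustrates walking a page with plain String.indexOf calls.
final class PageCursor {
    private final String page;
    private int pos = 0; // current offset into the page

    PageCursor(String page) {
        this.page = page;
    }

    /** Moves just past the next occurrence of marker; returns false and stays put if it is absent. */
    boolean moveTo(String marker) {
        int idx = page.indexOf(marker, pos);
        if (idx < 0) {
            return false;
        }
        pos = idx + marker.length();
        return true;
    }

    /**
     * Returns the text between the next occurrence of start and the
     * following occurrence of end, moving the cursor past end.
     * Returns null (cursor unchanged) if either delimiter is missing.
     */
    String between(String start, String end) {
        int s = page.indexOf(start, pos);
        if (s < 0) {
            return null;
        }
        s += start.length();
        int e = page.indexOf(end, s);
        if (e < 0) {
            return null;
        }
        pos = e + end.length();
        return page.substring(s, e);
    }

    public static void main(String[] args) {
        // Made-up page: only the "strip" div matters to the scraper.
        String html = "<html><body><div id=\"nav\">...</div>"
                + "<div class=\"strip\"><img src=\"/comics/today.gif\" alt=\"Today\"></div>"
                + "</body></html>";
        PageCursor cursor = new PageCursor(html);
        if (cursor.moveTo("class=\"strip\"")) {
            System.out.println(cursor.between("src=\"", "\""));
        }
    }
}
```

If the navigation or the rest of the layout changes, this keeps working as long as the one marker and the src attribute are still there, which is the "minimal surface area" argument in practice.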
