Re: [HttpClient] Screen Scraping Components?

Henri Yandell Sat, 20 Nov 2004 16:05:57 -0800

Couple of components that might be of interest.

http://www.osjava.org/genjava/multiproject/gj-scrape/


Firstly, a library for scraping a web page. It's a wrapper around
simple string manipulation, aimed at letting you specify what you want
from the page without parsing it into an XML tree or trying to use
regexes. The problem with the XML tree is that your scraper then
depends on too much of the page and is less stable. Scraping is about
keeping the surface area you touch as small as possible, hopefully
just the data itself.

Regexes are useful for grabbing the data once you get close enough,
but are not the right thing to use to walk through the tags. Gj-Scrape
is a basic API for walking through a page.
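
To make the walk-then-regex idea concrete, here is a minimal sketch in
plain Java. It is not the gj-scrape API: PageCursor, moveTo, readUntil
and extractPrice are hypothetical names made up for illustration, and
the sample markup is invented. The cursor walks to the data using string
markers only, and a regex is applied just to the small fragment found.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (NOT the gj-scrape API): a cursor over the raw page text.
class PageCursor {
    private final String page;
    private int pos = 0;

    PageCursor(String page) {
        this.page = page;
    }

    // Move just past the next occurrence of marker.
    // Returns false and leaves the cursor unchanged if the marker is absent.
    boolean moveTo(String marker) {
        int i = page.indexOf(marker, pos);
        if (i < 0) {
            return false;
        }
        pos = i + marker.length();
        return true;
    }

    // Return the text between the cursor and the next occurrence of marker,
    // then move past the marker. Returns null if the marker is absent.
    String readUntil(String marker) {
        int i = page.indexOf(marker, pos);
        if (i < 0) {
            return null;
        }
        String text = page.substring(pos, i);
        pos = i + marker.length();
        return text;
    }
}

class ScrapeDemo {
    // The regex only sees the small fragment, never the whole page.
    private static final Pattern PRICE = Pattern.compile("(\\d+\\.\\d{2})");

    // Assumed page layout: a tag carrying id="price" whose text holds the number.
    static String extractPrice(String html) {
        PageCursor cursor = new PageCursor(html);
        if (!cursor.moveTo("id=\"price\"") || !cursor.moveTo(">")) {
            return null;
        }
        String cell = cursor.readUntil("<");
        if (cell == null) {
            return null;
        }
        Matcher m = PRICE.matcher(cell);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><body><div class=\"ad\">x</div>"
                + "<td id=\"price\"> USD 12.50 </td></body></html>";
        System.out.println(extractPrice(html));
    }
}
```

The scraper touches only the id="price" marker and the text after it, so
changes elsewhere on the page do not affect it.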

Secondly, an engine for scraping:

http://www.osjava.org/scraping-engine/

A lot of time with scrapers is wasted writing the surrounding code:
getting the page, setting up the config to run in some cron'd way,
putting the results in a db, etc. Scraping-engine is everything except
the actual parsing of the page, which you write yourself using
gj-scrape and plug in.
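
The split described above can be sketched roughly as follows. This is
not scraping-engine's actual API: the Fetcher, Parser, Store and
ScrapeJob names are hypothetical, and the fetcher in main is an
in-memory stand-in rather than a real HTTP call. The point is only that
the parser is the one piece you write per site.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interfaces (NOT scraping-engine's API), one per stage.
interface Fetcher {
    String fetch(String url) throws Exception;
}

interface Parser {
    List<String> parse(String page);
}

interface Store {
    void save(String url, List<String> items);
}

// The reusable part: fetch the page, hand it to the plugged-in parser, store the result.
class ScrapeJob {
    private final Fetcher fetcher;
    private final Parser parser;
    private final Store store;

    ScrapeJob(Fetcher fetcher, Parser parser, Store store) {
        this.fetcher = fetcher;
        this.parser = parser;
        this.store = store;
    }

    // Runs one scrape and returns the number of items stored.
    int run(String url) throws Exception {
        String page = fetcher.fetch(url);
        List<String> items = parser.parse(page);
        store.save(url, items);
        return items.size();
    }
}

class EngineDemo {
    // The site-specific part: collect every src="..." value with plain string walking.
    static List<String> imageSources(String page) {
        List<String> out = new ArrayList<>();
        String marker = "src=\"";
        int pos = 0;
        while (true) {
            int start = page.indexOf(marker, pos);
            if (start < 0) {
                break;
            }
            start += marker.length();
            int end = page.indexOf('"', start);
            if (end < 0) {
                break;
            }
            out.add(page.substring(start, end));
            pos = end + 1;
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-ins; a real setup would fetch over HTTP and write to a db.
        Fetcher fake = url -> "<img src=\"a.gif\"><p>text</p><img src=\"b.gif\">";
        Store console = (url, items) -> System.out.println(url + " -> " + items);
        ScrapeJob job = new ScrapeJob(fake, EngineDemo::imageSources, console);
        job.run("http://example.invalid/cartoons");
    }
}
```

Swapping the parser for one that collects href values instead would be
one way to build something like a link-checker on the same skeleton.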

It uses HttpClient for its page-grabbing, and isn't tied to scraping;
I've a link-checker written using it as the framework. Grabbing the
cartoon-scraping example is the best way to understand it.

Hen
