Hi Henri,

Sorry for the extra email.  I'm also particularly interested in the
differences between the Page, Fetchers, Parsers, and Scrapers.  I'm just
trying to work out who does what.  My main confusion is between
HttpFetcher and HtmlScraper.  What exactly is the difference, and in what
situations would I use one over the other?

BTW, apart from that minor confusion, I feel that this is what I've been
looking for.  I primarily want to use it to log in to my financial
accounts and read my monthly statements online.  I definitely appreciate
you telling me about your components.  Your help is greatly appreciated!

-Brant

-----Original Message-----
From: Henri Yandell [mailto:[EMAIL PROTECTED] 
Sent: Saturday, November 20, 2004 7:06 PM
To: Jakarta Commons Users List
Subject: Re: [HttpClient] Screen Scraping Components?

Couple of components that might be of interest.

http://www.osjava.org/genjava/multiproject/gj-scrape/

Firstly, a library for scraping a web page. It's a wrapper around
simple string manipulation that aims to let you specify what you want
from the page without parsing it into an XML tree or trying to use
regex. The problem with the XML tree is that your scraper ends up
depending on too much of the page, which makes it less stable. Scraping
is about minimising the surface area you touch, ideally to just the
data itself.

Regexes are useful for grabbing the data once you get close enough,
but they are not the right tool for walking through the tags. Gj-Scrape
is a basic API for walking through a page.
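
To make the idea concrete, here is a minimal sketch in plain Java of
walking a page with simple string searches and only using a regex once
the cursor is close to the data. This is NOT gj-scrape's API; the
PageWalker class and its moveAfter, readUntil and firstAmount methods
are hypothetical names invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of gj-scrape): walks a page with plain
// string searches so the scraper touches as little markup as possible.
final class PageWalker {
    private final String page;
    private int pos = 0; // current cursor position in the page

    PageWalker(String page) {
        this.page = page;
    }

    // Move the cursor to just past the next occurrence of marker.
    // Returns false and leaves the cursor alone if marker is not found.
    boolean moveAfter(String marker) {
        int idx = page.indexOf(marker, pos);
        if (idx < 0) {
            return false;
        }
        pos = idx + marker.length();
        return true;
    }

    // Return the text between the cursor and the next occurrence of end,
    // then move the cursor past end. Returns null if end is not found.
    String readUntil(String end) {
        int idx = page.indexOf(end, pos);
        if (idx < 0) {
            return null;
        }
        String text = page.substring(pos, idx);
        pos = idx + end.length();
        return text;
    }

    // Once close to the data, a small regex picks out the value itself:
    // the first number with two decimal places, e.g. 1,234.56.
    static String firstAmount(String text) {
        Matcher m = Pattern.compile("\\d[\\d,]*\\.\\d{2}").matcher(text);
        return m.find() ? m.group() : null;
    }
}
```

A caller would typically do something like moveAfter("<td>Balance</td>"),
then moveAfter("<td>"), then readUntil("</td>"), and pass the result to
firstAmount. Only those two markers need to stay stable in the page;
the rest of the markup can change without breaking the scraper.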

Secondly, an engine for scraping:

http://www.osjava.org/scraping-engine/

A lot of time with scrapers is wasted writing the surrounding code.
Getting the page, setting up the config in some cron'd way, putting it
in a db etc. Scraping-engine is everything except for the actual
parsing of the page, which you custom create using gj-scrape and plug
in.
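
As a rough illustration of that split, the sketch below separates the
surrounding plumbing (fetching and storing) from the page-specific
parser that gets plugged in. It does not show scraping-engine's real
classes; PageFetcher, PageParser, ResultStore and ScrapeJob are
hypothetical names chosen for this example.

```java
import java.util.List;

// All types below are hypothetical, for illustration only.

// Gets the raw page text for a URL (an HTTP fetcher in practice).
interface PageFetcher {
    String fetch(String url);
}

// The only page-specific part: turns raw page text into data items.
interface PageParser {
    List<String> parse(String page);
}

// Puts the parsed items somewhere (a database, a file, and so on).
interface ResultStore {
    void save(String url, List<String> items);
}

// The reusable plumbing: fetch, hand off to the plugged-in parser, store.
final class ScrapeJob {
    private final PageFetcher fetcher;
    private final PageParser parser;
    private final ResultStore store;

    ScrapeJob(PageFetcher fetcher, PageParser parser, ResultStore store) {
        this.fetcher = fetcher;
        this.parser = parser;
        this.store = store;
    }

    // Runs one scrape and returns the number of items that were stored.
    int run(String url) {
        String page = fetcher.fetch(url);
        List<String> items = parser.parse(page);
        store.save(url, items);
        return items.size();
    }
}
```

With this shape, a new site only needs a new PageParser; the fetching,
scheduling and storage code is written once and reused.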

It uses HttpClient for its page-grabbing, and isn't tied to scraping;
I've a link-checker written using it as the framework. Grabbing the
cartoon-scraping example is the best way to understand it.

Hen

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


