I'll keep it on list for a bit as they're both bits that I'd like to
see at Apache.
HtmlScraper, which is the main class in gj-scrape, is all about
pulling the desired data out of a piece of text. An XML parser, regexp
or simple string manipulation would also do the job; I just like the
fact that HtmlScraper speaks the right language for scraping a page.
It doesn't try to parse the page itself; parsing the whole page always
makes me worry that the surface-area of the scraper is too large.
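To make the "small surface-area" idea concrete, here is a minimal sketch of a scraper that walks a page with nothing but indexOf/substring. TinyScraper is a hypothetical stand-in written for this mail, not the real gj-scrape HtmlScraper; it only borrows the two method names from the User Friendly example quoted in this mail. It assumes case-sensitive matching, double-quoted attribute values, and a space before each attribute name.

```java
/** Hypothetical stand-in for illustration; NOT the gj-scrape HtmlScraper. */
final class TinyScraper {
    private final String page;
    private int pos = 0; // index of the '<' that opens the current tag

    TinyScraper(String page) {
        this.page = page;
    }

    /** Moves the cursor to the first tag at or after it containing attr="value". */
    boolean moveToTagWith(String attr, String value) {
        int hit = page.indexOf(attr + "=\"" + value + "\"", pos);
        if (hit < 0) {
            return false;
        }
        int open = page.lastIndexOf('<', hit);
        if (open < 0) {
            return false;
        }
        pos = open;
        return true;
    }

    /** Reads a spec such as "IMG[SRC]" from the first matching tag at or after the cursor. */
    String get(String spec) {
        int bracket = spec.indexOf('[');
        String tag = spec.substring(0, bracket);
        String attr = spec.substring(bracket + 1, spec.length() - 1);
        int open = page.indexOf("<" + tag, pos);
        if (open < 0) {
            return null;
        }
        int close = page.indexOf('>', open);
        int key = page.indexOf(" " + attr + "=\"", open);
        if (close < 0 || key < 0 || key > close) {
            return null; // tag is unterminated or lacks the attribute
        }
        int start = key + attr.length() + 3; // skip the space, the name and ="
        int end = page.indexOf('"', start);
        return end < 0 ? null : page.substring(start, end);
    }
}

class TinyScraperDemo {
    public static void main(String[] args) {
        String html = "<HTML><BODY><IMG SRC=\"/logo.gif\" ALT=\"Logo\">"
                + "<P>News</P>"
                + "<IMG ALT=\"Latest Strip\" SRC=\"/cartoons/today.gif\">"
                + "</BODY></HTML>";
        TinyScraper scraper = new TinyScraper(html);
        if (scraper.moveToTagWith("ALT", "Latest Strip")) {
            System.out.println(scraper.get("IMG[SRC]")); // prints /cartoons/today.gif
        }
    }
}
```

The point of the sketch is that only the one tag you care about is ever inspected; the rest of the page can change without breaking the scraper.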
The scraping-engine, on the other hand, does everything but scrape the
actual page. You extend the Parser class, add a bit of configuration
and start running it. Inside your Parser subclass you can use
regexp, XML parsing or HtmlScraper.
(Oops, I realise the confusion).
The examples in scraping-engine extend a custom version of Parser
called UrlScraper (nice of me to switch names, eh?) which goes ahead
and sets you up to use an HtmlScraper by default. It makes the actual
implementation very simple; for example, User Friendly's code is:
    scraper.moveToTagWith("ALT", "Latest Strip");
    return scraper.get("IMG[SRC]");
UrlScraper assumes that the result of the parse will be a URL in
String form, which can then be configured to be stored in a file.
For data, you extend AbstractParser and implement:
    public Result parse(Page page, Config cfg, Session session)
            throws ParsingException;
UrlScraper,
    http://svn.osjava.org/cgi-bin/viewcvs.cgi/trunk/scraping-engine/src/java/org/osjava/scraping/parser/UrlScraper.java?rev=1150&view=auto
is the only online example of this.
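As a rough illustration of what such a plug-in looks like, here is a self-contained sketch. Only the type names and the parse signature come from this mail; every class body below (Page, Config, Session, Result, ParsingException, and this AbstractParser) is a stand-in I have assumed so the example compiles on its own, and the real scraping-engine classes will differ. DelimitedTextParser and its "open"/"close" config keys are likewise hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in types: the names mirror the signature above, the bodies are assumptions.
class ParsingException extends Exception {
    ParsingException(String message) { super(message); }
}

class Page {
    private final String text;
    Page(String text) { this.text = text; }
    String getText() { return text; }
}

class Config {
    private final Map<String, String> values = new HashMap<>();
    Config set(String key, String value) { values.put(key, value); return this; }
    String get(String key) { return values.get(key); }
}

class Session { }

class Result {
    private final Object value;
    Result(Object value) { this.value = value; }
    Object getValue() { return value; }
}

abstract class AbstractParser {
    public abstract Result parse(Page page, Config cfg, Session session)
            throws ParsingException;
}

/** Hypothetical plug-in: returns the text between two configured markers. */
class DelimitedTextParser extends AbstractParser {
    @Override
    public Result parse(Page page, Config cfg, Session session)
            throws ParsingException {
        String open = cfg.get("open");
        String close = cfg.get("close");
        if (open == null || close == null) {
            throw new ParsingException("config needs 'open' and 'close'");
        }
        String text = page.getText();
        int start = text.indexOf(open);
        if (start < 0) {
            throw new ParsingException("marker not found: " + open);
        }
        start += open.length();
        int end = text.indexOf(close, start);
        if (end < 0) {
            throw new ParsingException("marker not found: " + close);
        }
        return new Result(text.substring(start, end).trim());
    }
}

class ParserDemo {
    public static void main(String[] args) throws ParsingException {
        Config cfg = new Config().set("open", "<title>").set("close", "</title>");
        Page page = new Page("<html><title> Hello </title></html>");
        Result result = new DelimitedTextParser().parse(page, cfg, new Session());
        System.out.println(result.getValue()); // prints Hello
    }
}
```

In the real engine the fetching, scheduling and storing happen around this call; the parser only has to turn a Page into a Result.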
Anyways. Got to take a baby to the doctor for a checkup. How did all that sound?
Hen
On Mon, 22 Nov 2004 06:46:00 -0500, Brant Hahn <[EMAIL PROTECTED]> wrote:
> Hi Henri,
>
> Sorry for the extra email here. I think I am also particularly interested
> in differentiating between the Page, Fetchers, Parsers, and Scrapers. I'm
> just trying to distinguish who does what. I think my primary confusion goes
> between HttpFetcher and HtmlScraper. What exactly is the difference, and in
> what situations would I use one over the other?
>
> BTW, besides some of the minor confusion I have, I feel that this is what
> I've been looking for. I'm primarily wanting to use it to log-in to my
> financial accounts and to read my monthly statements that are online. I
> definitely appreciate you informing me about your components. Your help is
> greatly appreciated!
>
> -Brant
>
>
>
> -----Original Message-----
> From: Henri Yandell [mailto:[EMAIL PROTECTED]
> Sent: Saturday, November 20, 2004 7:06 PM
> To: Jakarta Commons Users List
> Subject: Re: [HttpClient] Screen Scraping Components?
>
> Couple of components that might be of interest.
>
> http://www.osjava.org/genjava/multiproject/gj-scrape/
>
> Firstly a library for scraping a web page. It's a wrapper around
> simple string manipulation aimed to let you specify what you want from
> the page without parsing into an XML tree, or trying to use regex. The
> problem with the XML tree is that it means your scraper hits too much
> of the page and is more unstable. Scraping is about minimising the
> surface-area you touch to as little as possible, hopefully just the
> data itself.
>
> Regexes are useful for grabbing the data once you get close enough,
> but are not the right thing to use to walk through the tags. Gj-Scrape
> is a basic API for walking through a page.
>
> Secondly, an engine for scraping:
>
> http://www.osjava.org/scraping-engine/
>
> A lot of time with scrapers is wasted writing the surrounding code.
> Getting the page, setting up the config in some cron'd way, putting it
> in a db etc. Scraping-engine is everything except for the actual
> parsing of the page, which you custom create using gj-scrape and plug
> in.
>
> It uses HttpClient for its page-grabbing, and isn't tied to scraping;
> I've a link-checker written using it as the framework. Grabbing the
> cartoon-scraping example is the best way to understand it.
>
> Hen
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>