Re: [ann] Crowbar's first milestone lands in Simile's subversion - a big step closer to a scraping crawler

Johan Sundström Fri, 23 Feb 2007 04:00:07 -0800

On 2/23/07, Stefano Mazzocchi <[EMAIL PROTECTED]> wrote:
> Ever since working on Solvent and Piggy Bank, we have been toying with
> the idea of using the same javascript scrapers to power a server-side
> headless crawling agent that could perform data extraction and scraping
> in a more automated way.


Count me in. :-)

> Right now, it does not scrape, but it fetches the URL that you pass it
> thru a RESTful web service, it executes the javascript and builds the
> DOM, waits 3 seconds and returns you the serialization of the page DOM.

For lack of a reliable "(javascript) rendering done" callback, I presume?
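
To make the question concrete, here is a minimal sketch of one alternative to a fixed delay: poll the page until two consecutive DOM serializations match, and give up after a timeout. This is not how Crowbar works (as described above, it simply waits 3 seconds); `wait_until_stable` and the `snapshot` callable, which would return the current DOM serialization, are hypothetical names used only for illustration.

```python
import time
from typing import Callable


def wait_until_stable(
    snapshot: Callable[[], str],
    interval: float = 0.5,
    timeout: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll snapshot() until two consecutive results match or timeout elapses.

    snapshot is a hypothetical callable returning the current DOM
    serialization; sleep is injectable so the loop can be tested quickly.
    """
    previous = snapshot()
    waited = 0.0
    while waited < timeout:
        sleep(interval)
        waited += interval
        current = snapshot()
        if current == previous:
            # Nothing changed since the last poll: treat the page as settled.
            return current
        previous = current
    # Timed out: return the latest serialization we saw.
    return previous
```

A page that keeps mutating forever (tickers, animations) would still hit the timeout, so this is only a heuristic, not a real "rendering done" signal.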

>  3) crowbar's web service will also perform query operations on the
> resulting DOM directly, for example as a way to obtain links it's
> sufficient to ask for the "//A" xpath. This will radically simplify the
> architecture of the crawling agents that will drive the fetching frontends.

Nice. More info on how to do that (or a pointer to where to RTFM) would be much appreciated. :-)
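
Until that is documented, here is a rough client-side sketch of what asking for "//A" amounts to, assuming the serialized DOM comes back as well-formed XML (an assumption on my part, not something the announcement states). It uses only the Python standard library; `extract_links` is a hypothetical helper, and it matches element names case-insensitively and ignores namespaces, since the serialization might use `A`, `a`, or an XHTML-namespaced `a`.

```python
import xml.etree.ElementTree as ET


def extract_links(serialized_dom: str) -> list[str]:
    """Return the href of every anchor element, in document order.

    Assumes serialized_dom is well-formed XML; real-world HTML would
    need a forgiving parser instead.
    """
    root = ET.fromstring(serialized_dom)
    links = []
    for element in root.iter():
        # Drop a "{namespace}" prefix if present, then compare case-insensitively.
        local_name = element.tag.rsplit("}", 1)[-1].lower()
        href = element.get("href")
        if local_name == "a" and href:
            links.append(href)
    return links


page = (
    '<html><body>'
    '<A href="http://simile.mit.edu/">Simile</A>'
    '<p><a href="/crowbar">Crowbar</a></p>'
    '<a name="top">no href here</a>'
    '</body></html>'
)
print(extract_links(page))
```

Having the service answer such a query directly would spare every crawling agent this parsing step.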

> There is still a lot of work to be done before I can see people using
> this for real, but I wanted to advertise the fact that it's now starting
> to function and we have a clear design direction that is much easier and
> solid to work with so that other interested parties might come in and
> help out.

I'm fairly sure I'll join in, here too. Feel free to drop over to the
[EMAIL PROTECTED] as and when you find it appropriate.

-- 
 / Johan Sundström, http://ecmanaut.blogspot.com/

_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
