On Thu, 2007-02-22 at 20:39 -0800, Stefano Mazzocchi wrote:

...

> Replicating such an environment on a server practically meant either to:
>
> ...
>
> 2) find a way to use firefox's own code on the server
>
> Knowing that #1 would probably require years of polishing to reach the
> level that Firefox/Mozilla has reached over 8 years of development, I
> turned my attention to #2 and started working on using JavaXPCOM (a
> Java->XPCOM bridge used in recent Eclipse plugins for Ajax support).

...

> So, I was waiting for things on JavaXPCOM to solidify until, yesterday,
> I had an idea: decouple the crawling logic from the actual page fetcher
> and implement a very minimal HTTP server in JavaScript that turns the
> web browser into a headless browsing web service.
>
> And so I did at
>
> http://simile.mit.edu/repository/crowbar/trunk/
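For what a client of such a headless browsing web service might look like, here is a minimal Python sketch. The endpoint address and the `url`/`delay` parameter names are assumptions for illustration only, not taken from Crowbar's README -- check the actual instance you launch.

```python
from urllib.parse import urlencode

# Hypothetical endpoint of a locally running Crowbar instance; the real
# host and port depend on how you launch it under XULRunner (assumption).
CROWBAR = "http://127.0.0.1:10000/"

def crowbar_request(url, delay=3):
    """Build the RESTful fetch URL: ask the service to load `url`, let
    client-side JavaScript run for `delay` seconds, then return the
    serialized DOM of the resulting page."""
    return CROWBAR + "?" + urlencode({"url": url, "delay": delay})

req = crowbar_request("http://maps.google.com/")

# To actually fetch the serialized DOM you would then do something like:
#   import urllib.request
#   xhtml = urllib.request.urlopen(req).read()
```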
Wow. Congrats. I am currently developing a generic robot framework named
Droids (Stefano knows it), and I am fascinated by this announcement,
since I can really use something like this to extract links to resources
and Ajax-encapsulated content. I will check this out very soon. Thanks.

> you find Crowbar: a XUL application (basically a hyper-stripped-down
> firefox) that you can execute with XULRunner (basically the XUL
> equivalent of a java virtual machine) [see the README.txt for more
> info] and that you can use from a remote machine as a fetching and
> DOM-serializing web service.
>
> Right now, it does not scrape, but it fetches the URL that you pass it
> thru a RESTful web service, executes the javascript, builds the DOM,
> waits 3 seconds, and returns you the serialization of the page DOM.
>
> Might seem rather pointless, but this is a major milestone, and here is why:
>
> 1) I'm able to obtain a serialized and guaranteed well-formed
> representation of any HTML page, no matter how complicated and no
> matter how much client-side manipulation is present. This is not a way
> to use the browser's own internals instead of, say, wget, but a
> radically different approach to crawling. For example, the result of
> "wget http://maps.google.com/" is drastically different from the
> crowbar equivalent, due to all the javascript action that happens only
> on the client side! Here, as I'm in fact using a real browser to do
> the fetching, the result is precisely the same as if you were looking
> at the page.

Meaning I can request a URL via crowbar and it will return me well-formed
XHTML with all JavaScript calls executed, right?

...

> 3) crowbar's web service will also perform query operations on the
> resulting DOM directly; for example, as a way to obtain links it's
> sufficient to ask for the "//A" xpath. This will radically simplify
> the architecture of the crawling agents that will drive the fetching
> frontends.

Yeah!
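Since the returned serialization is guaranteed well-formed, the client-side link-extraction step (the "//A" xpath of point 3) reduces to a plain XML parse. A sketch of what a consumer such as a Droids robot might do with the document it gets back -- the sample XHTML below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Made-up sample of what a well-formed DOM serialization might look
# like; note the XHTML namespace on the root element.
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="http://example.org/one">one</a>
    <p><a href="http://example.org/two">two</a></p>
  </body>
</html>"""

NS = {"h": "http://www.w3.org/1999/xhtml"}

def extract_links(doc):
    """Return all href values, i.e. the client-side equivalent of
    asking for the //A xpath."""
    root = ET.fromstring(doc)
    return [a.get("href") for a in root.iterfind(".//h:a", NS)]

links = extract_links(xhtml)
# -> ['http://example.org/one', 'http://example.org/two']
```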
Meaning I can now request, for link extraction, something like:

local/crowbar?url=http://something.com/&xpath=//a&xpath=//A

Then crowbar will return me an array of DOM elements that match the
xpath, right?

> 4) crowbar is now automatically using all the caching mechanisms that
> the browser uses.

Meaning if I use a certain instance of crowbar for 10 runs of crawling,
I do not have to worry anymore about implementing caching in my crawler
apps, since requesting the page 10 times will actually fetch it only
once and serve it 9 times from cache (if caching is activated in the
instance). Awesome.

> There is still a lot of work to be done before I can see people using
> this for real, but I wanted to advertise the fact that it's now
> starting to function, and we have a clear design direction that is
> much easier and more solid to work with, so that other interested
> parties might come in and help out.

I will check it out and will certainly come back with some questions.
Which mailing list should I use?

salu2
--
Thorsten Scherler                          thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions

_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
