On Thu, 2007-02-22 at 20:39 -0800, Stefano Mazzocchi wrote:

...

> Replicating such an environment on a server practically meant either to:
>
> ...
>
> 2) find a way to use firefox's own code on the server
>
> Knowing that #1 would probably require years of polishing to reach the
> level that Firefox/Mozilla has reached over 8 years of development, I
> turned my attention to #2 and started working on using JavaXPCOM (a
> Java->XPCOM bridge used in recent Eclipse plugins for Ajax support).

...

> So, I was waiting for things on JavaXPCOM to solidify until, yesterday,
> I had an idea: decouple the crawling logic from the actual page fetcher
> and implement a very minimal HTTP server in JavaScript that turns the
> web browser into a headless browsing web service.
>
> And so I did at
>
> http://simile.mit.edu/repository/crowbar/trunk/
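For what a client of such a headless browsing web service might look like, here is a minimal Python sketch. The endpoint address and the `url`/`delay` parameter names are assumptions for illustration only, not taken from Crowbar's README -- check the actual instance you launch.

```python
from urllib.parse import urlencode

# Hypothetical endpoint of a locally running Crowbar instance; the real
# host and port depend on how you launch it under XULRunner (assumption).
CROWBAR = "http://127.0.0.1:10000/"

def crowbar_request(url, delay=3):
    """Build the RESTful fetch URL: ask the service to load `url`, let
    client-side JavaScript run for `delay` seconds, then return the
    serialized DOM of the resulting page."""
    return CROWBAR + "?" + urlencode({"url": url, "delay": delay})

req = crowbar_request("http://maps.google.com/")

# To actually fetch the serialized DOM you would then do something like:
#   import urllib.request
#   xhtml = urllib.request.urlopen(req).read()
```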
Wow. Congrats. I am currently developing a generic robot framework named
Droids (Stefano knows it), and I am fascinated by this announcement,
since I can really use something like this to extract links to resources
and Ajax-encapsulated content. I will check this out very soon. Thanks.

> you find Crowbar: a XUL application (basically a hyper-stripped-down
> firefox) that you can execute with XULRunner (basically the XUL
> equivalent of a java virtual machine) [see the README.txt for more
> info] and that you can use from a remote machine as a fetching and
> DOM-serializing web service.
>
> Right now, it does not scrape, but it fetches the URL that you pass it
> thru a RESTful web service, executes the javascript, builds the DOM,
> waits 3 seconds, and returns you the serialization of the page DOM.
>
> Might seem rather pointless, but this is a major milestone, and here is why:
>
> 1) I'm able to obtain a serialized and guaranteed well-formed
> representation of any HTML page, no matter how complicated and no
> matter how much client-side manipulation is present. This is not a way
> to use the browser's own internals instead of, say, wget, but a
> radically different approach to crawling. For example, the result of
> "wget http://maps.google.com/" is drastically different from the
> crowbar equivalent, due to all the javascript action that happens only
> on the client side! Here, as I'm in fact using a real browser to do
> the fetching, the result is precisely the same as if you were looking
> at the page.

Meaning I can request a URL via crowbar and it will return me well-formed
XHTML with all JavaScript calls executed, right?

...

> 3) crowbar's web service will also perform query operations on the
> resulting DOM directly; for example, as a way to obtain links it's
> sufficient to ask for the "//A" xpath. This will radically simplify
> the architecture of the crawling agents that will drive the fetching
> frontends.

Yeah!
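Since the returned serialization is guaranteed well-formed, the client-side link-extraction step (the "//A" xpath of point 3) reduces to a plain XML parse. A sketch of what a consumer such as a Droids robot might do with the document it gets back -- the sample XHTML below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Made-up sample of what a well-formed DOM serialization might look
# like; note the XHTML namespace on the root element.
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="http://example.org/one">one</a>
    <p><a href="http://example.org/two">two</a></p>
  </body>
</html>"""

NS = {"h": "http://www.w3.org/1999/xhtml"}

def extract_links(doc):
    """Return all href values, i.e. the client-side equivalent of
    asking for the //A xpath."""
    root = ET.fromstring(doc)
    return [a.get("href") for a in root.iterfind(".//h:a", NS)]

links = extract_links(xhtml)
# -> ['http://example.org/one', 'http://example.org/two']
```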
Meaning I can now request, for link extraction, something like:

local/crowbar?url=http://something.com/&xpath=//a&xpath=//A

Then crowbar will return me an array of DOM elements that match the
xpath, right?

> 4) crowbar is now automatically using all the caching mechanisms that
> the browser uses.

Meaning if I use a certain instance of crowbar for 10 runs of crawling,
I do not have to worry anymore about implementing caching in my crawler
apps, since requesting the page 10 times will actually fetch it only
once and serve it 9 times from cache (if caching is activated in the
instance). Awesome.

> There is still a lot of work to be done before I can see people using
> this for real, but I wanted to advertise the fact that it's now
> starting to function, and we have a clear design direction that is
> much easier and more solid to work with, so that other interested
> parties might come in and help out.

I will check it out and will certainly come back with some questions.
Which mailing list should I use?

salu2
--
Thorsten Scherler                          thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions

_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
