Ryan Perry wrote:
On Jun 16, 2006, at 2:38 PM, Peter Stevens wrote:
Yes and no. AJAX relies heavily on Javascript, so you'll be
spending a lot of time with Tamper Data to figure out what data is
being sent back and forth. You can also read and reverse engineer the
Javascript.
Can it be done? Certainly. Is it easy? No. Is it practical? Well, that
depends on the complexity of the Javascript environment and how much
time & energy you have for the problem.
We need a Javascript + DOM implementation on top of Mechanize!
I strongly agree. It's the future and we need to be there. How would
one go about this? I'd be happy to contribute. Where do I start?
Thanks!
Well, this may get me fried, but I do feel compelled to comment because
my day-to-day job is to create code to do web scraping on some extremely
complex web sites, many of which use a lot of AJAX and DHTML (Javascript
and *shudder* VBS).
My first web scraper was written in Perl/Mechanize, and we managed to
get past the DHTML portions by decoding the Javascript and returning
responses that took into account the Javascript that *would* have run.
It ran, but took over 3 months to write, and THAT site was well written,
with nice IDs and no AJAX. I still had to assume a few page locations
and do blind posts because of complex Javascript that did page redirects
to pages whose Javascript did MORE page redirects.
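To give a rough idea of what "decoding the Javascript" looks like in
practice, here is a minimal sketch in Ruby (not the original
Perl/Mechanize code). It assumes the page redirects with a plain string
assignment to window.location, and that you already know which form
fields the script would have posted. The names js_redirect_target and
build_blind_post, the example.com URLs, and the field names are all
hypothetical and only there for illustration.

```ruby
require 'net/http'
require 'uri'

# Matches only simple literal assignments such as:
#   window.location = "/next.html";
#   document.location.href = 'page2.html';
# Targets computed at runtime (concatenation, function calls) are NOT
# handled; those still have to be worked out by hand.
JS_REDIRECT = /(?:window|document|self|top)\.location(?:\.href)?\s*=\s*["']([^"']+)["']/

# Return the absolute URL the page's script would redirect to,
# or nil if no simple redirect is found.
def js_redirect_target(html, base_url)
  match = JS_REDIRECT.match(html)
  return nil unless match
  URI.join(base_url, match[1]).to_s
end

# Build (but do not send) the POST the script would have submitted.
# This is the "blind post": the fields are assumed, not read from the page.
def build_blind_post(url, fields)
  request = Net::HTTP::Post.new(URI(url).request_uri)
  request.set_form_data(fields)
  request
end

# Hypothetical usage:
html = '<script>window.location = "/next.html";</script>'
target = js_redirect_target(html, 'http://example.com/app/start.html')
post = build_blind_post('http://example.com/app/submit',
                        'user' => 'demo', 'step' => '2')
```

Every redirect or post that the regex cannot see has to be hard-coded
like this, which is where most of those 3 months went.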
After doing some research for upcoming projects scraping sites that use
AJAX and more complex Javascript, I found a tool called Watir
(http://openqa.org/watir/). Watir is a library written in Ruby (which
we were already starting to use) and gets around the whole
Javascript/AJAX issue by automating Internet Explorer, so all
scripts actually run. While having to run it under Windows and only
having support for IE was the downside, the upside has been that our
scrapers have been much easier to write, and the last one I did took
about two weeks, and it does a lot more than our first one.
One thing that still gave us problems until recently was the issue of
pop-up windows. While Watir had a rather crude way of clicking past a
pop-up window if you knew it was coming, modal dialogs were still hard
to automate because they can have any valid HTML, but the IE "click"
would block until the dialog closed, and there was no way to attach to
the modal dialog and get access to the DOM if you did the click in
another thread/process. Finally someone put together an intricate
method to attach to a modal dialog window by using the current IE's
HWND, and then link the pointers together to get access to the modal
dialog's DOM.
Now, I can automate a modal dialog window as easily as a normal browser
window. Here's the code to fire up IE to a page, click a button which
brings up a modal dialog, attach to that dialog, fill a text box on the
dialog, click on the dialog close box, and retrieve the entered value
from the original window. All in the following few lines of Ruby/Watir
code:
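require 'watir'
include Watir
# Start IE and load the page that launches the dialog.
ie = IE.new
ie.goto('http://SITE/modal_dialog_launcher.html')
# click_no_wait returns immediately, so the script is not blocked
# while the modal dialog is open.
ie.button(:value, 'Launch Dialog').click_no_wait
# Attach to the dialog, fill in its text box, and close it.
ie.modal_dialog.text_field(:name, 'modal_text').set('hello')
ie.modal_dialog.button(:value, 'Close').click
# Read the value the dialog passed back to the original window.
modal_text = ie.text_field(:name, 'modaloutput').value
ie.close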
That is code I just executed in Ruby's interactive shell (IRB) on one of
the HTML files in the Watir unit test suite.
Now, to get that functionality you need to check out the latest
development version from SVN, as I only recently added modal_dialog,
but it does work. (I just put together the
pieces I found in a number of places to get that functionality, so I
can't take much credit, but I did submit it to the Watir project.)
There is a project called FireWatir which is aimed at using Firefox
(under any O/S that Firefox runs under), but it's still lagging a bit
behind, and performance is still very poor from what I've heard. But
there is hope.
For now, I'd recommend checking out Watir for your web automation
projects, if you can get away with using IE under Windows.
David Schmidt