On Wed, 5 Apr 2006, Peter Stevens wrote:

>     ... and until someone writes a Javascript interpreter for
>     Perl or a Mechanize clone to control Firefox, there will be no
>     general solution.

That's actually not quite accurate.

There *is* a JavaScript interpreter for Perl (JavaScript::SpiderMonkey)
on CPAN.

The problem is not the interpreter. JavaScript::SpiderMonkey lets you
run arbitrary JavaScript code in Perl, pass parameters from Perl to
JavaScript, and call back into Perl from JavaScript.
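
For the curious, a minimal round trip looks roughly like this (typed
from memory, untested):

    use JavaScript::SpiderMonkey;

    my $js = JavaScript::SpiderMonkey->new();
    $js->init();   # set up runtime, context and the global object

    # let JavaScript call back into Perl
    $js->function_set("perl_log", sub { print "JS says: @_\n" });

    # run arbitrary JavaScript; errors end up in $@
    $js->eval(q{
        var msg = "Hello from SpiderMonkey";
        perl_log(msg);
    }) or warn "eval failed: $@";

    $js->destroy();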

The problem is the browser DOM, which a browser's JavaScript interpreter
has pre-loaded: the different HTML parts are exposed as pre-loaded
JavaScript objects, and methods like onClick() are predefined.
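
Continuing the snippet above, you can already fake individual pieces of
that by hand (again untested, and the property/function names are just
examples):

    # pre-define a tiny slice of a "DOM" before running page code
    $js->property_by_path("document.location.href");
    $js->function_set("alert", sub { print "alert(): @_\n" });

    $js->eval(q{
        document.location.href = "http://example.com/" + "login";
        alert(document.location.href);
    }) or warn "eval failed: $@";

    # back on the Perl side
    my $href = $js->property_get("document.location.href");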

As soon as someone gets going and comes up with a reference implementation
(every browser naturally has its own DOM implementation, which is why IE
and Firefox behave differently at times), WWW::Mech is in business.

How cool would that be!

-- Mike

Mike Schilli
[EMAIL PROTECTED]

> But if you want to scrape specific pages, then a
>     solution is always possible.
>
>     One typical use of Javascript is to perform argument checking before
>     posting to the server. The URL you want is probably just buried in
>     the Javascript function. Do a regular expression match on
>     |$mech->content()| to find the link that you want and |$mech->get|
>     it directly (this assumes that you know what you are looking for in
>     advance).
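>
>     For example (an untested sketch; the actual pattern depends entirely
>     on the page's script):
>
>         # hypothetical: the target URL is buried in an onclick handler
>         my ($url) = $mech->content() =~ /window\.open\('([^']+)'/;
>         $mech->get($url) if defined $url;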
>
>     In more difficult cases, the Javascript is used for URL mangling to
>     satisfy the needs of some middleware. In this case you need to
>     figure out what the Javascript is doing (why are these URLs always
>     really long?). There is probably some function with one or more
>     arguments which calculates the new URL. Step one: using your
>     favorite browser, get the before and after URLs and save them to
>     files. Edit each file, converting the argument separators ('?',
>     '&' or ';') into newlines. Now it is easy to use diff or comm to
>     find out what the Javascript did to the URL. Step two: find the
>     function call which created the URL - you will need to parse and interpret
>     its argument list. Using the Javascript Debugger Extension for
>     Firefox may help with the analysis. At this point, it is fairly
>     trivial to write your own function which emulates the Javascript for
>     the pages you want to process.
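>
>     From the shell, the separator-to-newline step might look something
>     like this (untested; before.url and after.url are the files saved
>     in step one):
>
>         perl -pe 's/[?&;]/\n/g' before.url > before.txt
>         perl -pe 's/[?&;]/\n/g' after.url  > after.txt
>         diff before.txt after.txt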
>
> Please append to it:
>
>     An Alternative Approach (this is also an answer to the question, "It
>     works in Firefox, why not in $mech?")
>
>     Everything the web server knows about the client is present in the
>     HTTP request. If two requests are identical, the results should be
>     identical. So the real question is "What is different between the
>     mech request and the Firefox request?"
>
>     I would suggest using the Firefox extension "Tamper Data" to look at
>     the headers of the requests you send to the server. Compare that
>     with what LWP is sending. Once the two are identical, the action of
>     the server should be the same as well.
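>
>     One simple (if after-the-fact) way to see exactly what mech sent is
>     to dump the request attached to the response it got back, e.g.:
>
>         $mech->get('http://www.example.com/login');
>         # the HTTP::Response remembers the HTTP::Request that produced it
>         print $mech->response->request->as_string;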
>
>     I say "should" because this is an oversimplification. Some values
>     are naturally unique, e.g. a SessionID, but if a SessionID is present
>     at all, that is probably sufficient, even though its value will
>     differ between the LWP request and the Firefox request. The server
>     could use the session to store information which is troublesome, but
>     that's not the first place to look (and it is highly unlikely to be
>     relevant when you are requesting the login page of your site).
>
>     Generally the problem is to be found in missing or incorrect
>     POSTDATA arguments, Cookies, User-Agents, Accepts, etc. If you are
>     using mech, then redirects and cookies should not be a problem, but
>     are listed here for completeness. If you are missing headers,
>     $mech->add_header can be used to add the headers that you need.
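>
>     For example (the header names and values here are only
>     illustrative):
>
>         $mech->add_header(
>             Referer => 'http://www.example.com/login.html',
>             Accept  => 'text/html,application/xhtml+xml',
>         );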
>
> Is there a preferred way to get the request which mech is going to send?
> I was able to get it by following the code into the innards of
> HTTP::Request, but that seems like the kind of stuff a $mechanize user
> won't want to do.
>
> Cheers,
>
> Peter
>
>
>
> Cahoon, Forrest wrote:
> > If you're specifically looking at Yahoo! Mail, there's at least one CPAN 
> > module for that:
> > http://search.cpan.org/~johnsca/MailClientYahoo-1.0/lib/Mail/Client/Yahoo.pm
> >
> > If it's just something similar to Yahoo!, perhaps that code will give you 
> > some clues.
> > (I haven't used that module myself, just happened to notice its existence.)
> >
> > Forrest
> > not speaking for merrill corporation
> >
> >
> >> -----Original Message-----
> >> From: Roy Lor [mailto:[EMAIL PROTECTED]
> >> Sent: Tuesday, April 04, 2006 8:21 AM
> >> To: libwww@perl.org
> >> Subject: WWW::Mechanize
> >>
> >> Can you give me a code/script that records the information in a
> >> login form with JavaScript, like that of mail.yahoo.com? I
> >> badly need this. Thanks
> >>
> >>
