[Nutch-general] Re: How to deal with javascript urls?

Andrzej Bialecki Mon, 24 Apr 2006 11:46:04 -0700

Elwin wrote:

 for example: <a href="javascript:customCss(6017162)"
 id="customCssMenu" >test</a> in fact, can nutch get content from such
 kind of urls?

Not without some drastic changes... I have an early implementation of a fetcher that uses the httpunit library to actually interpret the javascript and mimic a browser's behavior. The problem is that it's very slow: the current fetcher implementation is stateless, while one that would support javascript needs to be stateful, and it needs to retrieve multiple resources in one go (e.g. CSS, frames, script files, the main body, etc). Then, discovering all outlinks requires a simulated "click" on all active elements, which in turn requires executing all scripts associated with all current windows. If scripts are not idempotent, you need to simulate the "Back" button, or drop/reload everything to restore the previous state ...
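
To make the state problem concrete, here is a toy sketch in plain Java. It does not use httpunit or any Nutch class; ClickExplorer, Script and discoverOutlinks are hypothetical names, and the page state is reduced to a simple map. It only illustrates the "drop/reload everything" strategy: every simulated click starts from a fresh copy of the loaded state, so a non-idempotent script cannot affect what the next click sees.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only (assumption): a real implementation would drive a
// browser emulator instead of mutating a map.
class ClickExplorer {

    // A script attached to an active element. It may change the page
    // state and may return a URL it navigates to, or null if it does not.
    interface Script {
        String run(Map<String, String> state);
    }

    // Simulates a "click" on every active element and collects the outlinks.
    static Set<String> discoverOutlinks(Map<String, String> loadedState,
                                        List<Script> activeElements) {
        Set<String> outlinks = new LinkedHashSet<>();
        for (Script script : activeElements) {
            // "Reload": copy the state as it was right after loading, so the
            // side effects of earlier clicks are discarded.
            Map<String, String> fresh = new HashMap<>(loadedState);
            String target = script.run(fresh);
            if (target != null) {
                outlinks.add(target);
            }
        }
        return outlinks;
    }
}
```

Even in this toy form the cost is visible: one full state restore per active element, which in a real fetcher means re-fetching or re-evaluating the page and its resources each time.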

So, it's not easy. Your best bet would be to use a separate fetcher to fetch these problematic sites, and use the standard fetcher to fetch everything else.
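
A minimal sketch of how such a split could be decided, assuming you record which hosts serve pages with javascript: links and send only those hosts to the separate fetcher. FetcherRouter, recordLink and route are hypothetical names for illustration, not part of the Nutch API, and how the two fetchers are actually invoked is left out.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (assumption): decides which fetcher a URL should go to.
class FetcherRouter {

    enum Target { STANDARD, SCRIPTED }

    // Hosts on which at least one javascript: link has been seen.
    private final Set<String> scriptedHosts = new HashSet<>();

    // True for hrefs such as "javascript:customCss(6017162)".
    static boolean isJavascriptHref(String href) {
        return href != null
            && href.trim().toLowerCase(Locale.ROOT).startsWith("javascript:");
    }

    // Lower-cased host of a URL, or "" if it has none or cannot be parsed.
    static String hostOf(String url) {
        if (url == null) {
            return "";
        }
        try {
            String host = URI.create(url.trim()).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    // Call for every link found while parsing a page.
    void recordLink(String pageUrl, String href) {
        if (isJavascriptHref(href)) {
            String host = hostOf(pageUrl);
            if (!host.isEmpty()) {
                scriptedHosts.add(host);
            }
        }
    }

    // Pick the fetcher for a URL based on what has been recorded so far.
    Target route(String url) {
        return scriptedHosts.contains(hostOf(url)) ? Target.SCRIPTED : Target.STANDARD;
    }
}
```

Routing by host rather than by individual link is a design choice here: a javascript: href cannot be fetched on its own, so it is the containing site that gets handed to the slower fetcher.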


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




