Elwin wrote:
 For example: <a href="javascript:customCss(6017162)"
 id="customCssMenu" >test</a>. Can Nutch get content from
 this kind of URL?



Not without some drastic changes... I have an early implementation of a fetcher that uses the HttpUnit library to actually interpret the JavaScript and mimic a browser's behavior. The problem is that it's very slow: the current fetcher implementation is stateless, whereas one that supports JavaScript needs to be stateful, and it needs to retrieve multiple resources in one go (e.g. CSS, frames, script files, the main body, etc). Then, discovering all outlinks requires a simulated "click" on every active element, which in turn requires executing all scripts associated with all current windows. If the scripts are not idempotent, you need to simulate the "Back" button, or drop and reload everything to restore the previous state ...
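
To make the state problem concrete, here is a rough toy sketch in Python. It does not use HttpUnit or any Nutch code; the page model, the element names and the example.com URLs are all made up for illustration. Each "active element" has a handler that mutates the page state and returns the URL a click would lead to, and the discovery loop reloads a fresh copy of the state before every click.

```python
import copy


def make_page():
    """Build a toy page: mutable state plus click handlers (all hypothetical)."""
    state = {"counter": 0}

    def next_link(s):
        # Non-idempotent script: every click advances a counter.
        s["counter"] += 1
        return f"http://example.com/item/{s['counter']}"

    def css_link(s):
        # Reads the state that other scripts may have changed.
        return f"http://example.com/css/{s['counter']}"

    return state, {"next": next_link, "customCssMenu": css_link}


def discover_outlinks(state, elements):
    """Click every element once, restoring the initial state before each click."""
    outlinks = {}
    for name, handler in elements.items():
        fresh = copy.deepcopy(state)  # the "drop and reload everything" step
        outlinks[name] = handler(fresh)
    return outlinks


def discover_without_restore(state, elements):
    """Same loop, but all clicks share one state, so earlier clicks leak into later ones."""
    working = copy.deepcopy(state)
    return {name: handler(working) for name, handler in elements.items()}


if __name__ == "__main__":
    state, elements = make_page()
    print(discover_outlinks(state, elements))
    print(discover_without_restore(state, elements))
```

In this toy the reload is a cheap deepcopy. In a real fetcher it would mean re-fetching and re-executing the page for every element, which is where most of the slowness comes from.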

So, it's not easy. Your best bet would be to use a separate fetcher to fetch these problematic sites, and use the standard fetcher to fetch everything else.
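
One possible way to do that split, sketched in Python rather than as Nutch code: partition the fetch list by host before fetching. The host set and the function name below are hypothetical, and how you decide which sites are "problematic" is up to you.

```python
from urllib.parse import urlsplit

# Hypothetical, hand-maintained set of hosts that rely on javascript: links.
SCRIPT_HEAVY_HOSTS = {"js.example.com"}


def split_fetch_list(urls, script_heavy_hosts=SCRIPT_HEAVY_HOSTS):
    """Return (standard, scripted) URL lists, preserving input order."""
    standard, scripted = [], []
    for url in urls:
        # hostname is already lower-cased by urlsplit; it is None for odd URLs.
        host = urlsplit(url).hostname or ""
        if host in script_heavy_hosts:
            scripted.append(url)
        else:
            standard.append(url)
    return standard, scripted


if __name__ == "__main__":
    std, js = split_fetch_list([
        "http://js.example.com/a",
        "http://example.org/b",
    ])
    print("standard fetcher:", std)
    print("script-aware fetcher:", js)
```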

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
