Andrzej Bialecki wrote:



Have you ever thought about integrating a JavaScript interpreter into Nutch? This could be another big step towards a wider range of crawlable websites. If you need any help with this, I would be very interested in supporting (time-wise) anybody implementing such functionality.


There is a simple JavaScript link extractor in the 0.7-dev version. I've been tinkering with a full JavaScript parser (Rhino), and found it too heavyweight (most importantly), and also quite incompatible with the way Nutch works.

I will try the simple extractor and see how many of the links it covers. I thought about Rhino because it could execute all the possible ways of integrating JavaScript into a DHTML site. But most of the cases can probably be covered by a much smaller subset of JavaScript.
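To make the "smaller subset" idea concrete, here is a minimal sketch in Java of a heuristic extractor that never executes any script: it just scans JavaScript source for quoted string literals and keeps the ones that look like URLs. This is not the extractor shipped in 0.7-dev; the class name SimpleJsLinkExtractor, the two patterns, and the list of file extensions are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: a regex-only link guesser.
class SimpleJsLinkExtractor {

    // Any single- or double-quoted literal without whitespace or quotes inside.
    private static final Pattern QUOTED =
        Pattern.compile("[\"']([^\"'\\s]+)[\"']");

    // Assumed heuristic: absolute http(s) URLs, paths starting with "/",
    // "./" or "../", or names ending in a typical page extension.
    private static final Pattern URL_LIKE = Pattern.compile(
        "(?:(?:https?://|/|\\.{1,2}/)\\S+"
        + "|\\S+\\.(?:html?|php|jsp|asp)(?:[?#]\\S*)?)",
        Pattern.CASE_INSENSITIVE);

    // Returns the URL-like string literals in order of first appearance,
    // without duplicates.
    static List<String> extract(String script) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = QUOTED.matcher(script);
        while (m.find()) {
            String candidate = m.group(1);
            if (URL_LIKE.matcher(candidate).matches()) {
                links.add(candidate);
            }
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        String js = "window.location = 'http://example.com/a.html';"
                  + " function go() { window.open(\"page2.html\"); }";
        for (String link : extract(js)) {
            System.out.println(link);
        }
    }
}
```

A scan like this misses any URL that is assembled at runtime (string concatenation, document.write output, and so on), which is exactly the gap a full interpreter such as Rhino would close, at the cost of the weight mentioned above.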


Have you evaluated Flash as well? Is it possible to parse it?


There was a contribution a long time ago, from Stefan Groschupf, of a Flash text extractor. This should be brought up to date - if you have some spare cycles we could use some help... :-)

I would have some spare cycles from the end of July until the end of August, but I would need a short explanation of where and how to integrate the Flash text extractor. Furthermore, is there any document explaining the Nutch design approach? I have never looked at the Nutch sources, and the design is very much tuned for performance, which makes it better to use but no easier to understand :-)

Cheers
philipp
