Re: [nutch 0.5] frames

Philipp Suter Fri, 08 Jul 2005 03:29:01 -0700

Andrzej Bialecki wrote:

Have you ever thought about integrating a JavaScript interpreter into nutch? This could be another big step towards a wider range of crawlable websites. If you need any help on this I would be very much interested in supporting anybody (timewise) implementing such a functionality.
There is a simple JavaScript link extractor in the 0.7-dev version. I've been tinkering with a full JavaScript parser (Rhino), and found it (most importantly) too heavyweight, and also quite incompatible with the way nutch works.

I will try the simple extractor and see how many of the links it covers. I thought about Rhino because it could execute all the possible ways of integrating JavaScript into a DHTML site. But probably most of the cases can be covered by a much smaller subset of JavaScript.
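
To make the "smaller subset" idea concrete, here is a rough sketch of a heuristic extractor that never executes any script and only pulls URL-looking string literals out of it. This is not the 0.7-dev extractor (I have not read it yet); the function name, the regular expressions and the list of page extensions are all my own assumptions, and it is in Python only to keep it short.

```python
import re
from urllib.parse import urljoin

# Quoted string literals, single or double quotes.
# Assumption: escaped quotes inside a literal are not handled.
_STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")

# Assumption: a literal "looks like a link" if it is an absolute http(s) URL
# or a relative path ending in a common page extension.
_LOOKS_LIKE_LINK = re.compile(
    r"^(https?://\S+|[\w./-]+\.(html?|php|jsp|asp)(\?\S*)?)$",
    re.IGNORECASE,
)


def extract_js_links(script, base_url):
    """Return absolute URLs guessed from string literals in `script`.

    Hypothetical helper: nothing is executed, so URLs that are built
    by concatenation at runtime are missed.
    """
    links = []
    for match in _STRING_LITERAL.finditer(script):
        candidate = match.group(2).strip()
        if _LOOKS_LIKE_LINK.match(candidate):
            absolute = urljoin(base_url, candidate)
            if absolute not in links:  # keep first-seen order, drop repeats
                links.append(absolute)
    return links
```

Something like this would catch the usual window.location = 'page2.html' and window.open("http://...") cases, but not links that are assembled piece by piece at runtime, which is where an interpreter like Rhino would still be needed.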

Have you evaluated Flash as well? Is it possible to parse it?
There was a contribution a long time ago, from Stefan Groschupf, of a Flash text extractor. This should be brought up to date - if you have some spare cycles we could use some help... :-)

I would have some spare cycles from the end of July until the end of August, but I would need a short explanation of where and how to integrate the Flash text extractor. Furthermore, is there any document whatsoever explaining the nutch design approach? I have never had a look at the nutch sources, and the design is very much tuned for performance, which does not make it easier to understand but does make it better to use :-)


Cheers
philipp
