[Nutch-general] Re: [nutch 0.5] frames

Philipp Suter Fri, 08 Jul 2005 03:31:01 -0700

Andrzej Bialecki wrote:

Have you ever thought about integrating a JavaScript interpreter into Nutch? This could be another big step towards a wider range of crawlable websites. If you need any help on this, I would be very much interested in supporting (timewise) anybody implementing such functionality.
There is a simple JavaScript link extractor in the 0.7-dev version. I've been tinkering with a full JavaScript parser (Rhino), and found it (most importantly) too heavyweight, and also quite incompatible with the way Nutch works.

I will try the simple extractor and see how many of the links it covers. I thought about Rhino because it could execute all the possible ways of integrating JavaScript into a DHTML site. But probably most of the cases can be covered by a much smaller subset of JavaScript.
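
To make the "smaller subset" idea concrete, here is a rough sketch of a heuristic extractor that never executes the script and only scans it for quoted strings that look like URLs. This is not the extractor from 0.7-dev; the class name, the pattern and the list of file extensions are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: a heuristic link extractor for script text.
class SimpleJsLinkExtractor {

    // Matches a single- or double-quoted string literal that looks like a URL:
    // either absolute (http:// or https://) or a path ending in a common page
    // extension, optionally followed by a query string. The extension list is
    // an arbitrary assumption.
    private static final Pattern URL_LITERAL = Pattern.compile(
        "([\"'])((?:https?://[^\"'\\s]+)"
            + "|(?:[^\"'\\s]+\\.(?:html?|php|jsp|asp))(?:\\?[^\"'\\s]*)?)\\1",
        Pattern.CASE_INSENSITIVE);

    // Returns the URL-like string literals found in the script, in order of
    // first appearance and without duplicates.
    static List<String> extract(String script) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_LITERAL.matcher(script);
        while (m.find()) {
            String candidate = m.group(2); // literal content without the quotes
            if (!links.contains(candidate)) {
                links.add(candidate);
            }
        }
        return links;
    }
}
```

Calling SimpleJsLinkExtractor.extract(...) on a script that contains window.location = 'http://example.com/a.html' and window.open("page2.html") would pick up both targets and skip ordinary strings. URLs that are assembled at runtime (string concatenation, document.write and so on) are invisible to this kind of scan, and that is the case where a real interpreter such as Rhino would be needed.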

Have you evaluated Flash as well? Is it possible to parse it?
There was a contribution a long time ago, from Stefan Groschupf, of a Flash text extractor. This should be brought up to date - if you have some spare cycles we could use some help... :-)

I would have some spare cycles from the end of July until the end of August, but I would need a short explanation of where and how to integrate the Flash text extractor. Furthermore, is there any document whatsoever explaining the Nutch design approach? I have never had a look at the Nutch sources, and the design is very much tuned for performance, which does not make it easier to understand but better to use :-)


Cheers
philipp



