Andrzej Bialecki wrote:
Have you ever thought about integrating a JavaScript interpreter into
Nutch? This could be another big step towards a wider range of
crawlable websites. If you need any help with this, I would be very
interested in supporting (time-wise) anybody implementing such
functionality.
There is a simple JavaScript link extractor in the 0.7-dev version.
I've been tinkering with a full JavaScript parser (Rhino), and found
it (most importantly) too heavyweight, and also quite incompatible
with the way Nutch works.
I will try the simple extractor and see how many of the links it covers.
I thought about Rhino because it could execute all the possible ways of
integrating JavaScript into a DHTML site. But most cases can probably
be covered by a much smaller subset of JavaScript.
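
To illustrate the "smaller subset" idea, here is a minimal sketch of a
heuristic extractor that never executes the script: it scans the source
for quoted string literals and keeps those that look like URLs. This is
not the extractor shipped with Nutch; the function name
extract_js_links, both regular expressions, and the list of page
extensions are assumptions made up for this example.

```python
import re
from urllib.parse import urljoin

# Quoted string literals, single or double quoted.
# Assumption: escaped quotes inside literals are not handled.
STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")

# Heuristic (an assumption for this sketch): a literal counts as a link
# if it is an absolute http(s) URL, or a path ending in a common page
# extension, optionally followed by a query string or fragment.
LINK_LIKE = re.compile(
    r"^(https?://\S+|[\w./~-]+\.(html?|php|jsp|asp)([?#]\S*)?)$",
    re.IGNORECASE,
)


def extract_js_links(script, base_url):
    """Return link-like string literals in script, resolved against base_url."""
    links = []
    for match in STRING_LITERAL.finditer(script):
        candidate = match.group(2).strip()
        if LINK_LIKE.match(candidate):
            absolute = urljoin(base_url, candidate)
            if absolute not in links:  # keep first occurrence only
                links.append(absolute)
    return links
```

Such a scan will miss URLs that are assembled at run time (for example
by string concatenation), which is exactly the case where a full
interpreter such as Rhino would be needed.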
Have you evaluated Flash as well? Is it possible to parse it?
There was a contribution a long time ago, from Stefan Groschupf, of a
Flash text extractor. This should be brought up to date - if you have
some spare cycles we could use some help... :-)
I would have some spare cycles from the end of July until the end of
August, but I would need a short explanation of where and how to
integrate the Flash text extractor. Furthermore, is there any document
explaining the Nutch design approach? I have never looked at the Nutch
sources, and the design is very much tuned for performance, which does
not make it easier to understand, but better to use :-)
Cheers
philipp
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general