Andrzej Bialecki wrote:
Have you ever thought about integrating a JavaScript interpreter into
Nutch? This could be another big step towards a wider range of
crawlable websites. If you need any help with this, I would be very
interested in supporting (time-wise) anybody implementing such
functionality.
There is a simple JavaScript link extractor in the 0.7-dev version.
I've been tinkering with a full JavaScript parser (Rhino), and found
it (most importantly) too heavyweight, and also quite incompatible
with the way Nutch works.
I will try the simple extractor and see how many of the links it covers.
I thought about Rhino because it could execute all the possible ways of
integrating JavaScript into a DHTML site. But most cases can probably
be covered by a much smaller subset of JavaScript.
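
To make the "smaller subset" idea concrete, here is a rough sketch of
what I mean. It is in Python for brevity and is NOT the extractor from
0.7-dev (I have not looked at that code yet). It assumes that most links
in scripts appear as plain string literals, so it never executes any
JavaScript: it just scans for quoted strings and keeps the ones that
look like URLs. The function name and the URL heuristic are my own
invention.

```python
import re
from urllib.parse import urljoin

# Single- or double-quoted string literals, allowing backslash escapes.
_STRING_LITERAL = re.compile(r"""(["'])((?:\\.|(?!\1).)*?)\1""")

# Assumed heuristic: a literal is "URL-like" if it starts with a scheme,
# "/", "./" or "../", or if it ends in a common page extension.
_URL_LIKE = re.compile(
    r"""^(?:https?://|/|\.{1,2}/)\S+$"""
    r"""|^[\w\-./]+\.(?:html?|php|jsp|asp)(?:[?#]\S*)?$""",
    re.IGNORECASE,
)


def extract_js_links(script, base_url):
    """Return absolute URLs found in string literals of `script`.

    No JavaScript is executed, so URLs built by concatenation or
    computed at runtime are missed. That is the trade-off against a
    full interpreter such as Rhino.
    """
    links = []
    for match in _STRING_LITERAL.finditer(script):
        candidate = match.group(2)
        if _URL_LIKE.match(candidate):
            absolute = urljoin(base_url, candidate)
            if absolute not in links:  # keep first occurrence only
                links.append(absolute)
    return links
```

Something like extract_js_links(script_text, page_url) would then be
called once per script block or event-handler attribute. Whether this
covers enough real sites is exactly what I would want to measure.
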
Have you evaluated Flash as well? Is it possible to parse it?
There was a contribution a long time ago, from Stefan Groschupf, of a
Flash text extractor. This should be brought up to date - if you have
some spare cycles we could use some help... :-)
I would have some spare cycles from the end of July until the end of
August, but I would need a short explanation of where and how to
integrate the Flash text extractor. Furthermore, is there any document
explaining the Nutch design approach? I have never looked at the Nutch
sources, and the design is very much tuned for performance, which makes
it better to use but not easier to understand :-)
Cheers
philipp