Philipp Suter wrote:
Andrzej Bialecki wrote:
Philipp Suter wrote:
Does anybody know how to crawl frames? Or how to extend nutch to be
able to crawl frames? We are using the API.
The development version (available from SVN) should handle frames just
fine, i.e. it should follow the src=... attributes in frames in order
to retrieve the frame contents. Please download the nightly snapshot
and try it out.
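
For readers who want to see what "following the src=... attributes" amounts to, here is a minimal sketch in Python. It is not the Nutch implementation (Nutch is written in Java and its parser is not shown in this thread); the class name FrameSrcExtractor and the function extract_frame_urls are made up for illustration, and including <iframe> alongside <frame> is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcExtractor(HTMLParser):
    """Collects the src targets of <frame> and <iframe> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frame_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative frame targets against the page URL.
                self.frame_urls.append(urljoin(self.base_url, src))


def extract_frame_urls(html, base_url):
    """Return absolute URLs of all frame contents referenced by the page."""
    parser = FrameSrcExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.frame_urls
```

A crawler would then add each returned URL to its fetch list, the same way it treats ordinary outlinks.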
When do you think it will be released officially? We have some
mission-critical stuff running with nutch, so I don't know whether the
nightly snapshot will work for us, but I'll try it out.
The release should be soon. I hoped to integrate the adaptive fetch
interval patches, but there are too many issues with them. In the next
two to three weeks we'll review the outstanding bugs, and then I think
we should roll out a new release.
Have you ever thought about integrating a JavaScript interpreter into
nutch? This could be another big step towards a wider range of
crawlable websites. If you need any help with this, I would be very
interested in supporting (timewise) anybody implementing such
functionality.
There is a simple JavaScript link extractor in the 0.7-dev version. I've
been tinkering with a full JavaScript parser (Rhino), and found it (most
importantly) too heavyweight, and also quite incompatible with the way
nutch works.
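
To illustrate the difference between a "simple link extractor" and a full interpreter, here is a rough heuristic sketch in Python. It is not the extractor shipped in 0.7-dev, whose code is not shown in this thread; the function name extract_js_links and both regular expressions are assumptions chosen for the example. The idea is to scan the script text for quoted string literals and keep those that look like URLs, without executing any JavaScript.

```python
import re
from urllib.parse import urljoin

# Quoted string literals in single or double quotes (escaped quotes are not handled).
STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")

# Heuristic (an assumption): a literal is a link if it is an absolute http(s) URL
# or a path ending in a common page extension, optionally with a query string.
URL_LIKE = re.compile(
    r"^(https?://\S+|[\w./-]+\.(html?|php|jsp|asp)(\?\S*)?)$",
    re.IGNORECASE,
)


def extract_js_links(script, base_url):
    """Return absolute URLs for string literals in the script that look like links."""
    links = []
    for match in STRING_LITERAL.finditer(script):
        candidate = match.group(2)
        if URL_LIKE.match(candidate):
            links.append(urljoin(base_url, candidate))
    return links
```

This misses URLs that are assembled at run time (string concatenation, variables), which is exactly the case that would need a real interpreter such as Rhino.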
Have you evaluated Flash as well? Is it possible to parse it?
There was a contribution a long time ago, from Stefan Groschupf, of a
Flash text extractor. This should be brought up to date - if you have
some spare cycles we could use some help... :-)
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com