Philipp Suter wrote:
Andrzej Bialecki wrote:

Philipp Suter wrote:

does anybody know how to crawl frames? Or how to extend nutch to be able to crawl frames? We are using the api.



The development version (available from SVN) should handle frames just fine, i.e. it should follow the src=... attributes of frames in order to retrieve the frame contents. Please download the nightly snapshot and try it out.
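
To illustrate what following the src=... attributes of frames means, here is a minimal sketch in Java. It is not the Nutch parser code: the class name FrameLinks and the method extract are hypothetical, and it assumes quoted src values and a simple regular expression instead of a real HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects the targets of
// <frame> and <iframe> tags so that a crawler could fetch them next.
final class FrameLinks {
    // Matches <frame ...> or <iframe ...> (but not <frameset ...>) and
    // captures a single- or double-quoted src value in group 2.
    private static final Pattern FRAME_SRC = Pattern.compile(
        "<i?frame\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE);

    // Returns the frame targets found in html, resolved against baseUrl.
    static List<String> extract(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> targets = new ArrayList<>();
        Matcher m = FRAME_SRC.matcher(html);
        while (m.find()) {
            // Relative targets such as "menu.html" become absolute URLs.
            targets.add(base.resolve(m.group(2).trim()).toString());
        }
        return targets;
    }
}
```

Calling FrameLinks.extract with the URL of the frameset page and its HTML gives the list of frame URLs to fetch. Unquoted src values and values containing characters that are illegal in a URI are not handled in this sketch.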


When do you think it will be released officially? We have some mission-critical stuff running with nutch, so I don't know whether the nightly snapshot will work for us, but I'll try it out.

The release should be soon. I had hoped to integrate the adaptive fetch interval patches, but there are too many issues with them. In the next two to three weeks we'll review the outstanding bugs, and I think we should roll out a new release.


Have you ever thought about integrating a JavaScript interpreter into nutch? This could be another big step towards a wider range of crawlable websites. If you need any help with this, I would be very interested in supporting (timewise) anybody implementing such functionality.

There is a simple JavaScript link extractor in the 0.7-dev version. I've been tinkering with a full JavaScript parser (Rhino), and found it (most importantly) too heavyweight, and also quite incompatible with the way nutch works.
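
To show the difference between a lightweight link extractor and a full interpreter, here is a rough sketch in Java of the heuristic approach: scan the script text for string literals and keep the ones that look like links, without executing anything. This is an assumption about how such an extractor can work, not the code in 0.7-dev; the class name JsLinkSketch, the method extract, and the list of file extensions are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical heuristic extractor, not the Nutch implementation.
final class JsLinkSketch {
    // A string literal on one line: a quote, content without quotes,
    // then the same quote again. Content is captured in group 2.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("([\"'])([^\"'\\r\\n]*)\\1");

    // Assumed notion of "looks like a link": an absolute http(s) URL, or
    // a path ending in a common page extension with an optional query.
    private static final Pattern LOOKS_LIKE_LINK = Pattern.compile(
        "https?://\\S+|[\\w./-]+\\.(html?|php|jsp|asp)(\\?\\S*)?",
        Pattern.CASE_INSENSITIVE);

    // Returns link-like string literals in the order they appear.
    static List<String> extract(String script) {
        List<String> links = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(script);
        while (m.find()) {
            String candidate = m.group(2);
            // matches() requires the whole literal to look like a link.
            if (LOOKS_LIKE_LINK.matcher(candidate).matches()) {
                links.add(candidate);
            }
        }
        return links;
    }
}
```

A scan like this is cheap and needs no script execution, but it misses URLs that are built by string concatenation at run time, which is the case a real interpreter would cover.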

Have you evaluated Flash as well? Is it possible to parse it?

There was a contribution a long time ago, from Stefan Groschupf, of a Flash text extractor. This should be brought up to date - if you have some spare cycles we could use some help... :-)

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
