Re: [nutch 0.5] frames

Andrzej Bialecki Fri, 08 Jul 2005 02:14:11 -0700

Philipp Suter wrote:

> Andrzej Bialecki wrote:
>> Philipp Suter wrote:
>>> does anybody know how to crawl frames? Or how to extend nutch to be able to crawl frames? We are using the api.
>> The development version (available from SVN) should handle frames just fine, i.e. it should follow the src=... attribute in frames in order to retrieve the frame contents. Please download the nightly snapshot and try it out.
> When do you think it will be released officially? We have some mission-critical stuff running with nutch, so I don't know if the nightly snapshot will work for us, but I'll try it out.
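
For illustration only, here is a minimal sketch of what "following the src=... attribute in frames" amounts to: collect the src of every frame tag and resolve it against the page URL so it can be fetched like any other link. This is not nutch's parser code; the names FrameSrcCollector and frame_links, the sample page, and the example.com URLs are all made up for the example, and handling iframe as well is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcCollector(HTMLParser):
    """Hypothetical helper: collects absolute URLs from <frame>/<iframe> src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in ("frame", "iframe"):
            for name, value in attrs:
                if name == "src" and value:
                    # Resolve relative frame URLs against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def frame_links(html, base_url):
    """Return the frame URLs found in html, in document order."""
    collector = FrameSrcCollector(base_url)
    collector.feed(html)
    return collector.links


page = """<html><frameset cols="20%,80%">
<frame src="menu.html"><frame src="/content/main.html">
</frameset></html>"""

print(frame_links(page, "http://example.com/site/index.html"))
```

A crawler would then queue each returned URL for fetching, just as it does for ordinary anchor links.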

The release should be soon. I hoped to integrate the adaptive fetch interval patches, but there are too many issues with them. In the next two to three weeks we'll review the outstanding bugs, and I think we should roll out a new release.

> Have you ever thought about integrating a javascript interpreter into nutch? This could be another big step towards a wider range of crawlable websites. If you need any help with this, I would be very much interested in supporting (timewise) anybody implementing such functionality.

There is a simple JavaScript link extractor in the 0.7-dev version. I've been tinkering with a full JavaScript parser (Rhino), and found it (most importantly) too heavyweight, and also quite incompatible with the way nutch works.
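
To make the trade-off concrete, here is a rough sketch of the lightweight approach: scan the script text for quoted string literals and keep the ones that look like URLs, without executing any JavaScript. This is an assumption about how such an extractor can work, not the code in 0.7-dev; the function js_links, both regular expressions, and the list of file extensions are invented for the example.

```python
import re
from urllib.parse import urljoin

# Quoted string literals: '...' or "..." with no embedded quotes or newlines.
STRING_LITERAL = re.compile(r"""(["'])([^"'\n]+)\1""")

# Heuristic (an assumption): a literal looks like a link if it is an absolute
# http(s) URL or a path ending in a common page extension.
LOOKS_LIKE_URL = re.compile(
    r"^(https?://\S+|[\w./-]+\.(html?|php|jsp))$", re.IGNORECASE
)


def js_links(script, base_url):
    """Return URL-like string literals from script, resolved against base_url."""
    links = []
    for _quote, literal in STRING_LITERAL.findall(script):
        if LOOKS_LIKE_URL.match(literal):
            links.append(urljoin(base_url, literal))
    return links


script = """function go() {
  window.location = "news/today.html";
  var title = "hello";
  window.open('http://example.org/a');
}"""

print(js_links(script, "http://example.com/"))
```

A scan like this misses any URL that is assembled at run time (string concatenation, computed paths), which is the gap a full interpreter would close, at the cost of being much heavier.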

> Have you evaluated Flash as well? Is it possible to parse it?

There was a contribution a long time ago, from Stefan Groschupf, of a Flash text extractor. This should be brought up to date - if you have some spare cycles we could use some help... :-)


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
