[ http://issues.apache.org/jira/browse/FOR-951?page=comments#action_12461308 ]

David Crossley commented on FOR-951:
------------------------------------

The Cocoon CLI crawler has many abilities. Does Web-Harvest cover them?

http://cocoon.apache.org/2.1/userdocs/offline/
http://wiki.apache.org/cocoon/CommandLine

Here are some of the important abilities (there are probably others too):

* Gathers links from each crawled page and adds them to the linkmap if not already seen.
* Gathers links from generated pages, so new navigation menu links are also crawled.
* Gathers links from other pages (not only html), e.g. css.
* Maintains a list of "already seen" entries so that it doesn't crawl or generate them more than once.
* Enables URIs to be excluded (declared via URI patterns in cli.xconf).
* Enables extra URIs to be included.
* Defines special handling for groups of URIs (e.g. where to write the generated URI and under what name) and how the generated URI should be treated (i.e. append|replace|insert).
* Checks the mime-type for the generated page and adjusts filename and link extensions to match the mime-type (e.g. text/html->.html).
* Creates a checksum file to record the generated pages.

> Use Web-Harvest as the Forrest2 Crawler
> ---------------------------------------
>
>                 Key: FOR-951
>                 URL: http://issues.apache.org/jira/browse/FOR-951
>             Project: Forrest
>          Issue Type: Improvement
>          Components: Forrest2
>            Reporter: Ross Gardler
>
> One of the important parts of Cocoon that are actually needed in Forrest is
> the crawler. I've looked at using the Cocoon crawler in isolation, but it
> looks like too much work extracting it. So, I looked for alternatives...
> "Web-Harvest is Open Source Web Data Extraction tool written in Java. It
> offers a way to collect desired Web pages and extract useful data from them.
> In order to do that, it leverages well established techniques and
> technologies for text/xml manipulation such as XSLT, XQuery and Regular
> Expressions."
> [http://web-harvest.sourceforge.net/index.php]
> Web-Harvest can perform two very useful functions in the core of Forrest2:
> 1 - as a Forrest 2 content object crawler, in this case the data extracted is
> the complete generated page
> 2 - as a customisable reader that extracts data from external HTML pages for
> us

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
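
For comparison purposes, a few of the crawler abilities listed in the comment (the "already seen" list, URI exclusion patterns, link gathering from non-html resources, and the mime-type to extension adjustment) can be sketched roughly as below. This is only an illustrative Python sketch, not the Cocoon CLI or Web-Harvest code: the in-memory `SITE` dictionary, the `MIME_EXT` table, and all function names are hypothetical, and the glob-style exclude patterns merely stand in for the URI patterns declared in cli.xconf.

```python
import posixpath
import re
from collections import deque
from fnmatch import fnmatchcase

# Hypothetical in-memory "site": URI -> (mime type, body).
# It stands in for whatever generates the pages.
SITE = {
    "index.html": ("text/html", '<a href="docs/faq">FAQ</a> <link href="skin/page.css">'),
    "docs/faq": ("text/html", '<a href="../index.html">Home</a> <a href="../api/Foo.html">API</a>'),
    "skin/page.css": ("text/css", 'body { background: url("../images/bg.png"); }'),
    "images/bg.png": ("image/png", ""),
    "api/Foo.html": ("text/html", ""),
}

# Assumed mime-type -> extension table (illustrative, not exhaustive).
MIME_EXT = {"text/html": ".html", "text/css": ".css", "image/png": ".png"}


def extract_links(uri, body):
    """Gather href="..." and url("...") targets, resolved relative to uri."""
    base = posixpath.dirname(uri)
    links = re.findall(r'(?:href=|url\()"([^"]+)"', body)
    return [posixpath.normpath(posixpath.join(base, link)) for link in links]


def is_excluded(uri, excludes):
    """True if the URI matches any glob-style exclude pattern."""
    return any(fnmatchcase(uri, pattern) for pattern in excludes)


def target_name(uri, mime):
    """Append the extension implied by the mime-type if the URI lacks it."""
    ext = MIME_EXT.get(mime)
    if ext and not uri.endswith(ext):
        return uri + ext
    return uri


def crawl(start_uris, site, excludes=()):
    """Breadth-first crawl that visits each URI at most once."""
    seen = set(start_uris)      # the "already seen" list
    queue = deque(start_uris)
    written = {}                # crawled URI -> output filename
    while queue:
        uri = queue.popleft()
        if uri not in site:     # broken link: nothing to generate
            continue
        mime, body = site[uri]
        written[uri] = target_name(uri, mime)
        for link in extract_links(uri, body):
            if link in seen or is_excluded(link, excludes):
                continue
            seen.add(link)
            queue.append(link)
    return written


result = crawl(["index.html"], SITE, excludes=["api/*"])
```

With the sample data, `docs/faq` is written as `docs/faq.html` because its mime-type is text/html, `images/bg.png` is reached only through the css file, and `api/Foo.html` is skipped by the exclude pattern. Any replacement crawler would need equivalents of these behaviours, plus the URI group handling and checksum file that the sketch leaves out.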
