Use Web-Harvest as the Forrest2 Crawler
---------------------------------------
Key: FOR-951
URL: http://issues.apache.org/jira/browse/FOR-951
Project: Forrest
Issue Type: Improvement
Components: Forrest2
Reporter: Ross Gardler
One of the important parts of Cocoon that is actually needed in Forrest is the
crawler. I've looked at using the Cocoon crawler in isolation, but extracting
it looks like too much work. So, I looked for alternatives...
"Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers
a way to collect desired Web pages and extract useful data from them. In order
to do that, it leverages well established techniques and technologies for
text/xml manipulation such as XSLT, XQuery and Regular Expressions."
[http://web-harvest.sourceforge.net/index.php]
Web-Harvest can perform two very useful functions in the core of Forrest2:
1 - as a Forrest2 content object crawler; in this case the data extracted is
the complete generated page
2 - as a customisable reader that extracts data from external HTML pages for us
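
To make the first use case concrete, here is a rough sketch of the
link-following loop a content object crawler performs. It is plain JDK code,
not Web-Harvest's API; the class name CrawlSketch, the fetcher function and
the in-memory pages are all hypothetical stand-ins for whatever Forrest2 would
really use to generate a page. In practice Web-Harvest (or a proper HTML
parser) would replace the naive regular expression used here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not part of Forrest2 or Web-Harvest.
class CrawlSketch {
    // Naive matcher for double-quoted href attributes.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the link targets found in the page, with any #fragment removed.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String target = m.group(1);
            int hash = target.indexOf('#');
            if (hash >= 0) {
                target = target.substring(0, hash);
            }
            if (!target.isEmpty()) {
                links.add(target);
            }
        }
        return links;
    }

    // Breadth-first crawl from 'start'. The fetcher stands in for page
    // generation and returns null when a page cannot be produced.
    // Returns the paths that were generated, in crawl order.
    public static List<String> crawl(String start, Function<String, String> fetcher) {
        List<String> generated = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String path = queue.remove();
            String html = fetcher.apply(path);
            if (html == null) {
                continue; // broken link: nothing to extract
            }
            generated.add(path);
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // true only the first time a link is seen
                    queue.add(link);
                }
            }
        }
        return generated;
    }

    public static void main(String[] args) {
        // Tiny in-memory "site" so the sketch runs without a network.
        Map<String, String> pages = new HashMap<>();
        pages.put("index.html", "<a href=\"a.html\">A</a> <a href=\"b.html#top\">B</a>");
        pages.put("a.html", "<a href=\"index.html\">home</a>");
        pages.put("b.html", "<a href=\"missing.html\">broken</a>");
        System.out.println(crawl("index.html", pages::get));
    }
}
```

Passing the fetcher in as a function keeps the crawl loop independent of how
a page is obtained, so the same loop could in principle sit on top of
Web-Harvest for external HTML pages or on top of Forrest2's own page
generation.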