Use Web-Harvest as the Forrest2 Crawler
---------------------------------------

                 Key: FOR-951
                 URL: http://issues.apache.org/jira/browse/FOR-951
             Project: Forrest
          Issue Type: Improvement
          Components: Forrest2
            Reporter: Ross Gardler


One of the important parts of Cocoon that is actually needed in Forrest is the 
crawler. I've looked at using the Cocoon crawler in isolation, but extracting 
it looks like too much work. So I looked for alternatives...

"Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers 
a way to collect desired Web pages and extract useful data from them. In order 
to do that, it leverages well established techniques and technologies for 
text/xml manipulation such as XSLT, XQuery and Regular Expressions." 
[http://web-harvest.sourceforge.net/index.php]

Web-Harvest can perform two very useful functions in the core of Forrest2:

1 - as a Forrest2 content object crawler; in this case the data extracted is 
the complete generated page

2 - as a customisable reader that extracts data from external HTML pages for us
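
To make the crawler role in point 1 concrete, here is a minimal sketch of 
what crawling means in general: start from one page, collect its links, and 
visit each linked page once. It does not use the Web-Harvest API or any 
Forrest2 class. The names CrawlSketch, extractLinks and crawl, the in-memory 
"site" map, and the regex-based link matching are all assumptions made for 
illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of Web-Harvest or Forrest2.
public class CrawlSketch {

    // Naive href matcher; a real crawler would use an HTML parser instead.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    /** Returns the link targets found in one page, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /**
     * Breadth-first crawl from a start path. The fetch function stands in
     * for whatever generates or downloads a page (an assumption here); it
     * returns null for paths it cannot resolve.
     */
    static List<String> crawl(String start, Function<String, String> fetch) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        List<String> visited = new ArrayList<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String path = queue.remove();
            String page = fetch.apply(path);
            if (page == null) {
                continue; // broken link: skip it, keep crawling
            }
            visited.add(path);
            for (String link : extractLinks(page)) {
                if (seen.add(link)) { // add() is false if already queued
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Tiny in-memory site standing in for generated pages.
        Map<String, String> site = Map.of(
            "index.html", "<a href=\"a.html\">A</a> <a href=\"b.html\">B</a>",
            "a.html", "<a href=\"index.html\">home</a>",
            "b.html", "<a href=\"missing.html\">gone</a>");
        System.out.println(crawl("index.html", site::get));
    }
}
```

In a real setup the fetch function would be replaced by whatever produces 
the complete generated page for a given path, which is the data point 1 says 
the crawler should extract.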


