[jira] Commented: (FOR-951) Use Web-Harvest as the Forrest2 Crawler

David Crossley (JIRA) Thu, 28 Dec 2006 21:44:49 -0800

[ http://issues.apache.org/jira/browse/FOR-951?page=comments#action_12461308 ]

David Crossley commented on FOR-951:
------------------------------------


The Cocoon CLI crawler has many abilities. Does Web-Harvest cover them?

http://cocoon.apache.org/2.1/userdocs/offline/
http://wiki.apache.org/cocoon/CommandLine

Here are some of the important abilities (there are probably others too):

* Gathers links from each crawled page and adds them to the linkmap if not 
already seen.

* Gathers links from generated pages. So new navigation menu links are also 
crawled.

* Gathers links from other page types too (not only HTML), e.g. CSS.

* Maintains a list of "already seen" entries so that it doesn't crawl or 
generate them more than once.

* Enables URIs to be excluded (declared via URI patterns in cli.xconf).

* Enables extra URIs to be included.

* Defines special handling for groups of URIs (e.g. where the generated URI is 
written and what it is named) and how the generated URI should be treated 
(i.e. append|replace|insert).

* Checks the mime-type of the generated page and adjusts filename and link 
extensions to match the mime-type (e.g. text/html->.html).

* Creates a checksum file to record the generated pages.
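
To make the list above concrete, here is a rough sketch of the core loop in 
Python. It is not the Cocoon CLI code and it does not use the Web-Harvest API. 
Every name in it (crawl, gather_links, output_name, EXTENSIONS, the fetch 
callback) is invented for illustration. It assumes that URIs are opaque keys 
(no relative-link resolution), that excludes are shell-style patterns rather 
than the cli.xconf syntax, and it leaves out the append|replace|insert 
handling entirely.

```python
import fnmatch
import hashlib
import posixpath
import re
from collections import deque
from html.parser import HTMLParser

# Hypothetical mapping; a real crawler would cover many more mime-types.
EXTENSIONS = {"text/html": ".html", "text/css": ".css"}


class LinkCollector(HTMLParser):
    """Collects href/src attribute values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def gather_links(mime_type, body):
    """Return the links found in a page (HTML and CSS only)."""
    if mime_type == "text/html":
        collector = LinkCollector()
        collector.feed(body)
        return collector.links
    if mime_type == "text/css":
        return re.findall(r"url\(\s*['\"]?([^'\")\s]+)", body)
    return []


def output_name(uri, mime_type):
    """Adjust the filename extension to match the mime-type."""
    root, _ext = posixpath.splitext(uri)
    wanted = EXTENSIONS.get(mime_type)
    return root + wanted if wanted else uri


def crawl(fetch, start_uris, excludes=(), includes=()):
    """Crawl breadth-first; return {output name: checksum}.

    fetch(uri) must return (mime_type, body) or None if the URI is missing.
    """
    queue = deque()
    seen = set()  # the "already seen" list

    def enqueue(uri):
        if uri not in seen:
            seen.add(uri)
            queue.append(uri)

    for uri in [*start_uris, *includes]:
        enqueue(uri)

    checksums = {}
    while queue:
        uri = queue.popleft()
        if any(fnmatch.fnmatch(uri, pattern) for pattern in excludes):
            continue  # excluded by a URI pattern
        page = fetch(uri)
        if page is None:
            continue  # broken link; a real crawler would report it
        mime_type, body = page
        name = output_name(uri, mime_type)
        checksums[name] = hashlib.sha1(body.encode("utf-8")).hexdigest()
        for link in gather_links(mime_type, body):
            enqueue(link)  # includes links found in generated pages
    return checksums
```

With a dict that maps each URI to a (mime_type, body) pair, the dict's get 
method can serve as fetch, for example crawl(site.get, ["index"], 
excludes=["private/*"]). The returned mapping stands in for the checksum file.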


> Use Web-Harvest as the Forrest2 Crawler
> ---------------------------------------
>
>                 Key: FOR-951
>                 URL: http://issues.apache.org/jira/browse/FOR-951
>             Project: Forrest
>          Issue Type: Improvement
>          Components: Forrest2
>            Reporter: Ross Gardler
>
> One of the important parts of Cocoon that is actually needed in Forrest is 
> the crawler. I've looked at using the Cocoon crawler in isolation, but it 
> looks like too much work to extract it. So, I looked for alternatives...
> "Web-Harvest is Open Source Web Data Extraction tool written in Java. It 
> offers a way to collect desired Web pages and extract useful data from them. 
> In order to do that, it leverages well established techniques and 
> technologies for text/xml manipulation such as XSLT, XQuery and Regular 
> Expressions." [http://web-harvest.sourceforge.net/index.php]
> Web-Harvest can perform two very useful functions in the core of Forrest2:
> 1 - as a Forrest2 content object crawler; in this case the data extracted is 
> the complete generated page
> 2 - as a customisable reader that extracts data from external HTML pages for 
> us

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
