Lewis John McGibbney created NUTCH-1599:
-------------------------------------------
Summary: Obtain consensus on new description of Nutch
Key: NUTCH-1599
URL: https://issues.apache.org/jira/browse/NUTCH-1599
Project: Nutch
Issue Type: Improvement
Components: documentation
Reporter: Lewis John McGibbney
As we seem to be sustaining pushes to, and maintenance of, two branches
(touch wood), I think it is about time we agreed on a more accurate
description of what Nutch actually is.
We currently have (taken directly from our site):
{code:xml}
Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a
crawler, a link-graph database and parsing support handled by Apache Tika for
HTML and an array of other document formats.
Nutch can run on a single machine, but gains a lot of its strength from running
in a Hadoop cluster.
The system can be enhanced (e.g. other document formats can be parsed) using a
highly flexible, easily extensible and thoroughly maintained plugin
infrastructure.
{code}
I propose something along the lines of:
{code:xml}
Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, the community now develops and maintains two branches:
* 1.x; description of 1.x here
* 2.x; description of 2.x here
Both branches add web-specifics, such as a crawler, a link-graph database and
parsing support handled by Apache Tika for HTML and an array of other document
formats.
Nutch can run on a single machine, but gains a lot of its strength from running
in a Hadoop cluster.
The system can be enhanced (e.g. other document formats can be parsed) using a
highly flexible, easily extensible and thoroughly maintained plugin
infrastructure.
{code}
Any thoughts?