Lewis John McGibbney created NUTCH-1599:
-------------------------------------------

             Summary: Obtain consensus on new description of Nutch
                 Key: NUTCH-1599
                 URL: https://issues.apache.org/jira/browse/NUTCH-1599
             Project: Nutch
          Issue Type: Improvement
          Components: documentation
            Reporter: Lewis John McGibbney


As we seem to be sustaining pushes and maintenance (touch wood) of two 
branches, I think it is about time we agreed on a more accurate description of 
what Nutch actually is.

We currently have (taken directly from our site):

{code:xml}
Apache Nutch is an open source web-search software project. Stemming from 
Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a 
crawler, a link-graph database and parsing support handled by Apache Tika for 
HTML and an array of other document formats.

Nutch can run on a single machine, but gains a lot of its strength from running 
in a Hadoop cluster.

The system can be enhanced (eg other document formats can be parsed) using a 
highly flexible, easily extensible and thoroughly maintained plugin 
infrastructure.
{code}

I propose something along the lines of:

{code:xml}
Apache Nutch is an open source web-search software project. Stemming from 
Apache Lucene, the community now develops and maintains two branches:

* 1.x; description of 1.x here

* 2.x; description of 2.x here

Both branches add web-specifics, such as a crawler, a link-graph database and 
parsing support handled by Apache Tika for HTML and an array of other document 
formats.

Nutch can run on a single machine, but gains a lot of its strength from running 
in a Hadoop cluster.

The system can be enhanced (e.g. other document formats can be parsed) using a 
highly flexible, easily extensible and thoroughly maintained plugin 
infrastructure.
{code}

Any thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
