[
https://issues.apache.org/jira/browse/NUTCH-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698027#comment-13698027
]
Julien Nioche commented on NUTCH-1599:
--------------------------------------
I usually describe Nutch as a 'web crawler' rather than a 'web-search software
project', as the latter dates back to when we used to handle indexing and
search within Nutch. We now focus purely on crawling, which is a good
thing.
Explaining that there are 2 versions and briefly what the differences are is a
good idea.
I wouldn't give document format parsing as an example of what we gain from
being modular and pluggable. Instead, I would mention that bespoke extractors
(ParsingFilters) can be implemented for specific extraction tasks, or mention
the pluggable indexer.
> Obtain consensus on new description of Nutch
> --------------------------------------------
>
> Key: NUTCH-1599
> URL: https://issues.apache.org/jira/browse/NUTCH-1599
> Project: Nutch
> Issue Type: Improvement
> Components: documentation
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 2.3, 1.8
>
>
> As we seem to be sustaining pushes and maintenance (touch wood) of two
> branches, I think it is about time we agreed on a more accurate description
> of what Nutch actually is.
> We currently have (taken directly from our site)
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from
> Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a
> crawler, a link-graph database and parsing support handled by Apache Tika for
> HTML and and array other document formats.
> Nutch can run on a single machine, but gains a lot of its strength from
> running in a Hadoop cluster
> The system can be enhanced (eg other document formats can be parsed) using a
> highly flexible, easily extensible and thoroughly maintained plugin
> infrastructure.
> {code}
> I suggest/propose something along the lines of
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from
> Apache Lucene, the community now develops and maintains two branches:
> * 1.x; description of 1.x here
> * 2.x; description of 2.x here
> Both branches add web-specifics, such as a crawler, a link-graph database and
> parsing support handled by Apache Tika for HTML and an array of other document
> formats.
> Nutch can run on a single machine, but gains a lot of its strength from
> running in a Hadoop cluster.
> The system can be enhanced (e.g. other document formats can be parsed) using a
> highly flexible, easily extensible and thoroughly maintained plugin
> infrastructure.
> {code}
> Any thoughts?