[Nutch Wiki] Update of "Nutch2Roadmap" by JulienNioche

Apache Wiki Mon, 04 Jul 2011 00:47:27 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "Nutch2Roadmap" page has been changed by JulienNioche:
http://wiki.apache.org/nutch/Nutch2Roadmap?action=diff&rev1=3&rev2=4

  = Nutch2Roadmap =
- 
  Here is a list of the features and architectural changes that will be 
implemented in Nutch 2.0.
  
-   * Storage Abstraction
+  * Storage Abstraction<<BR>>
-     * initially with back end implementations for HBase and HDFS  
+   * initially with back end implementations for HBase and HDFS
-     * extend it to other storages later e.g. MySQL etc...
+   * extend it to other storages later e.g. MySQL etc...
-   * Plugin cleanup : Tika only for parsing document formats (see 
http://wiki.apache.org/nutch/TikaPlugin)
+  * Plugin cleanup : Tika only for parsing document formats (see 
http://wiki.apache.org/nutch/TikaPlugin)
-     * keep only stuff HtmlParseFilters (probably with a different API) so 
that we can post-process the DOM created in Tika from whatever original format.
+   * keep only stuff HtmlParseFilters (probably with a different API) so that 
we can post-process the DOM created in Tika from whatever original format.
+   * Modify code so that parser can generate multiple documents which is what 
1.x does but not 2.0
-   * Externalize functionalities to crawler-commons project 
[http://code.google.com/p/crawler-commons/] 
+  * Externalize functionalities to crawler-commons project 
[http://code.google.com/p/crawler-commons/]
-     * robots handling, url filtering and url normalization, URL state 
management, perhaps deduplication. We should coordinate our efforts, and share 
code freely so that other projects (bixo, heritrix,droids) may contribute to 
this shared pool of functionality, much like Tika does for the common need of 
parsing complex formats.
+   * robots handling, url filtering and url normalization, URL state 
management, perhaps deduplication. We should coordinate our efforts, and share 
code freely so that other projects (bixo, heritrix,droids) may contribute to 
this shared pool of functionality, much like Tika does for the common need of 
parsing complex formats.
-   * Remove index / search and delegate to SOLR
+  * --(Remove index / search and delegate to SOLR )--
-     * we may still keep a thin abstract layer to allow other indexing/search 
backends (ElasticSearch?), but the current mess of indexing/query filters and 
competing indexing frameworks (lucene, fields, solr) should go away. We should 
go directly from DOM to a NutchDocument, and stop there.
+   * we may still keep a thin abstract layer to allow other indexing/search 
backends (ElasticSearch?), but the current mess of indexing/query filters and 
competing indexing frameworks (lucene, fields, solr) should go away. We should 
go directly from DOM to a NutchDocument, and stop there.
-   * Rewrite SOLR deduplication : do everything using the webtable and avoid 
retrieving content from SOLR 
+  * Rewrite SOLR deduplication : do everything using the webtable and avoid 
retrieving content from SOLR
-   * Various new functionalities 
+  * Various new functionalities
-     * e.g. sitemap support, canonical tag, better handling of redirects, 
detecting duplicated sites, detection of spam cliques, tools to manage the 
webgraph, etc.
+   * e.g. sitemap support, canonical tag, better handling of redirects, 
detecting duplicated sites, detection of spam cliques, tools to manage the 
webgraph, etc.
- 
  
  This document is meant to serve as a basis for discussion, feel free to 
contribute to it
