[Nutch Wiki] Update of "Nutch2Roadmap" by JulienNioche

Apache Wiki Wed, 07 Apr 2010 01:36:48 -0700

The "Nutch2Roadmap" page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Nutch2Roadmap?action=diff&rev1=1&rev2=2

--------------------------------------------------

    * Storage Abstraction
      * initially with backend implementations for HBase and HDFS
      * extend it to other storage systems later, e.g. MySQL
-   * Plugin cleanup : Tika only for parsing document formats
+   * Plugin cleanup : Tika only for parsing document formats (see http://wiki.apache.org/nutch/TikaPlugin)
      * keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created by Tika from whatever the original format was.
    * Externalize functionality to the crawler-commons project [http://code.google.com/p/crawler-commons/]
      * robots handling, URL filtering and URL normalization, URL state management, perhaps deduplication. We should coordinate our efforts and share code freely so that other projects (Bixo, Heritrix, Droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.
