Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki <a...@getopt.org> wrote:
> On 2010-04-06 15:43, Julien Nioche wrote:
>> Hi guys,
>>
>> I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
>> based on what is currently referred to as NutchBase. Shall we create a
>> branch for 2.0 in the Nutch SVN repository and have a label accordingly for
>> JIRA so that we can file issues / feature requests on 2.0? Do you think that
>> the current NutchBase could be used as a basis for the 2.0 branch?
>
> I'm not sure what is the status of the nutchbase - it's missed a lot of
> fixes and changes in trunk since it's been last touched ...
>
I know... But I still intend to finish it, I just need to schedule some
time for it. My vote would be to go with nutchbase.

>> Talking about features, what else would we add apart from:
>>
>> * support for HBase : via ORM or not (see
>> NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)
>
> This IMHO is promising, this could open the doors to small-to-medium
> installations that are currently too cumbersome to handle.

Yeah, there is already a simple ORM within nutchbase that is Avro-based
and should be generic enough to also support MySQL, Cassandra and
BerkeleyDB. But any good ORM will be a very good addition.

>> * plugin cleanup : Tika only for parsing - get rid of everything else?
>
> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
> different API) so that we can post-process the DOM created in Tika from
> whatever original format.
>
> Also, the goal of the crawler-commons project is to provide APIs and
> implementations of stuff that is needed for every open source crawler
> project, like: robots handling, url filtering and url normalization, URL
> state management, perhaps deduplication. We should coordinate our
> efforts, and share code freely so that other projects (bixo, heritrix,
> droids) may contribute to this shared pool of functionality, much like
> Tika does for the common need of parsing complex formats.
>
>> * remove index / search and delegate to SOLR
>
> +1 - we may still keep a thin abstract layer to allow other
> indexing/search backends, but the current mess of indexing/query filters
> and competing indexing frameworks (lucene, fields, solr) should go away.
> We should go directly from DOM to a NutchDocument, and stop there.

Agreed. I would like to add support for Katta and other indexing
backends at some point, but NutchDocument should be our canonical
representation. The rest should be up to the indexing backends.
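To make the idea concrete, here is a rough sketch of the kind of backend-neutral store interface such an ORM could expose. All names here (WebPageStore, InMemoryStore) are hypothetical illustrations, not the actual nutchbase API; the in-memory map just stands in for an HBase, MySQL, Cassandra or BerkeleyDB backend.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical backend-neutral store interface; illustrative only,
// not the real nutchbase/ORM API.
interface WebPageStore<T> {
    void put(String url, T page);   // upsert a row keyed by URL
    T get(String url);              // returns null if absent
    void delete(String url);
}

// Toy in-memory implementation standing in for a real storage backend.
class InMemoryStore<T> implements WebPageStore<T> {
    private final Map<String, T> rows = new HashMap<String, T>();
    public void put(String url, T page) { rows.put(url, page); }
    public T get(String url) { return rows.get(url); }
    public void delete(String url) { rows.remove(url); }
}

public class StoreSketch {
    public static void main(String[] args) {
        WebPageStore<String> store = new InMemoryStore<String>();
        store.put("http://example.com/", "parsed-content");
        if (!"parsed-content".equals(store.get("http://example.com/")))
            throw new AssertionError("get after put failed");
        store.delete("http://example.com/");
        if (store.get("http://example.com/") != null)
            throw new AssertionError("delete failed");
        System.out.println("ok");
    }
}
```

The point is that crawl code would program against the interface, and swapping backends would only mean providing another implementation.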
> Regarding search - currently the search API is too low-level, with the
> custom text and query analysis chains. This needlessly introduces the
> (in)famous Nutch Query classes and Nutch query syntax limitations. We
> should get rid of it and simply leave this part of the processing to the
> search backend. Probably we will use the SolrCloud branch that supports
> sharding and global IDF.
>
>> * new functionalities e.g. sitemap support, canonical tag etc...
>
> Plus a better handling of redirects, detecting duplicated sites,
> detection of spam cliques, tools to manage the webgraph, etc.
>
>> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
>> update?
>
> Definitely. :)
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

--
Doğacan Güney