Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki <a...@getopt.org> wrote:
> On 2010-04-06 15:43, Julien Nioche wrote:
>> Hi guys,
>>
>> I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
>> based on what is currently referred to as NutchBase. Shall we create a
>> branch for 2.0 in the Nutch SVN repository and have a label accordingly for
>> JIRA so that we can file issues / feature requests on 2.0? Do you think that
>> the current NutchBase could be used as a basis for the 2.0 branch?
>
> I'm not sure what is the status of the nutchbase - it's missed a lot of
> fixes and changes in trunk since it's been last touched ...
>
I know... But I still intend to finish it, I just need to schedule some
time for it. My vote would be to go with nutchbase.

>> Talking about features, what else would we add apart from:
>>
>> * support for HBase : via ORM or not (see
>> NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)
>
> This IMHO is promising, this could open the doors to small-to-medium
> installations that are currently too cumbersome to handle.

Yeah, there is already a simple ORM within nutchbase that is Avro-based
and should be generic enough to also support MySQL, Cassandra and
BerkeleyDB. But any good ORM will be a very good addition.

>> * plugin cleanup : Tika only for parsing - get rid of everything else?
>
> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
> different API) so that we can post-process the DOM created in Tika from
> whatever original format.
>
> Also, the goal of the crawler-commons project is to provide APIs and
> implementations of stuff that is needed for every open source crawler
> project, like: robots handling, url filtering and url normalization, URL
> state management, perhaps deduplication. We should coordinate our
> efforts, and share code freely so that other projects (bixo, heritrix,
> droids) may contribute to this shared pool of functionality, much like
> Tika does for the common need of parsing complex formats.
>
>> * remove index / search and delegate to SOLR
>
> +1 - we may still keep a thin abstract layer to allow other
> indexing/search backends, but the current mess of indexing/query filters
> and competing indexing frameworks (lucene, fields, solr) should go away.
> We should go directly from DOM to a NutchDocument, and stop there.

Agreed. I would like to add support for Katta and other indexing
backends at some point, but NutchDocument should be our canonical
representation. The rest should be up to the indexing backends.
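To make the idea concrete, here is a rough sketch of the kind of backend-neutral store interface such an ORM could expose. All names here (WebPageStore, InMemoryStore) are hypothetical illustrations, not the actual nutchbase API; the in-memory map just stands in for an HBase, MySQL, Cassandra or BerkeleyDB backend.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical backend-neutral store interface; illustrative only,
// not the real nutchbase/ORM API.
interface WebPageStore<T> {
    void put(String url, T page);   // upsert a row keyed by URL
    T get(String url);              // returns null if absent
    void delete(String url);
}

// Toy in-memory implementation standing in for a real storage backend.
class InMemoryStore<T> implements WebPageStore<T> {
    private final Map<String, T> rows = new HashMap<String, T>();
    public void put(String url, T page) { rows.put(url, page); }
    public T get(String url) { return rows.get(url); }
    public void delete(String url) { rows.remove(url); }
}

public class StoreSketch {
    public static void main(String[] args) {
        WebPageStore<String> store = new InMemoryStore<String>();
        store.put("http://example.com/", "parsed-content");
        if (!"parsed-content".equals(store.get("http://example.com/")))
            throw new AssertionError("get after put failed");
        store.delete("http://example.com/");
        if (store.get("http://example.com/") != null)
            throw new AssertionError("delete failed");
        System.out.println("ok");
    }
}
```

The point is that crawl code would program against the interface, and swapping backends would only mean providing another implementation.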
> Regarding search - currently the search API is too low-level, with the
> custom text and query analysis chains. This needlessly introduces the
> (in)famous Nutch Query classes and Nutch query syntax limitations. We
> should get rid of it and simply leave this part of the processing to the
> search backend. Probably we will use the SolrCloud branch that supports
> sharding and global IDF.
>
>> * new functionalities e.g. sitemap support, canonical tag etc...
>
> Plus a better handling of redirects, detecting duplicated sites,
> detection of spam cliques, tools to manage the webgraph, etc.
>
>> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
>> update?
>
> Definitely. :)
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

--
Doğacan Güney