Forgot to say that at Hadoop it is the convention that big issues, like the ones under discussion, come with a design document, so that a solid design is agreed upon before the work starts. We can apply the same pattern at Nutch.

On 04/07/2010 07:54 PM, Doğacan Güney wrote:
Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki <a...@getopt.org> wrote:
On 2010-04-06 15:43, Julien Nioche wrote:
Hi guys,

I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
based on what is currently referred to as NutchBase. Shall we create a
branch for 2.0 in the Nutch SVN repository and have a label accordingly for
JIRA so that we can file issues / feature requests on 2.0? Do you think that
the current NutchBase could be used as a basis for the 2.0 branch?
I'm not sure what the status of the nutchbase is - it has missed a lot of
fixes and changes in trunk since it was last touched ...

I know... But I still intend to finish it; I just need to schedule
some time for it.

My vote would be to go with nutchbase.

Talking about features, what else would we add apart from :

* support for HBase: via ORM or not (see NUTCH-808
<https://issues.apache.org/jira/browse/NUTCH-808>)
This IMHO is promising - it could open the door to small-to-medium
installations that are currently too cumbersome to handle.

Yeah, there is already a simple ORM within nutchbase that is Avro-based
and should be generic enough to also support MySQL, Cassandra and
Berkeley DB. But any good ORM will be a very good addition.
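To make the idea concrete, here is a minimal sketch of the kind of backend-neutral store interface such an ORM could expose. All names here are illustrative assumptions, not the actual nutchbase API:

```java
// Hypothetical sketch of a backend-neutral key/value store interface,
// along the lines of what an Avro-based ORM could expose.
// These names are illustrative; they are NOT the nutchbase API.
import java.util.HashMap;
import java.util.Map;

interface DataStore<K, V> {
    V get(K key);
    void put(K key, V value);
    void delete(K key);
}

// A trivial in-memory backend, standing in for HBase/MySQL/Cassandra.
class MemStore<K, V> implements DataStore<K, V> {
    private final Map<K, V> map = new HashMap<>();
    public V get(K key) { return map.get(key); }
    public void put(K key, V value) { map.put(key, value); }
    public void delete(K key) { map.remove(key); }
}
```

Crawler code would then be written only against DataStore, so swapping HBase for another backend means adding one implementation class.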

* plugin cleanup: Tika only for parsing - get rid of everything else?
Basically, yes - keep only stuff like HtmlParseFilters (probably with a
different API) so that we can post-process the DOM created in Tika from
whatever original format.
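The "post-process the DOM from Tika" idea can be sketched like this; the filter interface below is a simplified, hypothetical stand-in for whatever the new HtmlParseFilter API ends up looking like:

```java
// Hypothetical sketch: Tika parses raw content into a W3C DOM, and
// pluggable filters post-process that DOM. The interface below is
// illustrative only, not the actual HtmlParseFilter API.
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

interface DomFilter {
    // Inspect the parsed DOM and return extracted data.
    String filter(Document doc);
}

// Example filter: pull the <title> text out of the DOM.
class TitleFilter implements DomFilter {
    public String filter(Document doc) {
        org.w3c.dom.NodeList titles = doc.getElementsByTagName("title");
        return titles.getLength() > 0 ? titles.item(0).getTextContent() : "";
    }
}

// Small helper that parses XHTML into a DOM (stands in for Tika here).
class Dom {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes()));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The point is that format-specific parsing lives entirely in Tika, while Nutch plugins only ever see a DOM.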

Also, the goal of the crawler-commons project is to provide APIs and
implementations of stuff that is needed for every open source crawler
project, like: robots handling, url filtering and url normalization, URL
state management, perhaps deduplication. We should coordinate our
efforts, and share code freely so that other projects (bixo, heritrix,
droids) may contribute to this shared pool of functionality, much like
Tika does for the common need of parsing complex formats.
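As a rough illustration of the URL normalization piece mentioned above, here is a minimal normalizer of the kind such a shared library could provide (lowercase scheme and host, drop default ports, resolve dot segments). This is a sketch built on java.net.URI, not the crawler-commons API; query and fragment handling are omitted for brevity:

```java
// Illustrative URL normalizer: lowercase scheme/host, strip default
// ports, resolve "." and ".." path segments. A sketch only - not the
// crawler-commons API. Query/fragment handling is omitted for brevity.
import java.net.URI;

class SimpleUrlNormalizer {
    static String normalize(String url) {
        try {
            URI u = new URI(url).normalize(); // resolves /a/../b -> /b
            String scheme = u.getScheme().toLowerCase();
            String host = u.getHost().toLowerCase();
            int port = u.getPort();
            boolean defaultPort = (scheme.equals("http") && port == 80)
                               || (scheme.equals("https") && port == 443);
            String portPart = (port == -1 || defaultPort) ? "" : ":" + port;
            String path = u.getPath().isEmpty() ? "/" : u.getPath();
            return scheme + "://" + host + portPart + path;
        } catch (Exception e) {
            throw new IllegalArgumentException(url, e);
        }
    }
}
```

Centralizing this in one shared library means every crawler canonicalizes URLs the same way, which matters for deduplication across projects.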

* remove index / search and delegate to SOLR
+1 - we may still keep a thin abstract layer to allow other
indexing/search backends, but the current mess of indexing/query filters
and competing indexing frameworks (lucene, fields, solr) should go away.
We should go directly from DOM to a NutchDocument, and stop there.

Agreed. I would like to add support for katta and other indexing
backends at some point, but NutchDocument should be our canonical
representation. The rest should be up to indexing backends.
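The "NutchDocument as canonical representation" idea amounts to something like the sketch below: the crawler produces one document type, and each backend (Solr, katta, ...) implements a thin writer interface over it. Class and field names are illustrative assumptions, not the real Nutch classes:

```java
// Sketch of a canonical document plus pluggable indexing backends.
// Names are illustrative; they do not match the real Nutch classes.
import java.util.HashMap;
import java.util.Map;

class NutchDocument {
    final Map<String, String> fields = new HashMap<>();
    void add(String name, String value) { fields.put(name, value); }
}

interface IndexWriter {
    void write(NutchDocument doc);
}

// Stand-in backend that just counts documents, where a real
// implementation would push each document to Solr or katta.
class CountingWriter implements IndexWriter {
    int written = 0;
    public void write(NutchDocument doc) { written++; }
}
```

With this shape, Nutch stops at NutchDocument and all analysis, schema and query concerns move behind the IndexWriter boundary.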

Regarding search - currently the search API is too low-level, with the
custom text and query analysis chains. This needlessly introduces the
(in)famous Nutch Query classes and the Nutch query syntax limitations. We
should get rid of it and simply leave this part of the processing to the
search backend. Probably we will use the SolrCloud branch that supports
sharding and global IDF.

* new functionalities e.g. sitemap support, canonical tag etc...
Plus a better handling of redirects, detecting duplicated sites,
detection of spam cliques, tools to manage the webgraph, etc.

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
update?
Definitely. :)

--
Best regards,
Andrzej Bialecki<><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




