Re: Nutch 2.0 roadmap

MilleBii Wed, 07 Apr 2010 11:19:40 -0700

Just a question:
Will the new HBase implementation allow more sophisticated crawling
strategies than the current score-based one?


To give you a few examples of what I'd like to do:
Define different crawling frequencies for different sets of URLs, say
weekly for some URLs, monthly or longer for others.

Select URLs to re-crawl based on attributes previously extracted. Just
one example: re-crawl URLs that contained a certain keyword (or set of
keywords).

Select URLs that have not yet been crawled, i.e. those at the frontier
of the crawl.
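
To make the three cases above concrete, here is a rough sketch of the kind
of selection I have in mind. It is plain Python, not Nutch or HBase code:
UrlRecord, its fields, and the helper functions are all made-up names for
illustration, and the example URLs and dates are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class UrlRecord:
    """Hypothetical per-URL row; not an actual Nutch/HBase schema."""
    url: str
    last_fetch: Optional[datetime] = None          # None = never crawled
    keywords: set = field(default_factory=set)     # attributes extracted at parse time
    recrawl_every: timedelta = timedelta(days=30)  # per-URL crawl frequency


def due_for_recrawl(rec, now):
    # Case 1: a different crawl frequency per URL (weekly, monthly, ...).
    return rec.last_fetch is not None and now - rec.last_fetch >= rec.recrawl_every


def has_keyword(rec, wanted):
    # Case 2: select on attributes extracted during an earlier crawl.
    return bool(rec.keywords & wanted)


def on_frontier(rec):
    # Case 3: URLs that are known but have not been fetched yet.
    return rec.last_fetch is None


def select(records, predicate):
    # Return the URLs whose record satisfies the given predicate.
    return [r.url for r in records if predicate(r)]


now = datetime(2010, 4, 7)
records = [
    UrlRecord("http://a.example/", last_fetch=datetime(2010, 3, 30),
              recrawl_every=timedelta(days=7)),
    UrlRecord("http://b.example/", last_fetch=datetime(2010, 3, 30),
              keywords={"nutch"}),
    UrlRecord("http://c.example/"),
]

weekly_due = select(records, lambda r: due_for_recrawl(r, now))
with_keyword = select(records, lambda r: has_keyword(r, {"nutch"}))
frontier = select(records, on_frontier)
```

The point is only that each strategy is a predicate over stored per-URL
attributes, so the question is whether the new storage layer would let the
generator apply such predicates instead of (or in addition to) the score.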




2010/4/7, Doğacan Güney <doga...@gmail.com>:
> Hey everyone,
>
> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki <a...@getopt.org> wrote:
>> On 2010-04-06 15:43, Julien Nioche wrote:
>>> Hi guys,
>>>
>>> I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
>>> based on what is currently referred to as NutchBase. Shall we create a
>>> branch for 2.0 in the Nutch SVN repository and have a label accordingly
>>> for
>>> JIRA so that we can file issues / feature requests on 2.0? Do you think
>>> that
>>> the current NutchBase could be used as a basis for the 2.0 branch?
>>
>> I'm not sure what is the status of the nutchbase - it's missed a lot of
>> fixes and changes in trunk since it's been last touched ...
>>
>
> I know... But I still intend to finish it, I just need to schedule
> some time for it.
>
> My vote would be to go with nutchbase.
>
>>>
>>> Talking about features, what else would we add apart from :
>>>
>>> * support for HBase : via ORM or not (see
>>> NUTCH-808<https://issues.apache.org/jira/browse/NUTCH-808>
>>> )
>>
>> This IMHO is promising, this could open the doors to small-to-medium
>> installations that are currently too cumbersome to handle.
>>
>
> Yeah, there is already a simple ORM within nutchbase that is
> avro-based and should
> be generic enough to also support MySQL, cassandra and berkeleydb. But
> any good ORM will
> be a very good addition.
>
>>> * plugin cleanup : Tika only for parsing - get rid of everything else?
>>
>> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
>> different API) so that we can post-process the DOM created in Tika from
>> whatever original format.
>>
>> Also, the goal of the crawler-commons project is to provide APIs and
>> implementations of stuff that is needed for every open source crawler
>> project, like: robots handling, url filtering and url normalization, URL
>> state management, perhaps deduplication. We should coordinate our
>> efforts, and share code freely so that other projects (bixo, heritrix,
>> droids) may contribute to this shared pool of functionality, much like
>> Tika does for the common need of parsing complex formats.
>>
>>> * remove index / search and delegate to SOLR
>>
>> +1 - we may still keep a thin abstract layer to allow other
>> indexing/search backends, but the current mess of indexing/query filters
>> and competing indexing frameworks (lucene, fields, solr) should go away.
>> We should go directly from DOM to a NutchDocument, and stop there.
>>
>
> Agreed. I would like to add support for katta and other indexing
> backends at some point but
> NutchDocument should be our canonical representation. The rest should
> be up to indexing backends.
>
>> Regarding search - currently the search API is too low-level, with the
>> custom text and query analysis chains. This needlessly introduces the
>> (in)famous Nutch Query classes and Nutch query syntax limitations. We
>> should get rid of it and simply leave this part of the processing to the
>> search backend. Probably we will use the SolrCloud branch that supports
>> sharding and global IDF.
>>
>>> * new functionalities e.g. sitemap support, canonical tag etc...
>>
>> Plus a better handling of redirects, detecting duplicated sites,
>> detection of spam cliques, tools to manage the webgraph, etc.
>>
>>>
>>> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
>>> update?
>>
>> Definitely. :)
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>
>
>
> --
> Doğacan Güney
>


-- 
-MilleBii-
