Just a question: will the new HBase implementation allow more sophisticated crawling strategies than the current score-based one?
To give you a few examples of what I'd like to do:

* Define different crawling frequencies for different sets of URLs, say weekly for some URLs, monthly or more for others.
* Select URLs to re-crawl based on attributes previously extracted. Just one example: re-crawl URLs that contained a certain keyword (or set of keywords); see the sketch at the end of this mail.
* Select URLs that have not yet been crawled, i.e. the ones at the frontier of the crawl.

2010/4/7, Doğacan Güney <doga...@gmail.com>:
> Hey everyone,
>
> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki <a...@getopt.org> wrote:
>> On 2010-04-06 15:43, Julien Nioche wrote:
>>> Hi guys,
>>>
>>> I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
>>> based on what is currently referred to as NutchBase. Shall we create a
>>> branch for 2.0 in the Nutch SVN repository and have a label accordingly
>>> for JIRA so that we can file issues / feature requests on 2.0? Do you
>>> think that the current NutchBase could be used as a basis for the 2.0
>>> branch?
>>
>> I'm not sure what the status of the nutchbase is - it's missed a lot of
>> fixes and changes in trunk since it's been last touched ...
>>
> I know... But I still intend to finish it, I just need to schedule
> some time for it.
>
> My vote would be to go with nutchbase.
>
>>> Talking about features, what else would we add apart from:
>>>
>>> * support for HBase: via ORM or not (see
>>>   NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)
>>
>> This IMHO is promising, this could open the doors to small-to-medium
>> installations that are currently too cumbersome to handle.
>>
> Yeah, there is already a simple ORM within nutchbase that is avro-based
> and should be generic enough to also support MySQL, cassandra and
> berkeleydb. But any good ORM will be a very good addition.
>
>>> * plugin cleanup: Tika only for parsing - get rid of everything else?
>>
>> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
>> different API) so that we can post-process the DOM created in Tika from
>> whatever original format.
>>
>> Also, the goal of the crawler-commons project is to provide APIs and
>> implementations of stuff that is needed for every open source crawler
>> project, like: robots handling, url filtering and url normalization,
>> URL state management, perhaps deduplication. We should coordinate our
>> efforts, and share code freely so that other projects (bixo, heritrix,
>> droids) may contribute to this shared pool of functionality, much like
>> Tika does for the common need of parsing complex formats.
>>
>>> * remove index / search and delegate to SOLR
>>
>> +1 - we may still keep a thin abstract layer to allow other
>> indexing/search backends, but the current mess of indexing/query
>> filters and competing indexing frameworks (lucene, fields, solr)
>> should go away. We should go directly from DOM to a NutchDocument,
>> and stop there.
>>
> Agreed. I would like to add support for katta and other indexing
> backends at some point, but NutchDocument should be our canonical
> representation. The rest should be up to indexing backends.
>
>> Regarding search - currently the search API is too low-level, with the
>> custom text and query analysis chains. This needlessly introduces the
>> (in)famous Nutch Query classes and Nutch query syntax limitations. We
>> should get rid of it and simply leave this part of the processing to
>> the search backend. Probably we will use the SolrCloud branch that
>> supports sharding and global IDF.
>>
>>> * new functionalities e.g. sitemap support, canonical tag etc...
>>
>> Plus a better handling of redirects, detecting duplicated sites,
>> detection of spam cliques, tools to manage the webgraph, etc.
>>
>>> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs
>>> an update?
>>
>> Definitely. :)
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>> Information Retrieval, Semantic Web
>> Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>
> --
> Doğacan Güney

--
-MilleBii-
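
PS: to make the second example above concrete, here is a minimal sketch of the kind of selection an HBase-backed crawl db would make cheap. It is purely illustrative: the table name "webpage", the column family "p" and the qualifier "keywords" are made-up placeholders, not the actual nutchbase schema, and in practice this would run as a scan inside the generator job rather than as a standalone client.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.util.Bytes;

/** Selects re-crawl candidates whose extracted keywords contain a given term. */
public class KeywordRecrawlSelector {

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical per-URL table; rows are keyed by URL.
    HTable table = new HTable(conf, "webpage");

    // Keep only rows whose (hypothetical) "p:keywords" column contains "hadoop".
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("p"), Bytes.toBytes("keywords"),
        CompareOp.EQUAL, new SubstringComparator("hadoop"));
    filter.setFilterIfMissing(true); // skip rows with no extracted keywords at all

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("p"));
    scan.setFilter(filter);

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // A real generator would mark the row for fetching instead of printing it.
        System.out.println("re-crawl candidate: " + Bytes.toString(row.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

The same pattern would cover the first example as well: keep a per-URL fetch interval or next-fetch time in another column and filter on it when building the fetch list.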
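
PPS: regarding the avro-based ORM Doğacan mentions above, I haven't looked at the nutchbase code, but I imagine the idea is roughly this: the web page record is described by an Avro schema, and a backend-agnostic store maps such records onto HBase, MySQL, Cassandra or BerkeleyDB. A made-up sketch with plain Avro (the schema, field names and class name are invented for illustration, not taken from nutchbase):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

/** Illustrative only: a hand-written Avro schema for a crawl record. */
public class WebPageRecordExample {

  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"WebPage\",\"fields\":["
    + "{\"name\":\"baseUrl\",\"type\":\"string\"},"
    + "{\"name\":\"status\",\"type\":\"int\"},"
    + "{\"name\":\"fetchTime\",\"type\":\"long\"},"
    + "{\"name\":\"fetchInterval\",\"type\":\"int\"},"
    + "{\"name\":\"score\",\"type\":\"float\"}]}";

  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    GenericRecord page = new GenericData.Record(schema);
    page.put("baseUrl", "http://example.com/");
    page.put("status", 1);
    page.put("fetchTime", System.currentTimeMillis());
    page.put("fetchInterval", 7 * 24 * 3600); // weekly re-crawl for this URL
    page.put("score", 1.0f);

    System.out.println(page);
  }
}

If the store layer only ever sees such schema-described records, per-URL attributes like fetchInterval become ordinary fields any backend can filter on, which is exactly what the crawling strategies above need.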