Hi,

On Wed, Apr 7, 2010 at 21:19, MilleBii <mille...@gmail.com> wrote:
> Just a question ?
> Will the new HBase implementation allow more sophisticated crawling
> strategies than the current score based.
>
> Give you a few example of what I'd like to do :
> Define different crawling frequency for different set of URLs, say
> weekly for some url, monthly or more for others.
>
> Select URLs to re-crawl based on attributes previously extracted. Just
> one example: recrawl urls that contained a certain keyword (or set of)
>
> Select URLs that have not yet been crawled, at the frontier of the
> crawl therefore
>
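For what it's worth, the three cases quoted above could be written as small
selector functions, roughly like this. It is only a sketch in Python, not
nutchbase code: `UrlRecord`, its fields, and `generate` are made-up names for
illustration, and a real version would run as a distributed job rather than
over an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class UrlRecord:
    """Hypothetical per-URL row; field names are illustrative only."""
    url: str
    last_fetch: Optional[datetime] = None  # None means never fetched
    refetch_interval: timedelta = timedelta(days=30)
    keywords: Set[str] = field(default_factory=set)  # extracted at parse time


# A selector decides whether one URL should go into the next fetch list.
Selector = Callable[[UrlRecord, datetime], bool]


def due_for_refetch(rec: UrlRecord, now: datetime) -> bool:
    """Per-URL crawl frequency: weekly for some URLs, monthly for others."""
    return rec.last_fetch is not None and now - rec.last_fetch >= rec.refetch_interval


def has_keyword(wanted: Set[str]) -> Selector:
    """Re-crawl already fetched URLs whose content had one of the keywords."""
    def selector(rec: UrlRecord, now: datetime) -> bool:
        return rec.last_fetch is not None and bool(rec.keywords & wanted)
    return selector


def on_frontier(rec: UrlRecord, now: datetime) -> bool:
    """URLs that are known but have never been fetched."""
    return rec.last_fetch is None


def generate(records: Iterable[UrlRecord],
             selectors: List[Selector],
             now: datetime) -> List[str]:
    """Return the URLs accepted by at least one selector, in input order."""
    return [r.url for r in records if any(s(r, now) for s in selectors)]
```

Swapping or combining strategies then means passing a different list of
selectors to `generate`, without touching the rest of the code.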

At some point, it would be nice to change the generator so that it is
only a handful of methods plus a Pig (or something else) script. We
would provide most of the functions you may need during generation
(for accessing various data), but the actual generation would be a Pig
process. This way, anyone can easily change generation any way they
want (even split it into more than 2 jobs if they want more complex
schemes).

>
>
>
> 2010/4/7, Doğacan Güney <doga...@gmail.com>:
>> Hey everyone,
>>
>> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki <a...@getopt.org> wrote:
>>> On 2010-04-06 15:43, Julien Nioche wrote:
>>>> Hi guys,
>>>>
>>>> I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
>>>> based on what is currently referred to as NutchBase. Shall we create a
>>>> branch for 2.0 in the Nutch SVN repository and have a label accordingly
>>>> for JIRA so that we can file issues / feature requests on 2.0? Do you
>>>> think that the current NutchBase could be used as a basis for the 2.0
>>>> branch?
>>>
>>> I'm not sure what is the status of the nutchbase - it's missed a lot of
>>> fixes and changes in trunk since it's been last touched ...
>>>
>>
>> I know... But I still intend to finish it, I just need to schedule
>> some time for it.
>>
>> My vote would be to go with nutchbase.
>>
>>>>
>>>> Talking about features, what else would we add apart from :
>>>>
>>>> * support for HBase : via ORM or not (see
>>>> NUTCH-808<https://issues.apache.org/jira/browse/NUTCH-808>
>>>> )
>>>
>>> This IMHO is promising, this could open the doors to small-to-medium
>>> installations that are currently too cumbersome to handle.
>>>
>>
>> Yeah, there is already a simple ORM within nutchbase that is
>> avro-based and should be generic enough to also support MySQL,
>> cassandra and berkeleydb. But any good ORM will be a very good
>> addition.
>>
>>>> * plugin cleanup : Tika only for parsing - get rid of everything else?
>>>
>>> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
>>> different API) so that we can post-process the DOM created in Tika from
>>> whatever original format.
>>>
>>> Also, the goal of the crawler-commons project is to provide APIs and
>>> implementations of stuff that is needed for every open source crawler
>>> project, like: robots handling, url filtering and url normalization, URL
>>> state management, perhaps deduplication. We should coordinate our
>>> efforts, and share code freely so that other projects (bixo, heritrix,
>>> droids) may contribute to this shared pool of functionality, much like
>>> Tika does for the common need of parsing complex formats.
>>>
>>>> * remove index / search and delegate to SOLR
>>>
>>> +1 - we may still keep a thin abstract layer to allow other
>>> indexing/search backends, but the current mess of indexing/query filters
>>> and competing indexing frameworks (lucene, fields, solr) should go away.
>>> We should go directly from DOM to a NutchDocument, and stop there.
>>>
>>
>> Agreed. I would like to add support for katta and other indexing
>> backends at some point but NutchDocument should be our canonical
>> representation. The rest should be up to indexing backends.
>>
>>> Regarding search - currently the search API is too low-level, with the
>>> custom text and query analysis chains. This needlessly introduces the
>>> (in)famous Nutch Query classes and Nutch query syntax limitations, We
>>> should get rid of it and simply leave this part of the processing to the
>>> search backend. Probably we will use the SolrCloud branch that supports
>>> sharding and global IDF.
>>>
>>>> * new functionalities e.g. sitemap support, canonical tag etc...
>>>
>>> Plus a better handling of redirects, detecting duplicated sites,
>>> detection of spam cliques, tools to manage the webgraph, etc.
>>>
>>>>
>>>> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
>>>> update?
>>>
>>> Definitely. :)
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki <><
>>> ___. ___ ___ ___ _ _ __________________________________
>>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>>> ___|||__|| \| || | Embedded Unix, System Integration
>>> http://www.sigram.com Contact: info at sigram dot com
>>>
>>>
>>
>>
>>
>> --
>> Doğacan Güney
>>
>
>
> --
> -MilleBii-
>

--
Doğacan Güney