On Thu, Apr 8, 2010 at 21:11, MilleBii <mille...@gmail.com> wrote:
> Not sure what you mean by pig script, but I'd like to be able to make a
> multi-criteria selection of URLs for fetching...

I mean a query language like http://hadoop.apache.org/pig/. If we expose
the data correctly, then you should be able to generate on any criteria
that you want.
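Something like this, say. A minimal sketch in Pig Latin; NutchTableLoader,
NutchTableStorer and the field names are invented placeholders for whatever
we end up exposing, not an existing Nutch API:

    -- Sketch only: the loader/storer and fields are hypothetical.
    pages = LOAD 'webtable' USING NutchTableLoader()
            AS (url:chararray, status:int, label:chararray);

    -- pages previously labeled "topicA" by user-supplied parse logic
    topic_a = FILTER pages BY label == 'topicA';

    -- unfetched frontier URLs: discovered but never crawled
    frontier = FILTER pages BY status == 0;

    -- fetch both sets in this round
    candidates = UNION topic_a, frontier;
    STORE candidates INTO 'fetchlist' USING NutchTableStorer();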
> The scoring method forces you into a kind of mono-dimensional approach
> which is not really easy to deal with.
>
> The regex filters are good, but they assume you want to select URLs on
> data which is in the URL... pretty limited in fact.
>
> I basically would like to do 'content'-based crawling. Say for
> example that I'm interested in "topic A".
> I'd like to label URLs that match "topic A" (user-supplied logic).
> Later on I would want to crawl "topic A" URLs at a certain frequency
> and non-labeled URLs for exploring in a different way.
>
> This looks hard to do right now.
>
> 2010/4/8, Doğacan Güney <doga...@gmail.com>:
>> Hi,
>>
>> On Wed, Apr 7, 2010 at 21:19, MilleBii <mille...@gmail.com> wrote:
>>> Just a question:
>>> Will the new HBase implementation allow more sophisticated crawling
>>> strategies than the current score-based one?
>>>
>>> To give a few examples of what I'd like to do:
>>> Define different crawling frequencies for different sets of URLs, say
>>> weekly for some URLs, monthly or more for others.
>>>
>>> Select URLs to re-crawl based on attributes previously extracted. Just
>>> one example: recrawl URLs that contained a certain keyword (or set of
>>> keywords).
>>>
>>> Select URLs that have not yet been crawled, i.e. at the frontier of
>>> the crawl.
>>>
>>
>> At some point, it would be nice to change the generator so that it is
>> only a handful of methods plus a pig (or something else) script. We
>> would provide most of the functions you may need during generation
>> (for accessing various data), but the actual generation would be a pig
>> process. This way, anyone can easily change generation any way they
>> want (even split it into more than 2 jobs if they want more complex
>> schemes).
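To make that concrete, the weekly/monthly recrawl scheme asked about above
might then be just a few lines of such a script. Same caveats as before: the
loader and the fields are invented for illustration only:

    -- Sketch with invented fields: each URL carries its own recrawl
    -- interval (weekly, monthly, ...) set when it was labeled.
    pages = LOAD 'webtable' USING NutchTableLoader()
            AS (url:chararray, last_fetch:long, crawl_interval:long);

    -- due for recrawl once the per-URL interval has elapsed; $now is
    -- supplied at run time, e.g. pig -param now=$(date +%s) generate.pig
    due = FILTER pages BY (last_fetch + crawl_interval) < $now;

    STORE due INTO 'fetchlist' USING NutchTableStorer();

Changing strategy would then mean editing a script rather than patching the
Generator job itself.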
>>>
>>>
>>> 2010/4/7, Doğacan Güney <doga...@gmail.com>:
>>>> Hey everyone,
>>>>
>>>> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki <a...@getopt.org> wrote:
>>>>> On 2010-04-06 15:43, Julien Nioche wrote:
>>>>>> Hi guys,
>>>>>>
>>>>>> I gather that we'll jump straight to 2.0 after 1.1 and that 2.0
>>>>>> will be based on what is currently referred to as NutchBase. Shall
>>>>>> we create a branch for 2.0 in the Nutch SVN repository and have a
>>>>>> label accordingly for JIRA, so that we can file issues / feature
>>>>>> requests on 2.0? Do you think that the current NutchBase could be
>>>>>> used as a basis for the 2.0 branch?
>>>>>
>>>>> I'm not sure what the status of the nutchbase is - it has missed a
>>>>> lot of fixes and changes in trunk since it was last touched...
>>>>>
>>>>
>>>> I know... But I still intend to finish it, I just need to schedule
>>>> some time for it.
>>>>
>>>> My vote would be to go with nutchbase.
>>>>
>>>>>>
>>>>>> Talking about features, what else would we add apart from:
>>>>>>
>>>>>> * support for HBase: via ORM or not (see
>>>>>> NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)
>>>>>
>>>>> This IMHO is promising - it could open the doors to small-to-medium
>>>>> installations that are currently too cumbersome to handle.
>>>>>
>>>>
>>>> Yeah, there is already a simple ORM within nutchbase that is
>>>> avro-based and should be generic enough to also support MySQL,
>>>> cassandra and berkeleydb. But any good ORM will be a very good
>>>> addition.
>>>>
>>>>>> * plugin cleanup: Tika only for parsing - get rid of everything
>>>>>> else?
>>>>>
>>>>> Basically, yes - keep only stuff like HtmlParseFilters (probably
>>>>> with a different API) so that we can post-process the DOM created in
>>>>> Tika from whatever original format.
>>>>>
>>>>> Also, the goal of the crawler-commons project is to provide APIs and
>>>>> implementations of stuff that is needed for every open source
>>>>> crawler project, like: robots handling, url filtering and url
>>>>> normalization, URL state management, perhaps deduplication. We
>>>>> should coordinate our efforts, and share code freely so that other
>>>>> projects (bixo, heritrix, droids) may contribute to this shared pool
>>>>> of functionality, much like Tika does for the common need of parsing
>>>>> complex formats.
>>>>>
>>>>>> * remove index / search and delegate to SOLR
>>>>>
>>>>> +1 - we may still keep a thin abstract layer to allow other
>>>>> indexing/search backends, but the current mess of indexing/query
>>>>> filters and competing indexing frameworks (lucene, fields, solr)
>>>>> should go away. We should go directly from the DOM to a
>>>>> NutchDocument, and stop there.
>>>>>
>>>>
>>>> Agreed. I would like to add support for katta and other indexing
>>>> backends at some point, but NutchDocument should be our canonical
>>>> representation. The rest should be up to the indexing backends.
>>>>
>>>>> Regarding search - currently the search API is too low-level, with
>>>>> the custom text and query analysis chains. This needlessly
>>>>> introduces the (in)famous Nutch Query classes and Nutch query syntax
>>>>> limitations. We should get rid of it and simply leave this part of
>>>>> the processing to the search backend. Probably we will use the
>>>>> SolrCloud branch that supports sharding and global IDF.
>>>>>
>>>>>> * new functionalities, e.g. sitemap support, canonical tag, etc.
>>>>>
>>>>> Plus better handling of redirects, detecting duplicated sites,
>>>>> detection of spam cliques, tools to manage the webgraph, etc.
>>>>>
>>>>>>
>>>>>> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture
>>>>>> needs an update?
>>>>>
>>>>> Definitely. :)
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Andrzej Bialecki <><
>>>>> ___. ___ ___ ___ _ _ __________________________________
>>>>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>>>>> ___|||__|| \| || | Embedded Unix, System Integration
>>>>> http://www.sigram.com Contact: info at sigram dot com
>>>>>
>>>>
>>>>
>>>> --
>>>> Doğacan Güney
>>>>
>>>
>>>
>>> --
>>> -MilleBii-
>>>
>>
>>
>> --
>> Doğacan Güney
>>
>
>
> --
> -MilleBii-

--
Doğacan Güney