Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
On Wed, Apr 7, 2010 at 20:32, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-07 18:54, Doğacan Güney wrote: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0

Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
Hi, On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote: Just a question: will the new HBase implementation allow more sophisticated crawling strategies than the current score-based one? To give you a few examples of what I'd like to do: define different crawling frequencies for

Re: Nutch 2.0 roadmap

2010-04-08 Thread MilleBii
Not sure what you mean by Pig script, but I'd like to be able to make a multi-criteria selection of URLs for fetching... The scoring method forces you into a kind of one-dimensional approach, which is not really easy to deal with. The regex filters are good, but they assume you want to select URLs on data

Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
On Thu, Apr 8, 2010 at 21:11, MilleBii mille...@gmail.com wrote: Not sure what you mean by Pig script, but I'd like to be able to make a multi-criteria selection of URLs for fetching... I mean a query language like http://hadoop.apache.org/pig/. If we expose data correctly, then you should be

Re: Nutch 2.0 roadmap

2010-04-07 Thread Julien Nioche
Hi, I'm not sure what the status of nutchbase is - it's missed a lot of fixes and changes in trunk since it was last touched ... Yes, maybe we should start the 2.0 branch from 1.1 instead. Dogacan - what do you think? BTW, I see there is now a 2.0 label under JIRA, thanks to whoever

Re: Nutch 2.0 roadmap

2010-04-07 Thread Doğacan Güney
Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for

Re: Nutch 2.0 roadmap

2010-04-07 Thread Enis Söztutar
Hi, On 04/07/2010 07:54 PM, Doğacan Güney wrote: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is

Re: Nutch 2.0 roadmap

2010-04-07 Thread Enis Söztutar
Forgot to say that, at Hadoop, it is the convention that big issues, like the ones under discussion, come with a design document, so that a solid design is agreed upon for the work. We can apply the same pattern at Nutch. On 04/07/2010 07:54 PM, Doğacan Güney wrote: Hey everyone, On Tue, Apr

Re: Nutch 2.0 roadmap

2010-04-07 Thread Andrzej Bialecki
On 2010-04-07 18:54, Doğacan Güney wrote: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to

Re: Nutch 2.0 roadmap

2010-04-07 Thread Andrzej Bialecki
On 2010-04-07 19:24, Enis Söztutar wrote: Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, URL filtering and URL normalization, URL state management, perhaps

Re: Nutch 2.0 roadmap

2010-04-07 Thread MilleBii
Just a question: will the new HBase implementation allow more sophisticated crawling strategies than the current score-based one? To give you a few examples of what I'd like to do: define different crawling frequencies for different sets of URLs, say weekly for some URLs, monthly or more for others.

Re: Nutch 2.0 roadmap

2010-04-06 Thread Andrzej Bialecki
On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can