Re: The crawl command, keep or get rid of

2011-08-23 Thread Julien Nioche
+1 let's replace it with a shell script instead.

On 22 August 2011 21:56, Markus Jelsma markus.jel...@openindex.io wrote:

 Hi,

 The crawl command seems to add a lot of confusion. It hides the entire crawl
 cycle logic from new users, leading to questions, lack of understanding of
 basic Nutch concepts, unsupported switches of the jobs it executes, more
 problems etc. I am quite an opponent of the crawl command and would also not
 recommend it to anyone including new users. A running Nutch almost always
 requires some scripting here and there, cron jobs, locks etc.

 I propose (most likely a challenging statement) to deprecate the crawl command
 in 1.4.

 Users, developers, please comment.

 Thanks




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: The crawl command, keep or get rid of

2011-08-23 Thread Markus Jelsma
What kind of shell script did you have in mind? The wiki already provides some 
useful scripts. The tutorials on Nutch also show commands that can be used in 
custom scripts.

Is an immediate crawl-with-one-command a desired feature? Provided as Java 
code or shell script?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: The crawl command, keep or get rid of

2011-08-23 Thread Julien Nioche
 What kind of shell script did you have in mind? The wiki already provides some
 useful scripts. The tutorials on Nutch also show commands that can be used in
 custom scripts.


That's exactly my point. There are various scripts in the wiki, based on
different versions of Nutch and of variable quality (e.g. some won't work
in distributed mode). Let's have one in the repository so that people
stop reinventing the wheel or asking where to get one.
Of course, most of the script will exemplify the commands from the Wiki, so
it will have good educational value as well as being useful.
Julien




Re: The crawl command, keep or get rid of

2011-08-23 Thread Markus Jelsma
You're right: https://issues.apache.org/jira/browse/NUTCH-1087



Re: The crawl command, keep or get rid of

2011-08-23 Thread Radim Kolar

I agree. Nuke the crawl command.


Re: The crawl command, keep or get rid of

2011-08-23 Thread Eric Pugh
I wonder if the name crawl implies that the command is a sort of standard
command, and all you would need? After all, if I were to sit down with a
crawler, it seems very logical that crawl would be how you run it! I like
the simplicity of crawl from a getting-started perspective. I agree, though,
that I used it as a shortcut: I didn't want to learn all the lower-level
concepts, I just wanted to crawl a couple of URLs and toss them into Solr.
crawl and the example code did great!

Maybe instead of having crawl be a core part of running Nutch, it becomes
run-example-crawl.sh, with a caveat in the Wiki that you should then look
inside it and learn all the various steps.

Eric


-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from 
http://www.packtpub.com/solr-1-4-enterprise-search-server