Re: The crawl command, keep or get rid of
+1 let's replace it with a shell script instead.

On 22 August 2011 21:56, Markus Jelsma markus.jel...@openindex.io wrote:

> Hi,
>
> The crawl command seems to add a lot of confusion. It hides the entire crawl cycle logic from new users, leading to questions, lack of understanding of basic Nutch concepts, unsupported switches of the jobs it executes, more problems etc.
>
> I am quite an opponent of the crawl command and would also not recommend it to anyone including new users. A running Nutch almost always requires some scripting here and there, cron jobs, locks etc.
>
> I propose (most likely a challenging statement) to deprecate the crawl command in 1.4. Users, developers, please comment.
>
> Thanks

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: The crawl command, keep or get rid of
What kind of shell script did you have in mind? The wiki already provides some useful scripts. The tutorials on Nutch also show commands that can be used in custom scripts.

Is an immediate crawl-with-one-command a desired feature? Provided as Java code or shell script?

On Tuesday 23 August 2011 10:12:57 Julien Nioche wrote:

> +1 let's replace it with a shell script instead.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: The crawl command, keep or get rid of
> What kind of shell script did you have in mind? The wiki already provides some useful scripts. The tutorials on Nutch also show commands that can be used in custom scripts.

That's exactly my point. There are various scripts in the wiki, written for different versions of Nutch and of variable quality (e.g. some won't work in distributed mode). Let's have one in the repository so that people stop reinventing the wheel or asking where to get one. Of course most of the script will exemplify the commands from the wiki, so it will have good educational value as well as being useful.

Julien

> Is an immediate crawl-with-one-command a desired feature? Provided as Java code or shell script?

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: The crawl command, keep or get rid of
You're right: https://issues.apache.org/jira/browse/NUTCH-1087

On Tuesday 23 August 2011 13:24:27 Julien Nioche wrote:

> That's exactly my point. There are various scripts in the wiki, based on different versions of Nutch and of variable quality (e.g. some won't work in distributed mode) etc... Let's have one in the repository so that people stop reinventing the wheel or ask where to get one. Of course most of the script will exemplify the commands from the Wiki and it will have a good educational value as well as being useful

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: The crawl command, keep or get rid of
I agree. Nuke crawl command
Re: The crawl command, keep or get rid of
I wonder if the name crawl implies that the command is a sort of standard command, and all you would need? After all, if I were to sit down with a crawler, it seems very logical that crawl would be how you run it!

I like the simplicity of crawl from a getting-started approach. I agree, though, that I used it as a shortcut... I didn't want to learn all the lower-level concepts, I just wanted to crawl a couple of URLs and toss them into Solr. crawl and the example code did great!

Maybe instead of having crawl be a core part of running Nutch, it becomes run-example-crawl.sh, and the wiki adds the caveat that you should then look inside it and learn all the various steps.

Eric

On Aug 23, 2011, at 6:50 AM, Markus Jelsma wrote:

> What kind of shell script did you have in mind? The wiki already provides some useful scripts. The tutorials on Nutch also show commands that can be used in custom scripts.
>
> Is an immediate crawl-with-one-command a desired feature? Provided as Java code or shell script?

--
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server