You're right about the "single site" thing, but I think just changing the title and adding a bit more of an explanation should do the trick. I went ahead and put up a version of the tutorial on the wiki. I haven't changed it other than to try to get the formatting similar to what's on the current tutorial. Feel free to edit.
http://wiki.apache.org/nutch/NutchTutorial

Thanks,
Jake.

-----Original Message-----
From: Franz Werfel [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 07, 2006 10:11 AM
To: nutch-user@lucene.apache.org
Subject: Re: project vitality? / less documentation is more!

Hello,

"single site crawling" wouldn't address the confusion that results from the fact that the 'crawl' command is actually the concatenation of several commands; and it would not be true either, since you can do "several sites crawling" with 'crawl'.

But I have to agree that it helps "getting up and running quickly"; however my point is that, after this first phase, it is _more_ difficult to go to the next phase than if one hadn't used this first step first...

Maybe at the end of the tutorial for "Intranet crawling" the following sentence could be added: "If you want to crawl the same site _again_, use the whole-web tutorial below, and NOT the crawl command."

Also, the sentence "Whole-web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines" is misleading, since one has to use whole-web crawling to fine-tune or recrawl even the smallest of websites. The distinction is not only on the scale of the project, but on the level of control one wants (IMHO). The documentation should at least give hints in that direction.

Thanks, Frank.

On 3/7/06, Vanderdray, Jacob <[EMAIL PROTECTED]> wrote:
> -1
>
> I found the instructions for doing an "Intranet crawl" extremely
> helpful for getting up and running quickly. I went back later and
> figured out more about what it was actually doing. Perhaps the name
> could just be changed to "Single Site Crawling with the Nutch Shell
> Script" and some explanatory text could be added.
>
> I'll try to take the time today to put a version of the tutorial
> on the wiki that does that. Then if folks agree, I'll put together a
> patch that changes the site links for the tutorial to point at the wiki.
>
> Thanks,
> Jake.
>
> -----Original Message-----
> From: Franz Werfel [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 07, 2006 3:01 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: project vitality? / less documentation is more!
>
> Hello,
>
> Just my 2 cents: the "Intranet crawl" functionality is VERY confusing.
>
> If it was just taken out of the tutorial, and out of the set of
> commands, that would actually help A LOT: I understood many many
> things about Nutch once I tried so-called whole-web crawling, where
> one has to use every command one at a time. And that would also
> eliminate all the questions about "how to recrawl", etc.
>
> Or maybe a change of name would be enough: "Intranet crawl" could be
> called "fast-setup crawl", and "whole-web crawling", "serious crawling
> for Intranet or whole-web projects".
>
> What do you think?
>
> Thanks, Frank.
>