+1

-----Original Message-----
From: Franz Werfel [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 07, 2006 10:11 AM
To: nutch-user@lucene.apache.org
Subject: Re: project vitality? / less documentation is more!

Hello,

"single site crawling" wouldn't address the confusion that results from the fact that the 'crawl' command is actually the concatenation of several commands; and it would not be true either, since you can do "several sites crawling" with 'crawl'.

But I have to agree that it helps "getting up and running quickly"; however my point is that, after this first phase, it is _more_ difficult to go to the next phase than if one hadn't used this first step first...

Maybe at the end of the tutorial for "Intranet crawling" the following sentence could be added: "If you want to crawl the same site _again_, use the whole-web tutorial below, and NOT the crawl command."

Also, the sentence "Whole-web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines" is misleading, since one has to use whole-web crawling to fine-tune or recrawl even the smallest of websites. The distinction is not only on the scale of the project, but on the level of control one wants (IMHO). The documentation should at least give hints in that direction.

Thanks,
Frank.

On 3/7/06, Vanderdray, Jacob <[EMAIL PROTECTED]> wrote:
> -1
>
> I found the instructions for doing an "Intranet crawl" extremely helpful for getting up and running quickly. I went back later and figured out more about what it was actually doing. Perhaps the name could just be changed to "Single Site Crawling with the Nutch Shell Script" and some explanatory text could be added.
>
> I'll try to take the time today to put a version of the tutorial on the wiki that does that. Then if folks agree, I'll put together a patch that changes the site links for the tutorial to point at the wiki.
>
> Thanks,
> Jake.
>
> -----Original Message-----
> From: Franz Werfel [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 07, 2006 3:01 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: project vitality? / less documentation is more!
>
> Hello,
>
> Just my 2 cents: the "Intranet crawl" functionality is VERY confusing.
>
> If it was just taken out of the tutorial, and out of the set of commands, that would actually help A LOT: I understood many many things about Nutch once I tried so-called whole-web crawling, where one has to use every command one at a time. And that would also eliminate all the questions about "how to recrawl", etc.
>
> Or maybe a change of name would be enough: "Intranet crawl" could be called "fast-setup crawl", and "whole-web crawling", "serious crawling for Intranet or whole-web projects".
>
> What do you think?
>
> Thanks, Frank.
>