You're right about the "single site" thing, but I think just
changing the title and adding a bit more of an explanation should do the
trick.  I went ahead and put up a version of the tutorial on the wiki.
I haven't changed it other than to try to get the formatting similar to
what's on the current tutorial.  Feel free to edit.

http://wiki.apache.org/nutch/NutchTutorial

Thanks,
Jake.

-----Original Message-----
From: Franz Werfel [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 07, 2006 10:11 AM
To: nutch-user@lucene.apache.org
Subject: Re: project vitality? / less documentation is more!

Hello,

"single site crawling" wouldn't address the confusion that results
from the fact that the 'crawl' command is actually the concatenation
of several commands; and it would not be true either, since you can do
"several sites crawling" with 'crawl'.

But I have to agree that it helps "getting up and running quickly";
however, my point is that, after this first phase, it is _more_
difficult to move on to the next phase than if one had not taken this
first step at all...

Maybe at the end of the tutorial for "Intranet crawling" the following
sentence could be added:
"If you want to crawl the same site _again_, use the whole-web
tutorial below, and NOT the crawl command."

Also, the sentence "Whole-web crawling is designed to handle very
large crawls which may take weeks to complete, running on multiple
machines" is misleading, since one has to use whole-web crawling to
fine-tune or recrawl even the smallest of websites.

The distinction is not only on the scale of the project, but on the
level of control one wants (IMHO). The documentation should at least
give hints in that direction.

Thanks, Frank.



On 3/7/06, Vanderdray, Jacob <[EMAIL PROTECTED]> wrote:
> -1
>
>        I found the instructions for doing an "Intranet crawl" extremely
> helpful for getting up and running quickly.  I went back later and
> figured out more about what it was actually doing.  Perhaps the name
> could just be changed to "Single Site Crawling with the Nutch Shell
> Script" and some explanatory text could be added.
>
>        I'll try to take the time today to put a version of the tutorial
> on the wiki that does that.  Then if folks agree, I'll put together a
> patch that changes the site links for the tutorial to point at the wiki.
>
> Thanks,
> Jake.
>
> -----Original Message-----
> From: Franz Werfel [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 07, 2006 3:01 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: project vitality? / less documentation is more!
>
> Hello,
>
> Just my 2 cents: the "Intranet crawl" functionality is VERY confusing.
>
> If it was just taken out of the tutorial, and out of the set of
> commands, that would actually help A LOT: I understood many many
> things about Nutch once I tried so-called whole-web crawling, where
> one has to use every command one at a time. And that would also
> eliminate all the questions about "how to recrawl", etc.
>
> Or maybe a change of name would be enough: "Intranet crawl" could be
> called "fast-setup crawl", and "whole-web crawling", "serious crawling
> for Intranet or whole-web projects".
>
> What do you think?
>
> Thanks, Frank.
>
