Nutch generates the full list of URLs to be fetched (with the generate command) before the fetch process begins, so there isn't really a concept of a seed URL at fetch time. If you have a single URL and begin a fetch process with 10 threads, only that single URL will be fetched. Afterwards the link database is updated with the links found on that one page, and through multiple iterations of the crawl process the other pages are then fetched.
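
To make the batch model concrete, here is a toy simulation in Python. It is only a sketch and not Nutch code: the link graph and the names LINKS, generate, fetch and crawl are all made up for illustration, and the real generate step also handles scoring and limits that are ignored here.

```python
# Toy model of a batch crawl cycle: generate -> fetch -> update.
# Not Nutch code: LINKS, generate, fetch and crawl are hypothetical names.

# Hypothetical link graph: page -> outlinks found on that page.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/b": ["http://example.com/c", "http://example.com/d"],
    "http://example.com/c": [],
    "http://example.com/d": [],
}


def generate(known, fetched):
    """Build the complete fetch list up front: known URLs not yet fetched."""
    return sorted(known - fetched)


def fetch(fetch_list):
    """Fetch every URL on the fixed list and return the outlinks discovered."""
    found = set()
    for url in fetch_list:
        found.update(LINKS.get(url, []))
    return found


def crawl(seeds, rounds):
    """Run generate/fetch/update rounds; return the list fetched in each round."""
    known, fetched, history = set(seeds), set(), []
    for _ in range(rounds):
        fetch_list = generate(known, fetched)
        if not fetch_list:
            break  # nothing new to fetch, so the crawl is finished
        outlinks = fetch(fetch_list)
        fetched.update(fetch_list)
        known.update(outlinks)  # the "update the database" step
        history.append(fetch_list)
    return history


if __name__ == "__main__":
    for i, batch in enumerate(crawl(["http://example.com/"], rounds=5), start=1):
        print(f"round {i}: {len(batch)} URL(s) -> {batch}")
```

With one starting URL the first round's list has exactly one entry, so however many threads are configured, there is only one URL for them to share. The list only grows in later rounds, after the update step has added the newly discovered links.
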
I suggest you look over the Nutch Tutorial at http://wiki.apache.org/nutch/NutchTutorial . The crawl command is basically a wrapper that runs multiple iterations of the process described in the step-by-step (whole-web crawling) section, so it is worth understanding how the step-by-step process works.

I've also recently looked over the Fetcher.java code, and what I can't figure out is how a specific FetcherThread times out on a particular URL. It looks to me as though a URL that takes an extremely long time to respond would lock up its thread.

On 8/4/06, jian chen <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I just want to understand how fetcher threads got terminated. Looking at the
> Fetcher.java, it seems to me that the fetcher thread just exits if there is
> no url to fetch.
>
> Now, if I initialize the system with 10 fetcher threads and only 1 seed url.
> If the seed url is removed by thread 1 and in the process of generating more
> urls. The other 9 fetcher threads can look into the url queue and see no
> urls available. According to the current logic, these 9 threads will
> exit/terminate. Right?
>
> Thus, even though you can specify 10 threads for fetching, in practice, the
> system could be left with only one thread running all the time, defeating
> the purpose of multi-threading, no?
>
> Could some one enlighten me more about how nutch works in this regard?
>
> Thanks a lot!
>
> Jian

--
http://JacobBrunson.com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
