Thanks, but after trying Nutch 0.7, 0.8, and 0.9 over several months, I have never had a fetch that didn't crash, seemingly irreparably, before the third iteration.
Often the fetch completes, but the program keeps running for a long time. If I kill it, the crawl is ruined, but left alone it seems to run forever, an endless loop.

On Dec 19, 2007 6:23 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> You can delete everything in segments except crawl_generate, then start
> the fetch again. Here are some hints about fetching:
>
> 1) If you can, do smaller fetches in the 1-5 million page range, then
> aggregate them together. That way, if something goes wrong, you haven't
> lost a lot.
>
> 2) Setting generate.max.per.host to a small number instead of -1 will
> make fetches run faster, but won't get all pages from a single site.
> This is good if you are doing general web crawls.
>
> 3) Avoid using regex-urlfilter if possible; the prefix and suffix URL
> filters tend to work much better and don't cause stalls.
>
> 4) Setting fetcher.max.crawl.delay (default is 30 seconds) will avoid
> pages with long crawl delays. Again, this makes fetching go faster.
>
> Hope this helps.
>
> Dennis Kubes
>
> Josh Attenberg wrote:
> > I was fetching a large segment, ~30 million URLs, when something weird
> > happened and the fetcher crashed. I know I can't recover the portion I
> > have fetched already, but I'd like to start over. When I try to do this,
> > I get the following error: already exists: segments/20071214095336/fetcher
> >
> > What can I do to retry this? Can I just delete that file and try again?
> > I have had nothing but trouble performing large fetches with Nutch, but
> > I can't give up! Please help!
> >
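
Also, just to check my understanding of Dennis's first point: for the
"already exists: segments/20071214095336/fetcher" error, is the cleanup
roughly the sketch below? (This assumes the crawl lives on the local
filesystem; on HDFS I guess the deletes would go through
bin/hadoop dfs -rmr instead.)

    # segment name taken from the error message above
    cd segments/20071214095336
    # keep crawl_generate (the fetch list); delete whatever partial output
    # exists besides it, e.g. the 'fetcher' dir from the error, plus
    # crawl_fetch/content if they got created before the crash
    rm -rf fetcher crawl_fetch content crawl_parse parse_data parse_text
    cd ../..
    # then re-run the fetch against the same segment
    bin/nutch fetch segments/20071214095336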

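And for points 2-4, am I right that the overrides go in conf/nutch-site.xml,
something like the following? The two property names are the ones Dennis
mentions; the plugin.includes value is only my guess at swapping
regex-urlfilter for the prefix/suffix filters, and the 100-per-host cap is
just an example number.

    <!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
    <configuration>
      <property>
        <name>generate.max.per.host</name>
        <value>100</value>
        <description>Example cap on pages generated per host;
        -1 means unlimited.</description>
      </property>
      <property>
        <name>fetcher.max.crawl.delay</name>
        <value>30</value>
        <description>Skip pages whose robots.txt asks for a crawl delay
        longer than this many seconds (30 is the default Dennis
        mentions).</description>
      </property>
      <property>
        <name>plugin.includes</name>
        <!-- my guess: copy the default list from nutch-default.xml and
             replace urlfilter-regex with urlfilter-(prefix|suffix);
             abbreviated here -->
        <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
      </property>
    </configuration>

If I read the docs right, the allowed prefixes and suffixes then go in
conf/prefix-urlfilter.txt and conf/suffix-urlfilter.txt, though I may have
those filenames wrong.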