Can anyone tell me how to control how many segments get created by generate? I tried mergesegs, but that always seems to crash.
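As far as I can tell there is no direct "number of segments" knob: each run of generate produces one segment, and -topN caps how many of the top-scoring URLs go into it, so you size the segment there and loop generate/fetch/updatedb as many times as you need. A rough sketch, assuming an 0.8-style command line and the usual crawl/ directory layout (the topN value is only an example):

    # cap the new segment at roughly one million URLs;
    # crawldb and segments paths assume the usual crawl/ layout
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000000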
On Dec 22, 2007 12:44 PM, Josh Attenberg <[EMAIL PROTECTED]> wrote:

> Regarding point #1 - in nutch 7 & 8, how can one specify the number of
> urls that go into a segment? I guess an equivalent question is, how do you
> control how many segments get created by generate?
>
> On Dec 19, 2007 6:23 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>
> > You can delete everything in segments except crawl_generate, then start
> > the fetch again. Here are some hints about fetching:
> >
> > 1) If you can, do smaller fetches in the 1-5 million page range, then
> > aggregate them together. That way, if something goes wrong you haven't
> > lost a lot.
> >
> > 2) Setting generate.max.per.host to a small number instead of -1 will
> > make fetches run faster, but won't get all pages from a single site.
> > This is good if you are doing general web crawls.
> >
> > 3) Avoid using regex-urlfilter if possible; the prefix and suffix url
> > filters tend to work much better and don't cause stalls.
> >
> > 4) Setting fetcher.max.crawl.delay (default is 30 seconds) will avoid
> > pages with long crawl delays. Again, this makes fetching go faster.
> >
> > Hope this helps.
> >
> > Dennis Kubes
> >
> > Josh Attenberg wrote:
> > > I was fetching a large segment, ~30 million urls, when something weird
> > > happened and the fetcher crashed. I know I can't recover the portion of
> > > what I have fetched already, but I'd like to start over. When I try to
> > > do this, I get the following error:
> > >
> > >   already exists: segments/20071214095336/fetcher
> > >
> > > What can I do to re-try this? Can I just delete that file and try over?
> > > I have had nothing but trouble performing large fetches with Nutch, but
> > > I can't give up! Please help!
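On the "already exists: segments/20071214095336/fetcher" error: as Dennis suggests, you can keep the fetch list, remove the partial fetcher output, and re-run the fetch over the same segment. A rough sketch, assuming an 0.8-style segment layout where the fetch list lives in crawl_generate, and assuming the segment sits on the local filesystem (on HDFS the same idea applies with the hadoop dfs removal commands):

    # the timestamp is the one from the error message above
    SEG=segments/20071214095336

    ls "$SEG"   # see what the crashed fetch left behind

    # keep crawl_generate (the fetch list), drop everything else
    for d in "$SEG"/*; do
      [ "$(basename "$d")" = "crawl_generate" ] && continue
      rm -r "$d"
    done

    # re-run the fetch over the same segment
    bin/nutch fetch "$SEG"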

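For tips 2 and 4 above, the usual place to override those properties is conf/nutch-site.xml. A minimal sketch using the property names from the thread; the values are only examples and should be tuned for your crawl:

    <?xml version="1.0"?>
    <configuration>

      <!-- Tip 2: cap urls per host in a generated fetch list
           (default is -1, i.e. unlimited); example value only -->
      <property>
        <name>generate.max.per.host</name>
        <value>100</value>
      </property>

      <!-- Tip 4: skip pages whose host asks for a longer crawl delay
           than this many seconds (the thread says the default is 30);
           example value only -->
      <property>
        <name>fetcher.max.crawl.delay</name>
        <value>10</value>
      </property>

    </configuration>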