but then I will be missing many URLs, no? I need a big crawl; I wish I could split big fetch tasks into many smaller fetches.
On Dec 28, 2007 11:57 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> There will always be a single segments folder. Under that, if using the
> DFS, there should be one part-xxxxx folder in each of the segments
> sub-folders (i.e. content) for each reduce task. You can control the
> number of urls to fetch with the -topN option on the command line.
>
> Dennis Kubes
>
> Josh Attenberg wrote:
> > can anyone tell me how to control how many segments get created by
> > generate? I tried mergesegs, but that always seems to crash.
> >
> > On Dec 22, 2007 12:44 PM, Josh Attenberg <[EMAIL PROTECTED]> wrote:
> >
> >> regarding point #1 - in nutch 0.7 & 0.8, how can one specify the number
> >> of urls that go into a segment? I guess an equivalent question is, how
> >> to control how many segments get created by generate?
> >>
> >> On Dec 19, 2007 6:23 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>
> >>> you can delete everything in segments except crawl_generate, then start
> >>> the fetch again. Here are some hints about fetching:
> >>>
> >>> 1) if you can, do smaller fetches in the 1-5 million page range, then
> >>> aggregate them together. that way, if something goes wrong, you haven't
> >>> lost a lot.
> >>>
> >>> 2) setting generate.max.per.host to a small number instead of -1 will
> >>> make fetches run faster but won't get all pages from a single site.
> >>> This is good if you are doing general web crawls.
> >>>
> >>> 3) avoid using regex-urlfilter if possible; the prefix and suffix url
> >>> filters tend to work much better and don't cause stalls.
> >>>
> >>> 4) setting fetcher.max.crawl.delay (default is 30 seconds) will skip
> >>> pages with long crawl delays. again, this makes fetching go faster.
> >>>
> >>> Hope this helps.
> >>>
> >>> Dennis Kubes
> >>>
> >>> Josh Attenberg wrote:
> >>>> I was fetching a large segment, ~30 million urls, when something weird
> >>>> happened and the fetcher crashed. I know I can't recover the portion of
> >>>> what I have fetched already, but I'd like to start over. When I try to
> >>>> do this, I get the following error: already exists:
> >>>> segments/20071214095336/fetcher
> >>>> what can I do to re-try this? can I just delete that file and try over?
> >>>> I have had nothing but trouble performing large fetches with nutch, but
> >>>> I can't give up! please help!
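
For what it's worth, the split-crawl approach Dennis describes above (generate a capped fetch list with -topN, fetch it, fold the results back into the crawldb, repeat) might look roughly like the sketch below. This is only an illustration, assuming a 0.8/0.9-style bin/nutch script; the crawl/crawldb and crawl/segments paths, the 1,000,000 topN value, and the five rounds are made-up examples, not values from this thread:

  # sketch only: crawl/crawldb and crawl/segments are example paths
  for round in 1 2 3 4 5; do
    # generate a fetch list capped at 1M urls instead of one huge segment
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
    # fetch the newest segment that generate just created
    # (segment names are timestamps, so they sort lexicographically)
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    # merge the fetch results back into the crawldb before the next round
    bin/nutch updatedb crawl/crawldb $segment
  done

The generate.max.per.host and fetcher.max.crawl.delay settings from points 2 and 4 are ordinary Nutch configuration properties, so they would normally be overridden in conf/nutch-site.xml rather than passed on the command line.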
