I figured out what was wrong with bin/nutch parse. On version 0.9 this doesn't work even if you use -noParsing with the bin/nutch fetch2 operation. So I think this is a bug.
With the latest trunk version everything is fine! Maybe it would be time to
release version 0.9.1? ;)

cheers
martin

Quoting Martin Kammerlander <[EMAIL PROTECTED]>:

> Ok, thanks for your help. Yes, when I don't use 'parse' it seems not to
> get me the outlinks.
>
> I now fetch the segment like this:
>
> bin/nutch fetch crawl/segments/current_segment_name -threads 10
>
> After that I try to parse the segment like this:
>
> bin/nutch parse crawl/segments/current_segment_name
>
> But then I get this command-line error in Nutch 0.8.1 as well as in 0.9:
>
> Exception in thread "main" java.io.IOException: Segment already parsed!
>         at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:49)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:279)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:120)
>         at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:138)
>
> The hadoop.log file doesn't show any error at all.
>
> What is this error? Does fetch indeed already parse the segment, or do I
> really have to run parse after the fetch operation?
>
> regards
> martin
>
> Quoting Marc Brette <[EMAIL PROTECTED]>:
>
> > I don't know about 0.8.1, but I think that with 0.9:
> > - the generate call will extract the first 50 URLs from the list.
> > However, in your case, the URLs in the DB are only the ones from your
> > testurl folder. The outlinks have not been extracted yet; this won't
> > happen until you call parse.
> > - I am not 100% sure, but I think the generate call will generate one
> > segment. However, a segment can contain as many URLs (or as large a
> > fetchlist, if you want) as your topN parameter allows.
> > - the fetch call will indeed distribute the URLs in your segment to
> > your 2 threads.
> > On Thu, May 22, 2008 at 6:45 PM, Martin Kammerlander <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Hi
> > >
> > > I have some questions because some things are not that clear to me
> > > (<-- newbie :P )
> > >
> > > I'm using Nutch 0.8.1 currently.
> > >
> > > First:
> > >
> > > bin/nutch inject crawl/crawldb testurl/
> > > bin/nutch generate crawl/crawldb crawl/segments -topN 50 -numFetchers 10
> > >
> > > The first one injects the seed URLs into the WebDB.
> > > The second one creates one single segment out of the seed URLs.
> > > As far as I understood, -numFetchers is deprecated and not in use
> > > anymore, so it seems to have no effect.
> > >
> > > Now my questions: does it always generate just one single segment, or
> > > can there be more? If there can be more segments, what does the
> > > number of created segments depend on?
> > >
> > > Does one single segment contain multiple fetchlists?
> > >
> > > Furthermore: -topN 50 means that if I have, for example, one single
> > > seed URL, and this page is fetched and contains, say, 100 outlinks,
> > > then only 50 of them will be considered for parsing in the next
> > > iteration. Is this correct?
> > >
> > > Second:
> > >
> > > bin/nutch fetch crawl/segments/segment_number -threads 10
> > >
> > > At the moment I assume, and based on testing it should be like that,
> > > that running bin/nutch generate creates one single segment. That
> > > single segment contains various fetchlists. Those fetchlists are then
> > > split across the different threads by the fetch operation (example:
> > > the segment contains 2 fetchlists and we have 2 threads, then one
> > > thread gets one list and the second thread gets the other). Is this
> > > right?
> > >
> > > Thanks for your help!
> > >
> > > best regards
> > > martin
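For reference, the crawl iteration discussed in this thread, with fetching and parsing kept as separate steps, can be sketched roughly as below. This is only a sketch against the 0.8/0.9-style command line; the segment name and the `testurl/` seed folder are placeholders, and you should check `bin/nutch <command>` usage output on your own install before relying on the exact arguments.

```shell
#!/bin/sh
# Sketch of one Nutch 0.8/0.9-style crawl iteration (paths are examples).

# 1. Inject the seed URLs into the crawldb (run once at the start):
bin/nutch inject crawl/crawldb testurl/

# 2. Generate a fetchlist segment from the top-scoring 50 URLs:
bin/nutch generate crawl/crawldb crawl/segments -topN 50

# Pick up the segment directory that generate just created, e.g.
# crawl/segments/20070501120000 (a hypothetical timestamp name):
SEGMENT=crawl/segments/20070501120000

# 3. Fetch WITHOUT parsing; otherwise the segment is marked as parsed
#    and a later `bin/nutch parse` fails with "Segment already parsed!":
bin/nutch fetch "$SEGMENT" -noParsing -threads 10

# 4. Parse as a separate step; this is where outlinks are extracted:
bin/nutch parse "$SEGMENT"

# 5. Update the crawldb with the parsed outlinks so the next `generate`
#    call can pick them up for the following iteration:
bin/nutch updatedb crawl/crawldb "$SEGMENT"
```

Repeating steps 2-5 grows the crawl one hop at a time, which matches the behaviour described above: without the parse step the outlinks never make it into the crawldb, so a second `generate` has nothing new to offer.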
