I figured out what was wrong with bin/nutch parse. On version 0.9 this doesn't work even if you use -noParsing with the bin/nutch fetch2 operation. So I think this is a bug.
With the latest trunk version everything is fine! Maybe it would be time to
release version 0.9.1? ;)

cheers
martin

Quoting Martin Kammerlander <[EMAIL PROTECTED]>:

> Ok, thanks for your help. Yes, when I don't use 'parse' it seems not to
> get me the outlinks.
>
> I now fetch the segment like this:
>
> bin/nutch fetch crawl/segments/current_segment_name -threads 10
>
> After that I try to parse the segment like this:
>
> bin/nutch parse crawl/segments/current_segment_name
>
> But then I get this command-line error in Nutch 0.8.1 as well as in 0.9:
>
> Exception in thread "main" java.io.IOException: Segment already parsed!
>         at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:49)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:279)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:120)
>         at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:138)
>
> The hadoop.log file doesn't show any error at all.
>
> What is this error? Does fetch indeed already parse the segment, or do I
> really have to run parse after the fetch operation?
>
> regards
> martin
>
> Quoting Marc Brette <[EMAIL PROTECTED]>:
>
> > I don't know about 0.8.1, but I think that with 0.9:
> > - the generate call will extract the first 50 URLs from the list.
> > However, in your case, the URLs in the DB are only the ones from your
> > testurl folder. The outlinks have not been extracted yet; this won't
> > happen until you call parse.
> > - I am not 100% sure, but I think the generate call will generate one
> > segment. However, a segment can contain as many URLs (or as large a
> > fetchlist, if you want) as your topN parameter allows.
> > - the fetch call will indeed distribute the URLs in your segment to
> > your 2 threads.
> > On Thu, May 22, 2008 at 6:45 PM, Martin Kammerlander <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Hi
> > >
> > > I have some questions because some things are not that clear to me
> > > (<-- newbie :P )
> > >
> > > I'm using Nutch 0.8.1 currently.
> > >
> > > First:
> > >
> > > bin/nutch inject crawl/crawldb testurl/
> > > bin/nutch generate crawl/crawldb crawl/segments -topN 50 -numFetchers 10
> > >
> > > The first one injects the seed URLs into the WebDB.
> > > The second one creates one single segment out of the seed URLs.
> > > As far as I understood, -numFetchers is deprecated and not in use
> > > anymore, so it seems to have no effect.
> > >
> > > Now my questions: does it always generate just one single segment, or
> > > can there be more? If there can be more segments, what does the
> > > number of created segments depend on?
> > >
> > > Does one single segment contain multiple fetchlists?
> > >
> > > Furthermore: -topN 50 means that if I have, for example, one single
> > > seed URL, and this page is fetched and contains, say, 100 outlinks,
> > > then only 50 of them will be considered for parsing in the next
> > > iteration. Is this correct?
> > >
> > > Second:
> > >
> > > bin/nutch fetch crawl/segments/segment_number -threads 10
> > >
> > > At the moment I assume, and based on testing it should be like that,
> > > that running bin/nutch generate creates one single segment. That
> > > single segment contains various fetchlists. Those fetchlists are then
> > > split across the different threads by the fetch operation (example:
> > > the segment contains 2 fetchlists and we have 2 threads, then one
> > > thread gets one list and the second thread gets the other). Is this
> > > right?
> > >
> > > Thanks for your help!
> > >
> > > best regards
> > > martin
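For reference, the crawl iteration discussed in this thread, with fetching and parsing kept as separate steps, can be sketched roughly as below. This is only a sketch against the 0.8/0.9-style command line; the segment name and the `testurl/` seed folder are placeholders, and you should check `bin/nutch <command>` usage output on your own install before relying on the exact arguments.

```shell
#!/bin/sh
# Sketch of one Nutch 0.8/0.9-style crawl iteration (paths are examples).

# 1. Inject the seed URLs into the crawldb (run once at the start):
bin/nutch inject crawl/crawldb testurl/

# 2. Generate a fetchlist segment from the top-scoring 50 URLs:
bin/nutch generate crawl/crawldb crawl/segments -topN 50

# Pick up the segment directory that generate just created, e.g.
# crawl/segments/20070501120000 (a hypothetical timestamp name):
SEGMENT=crawl/segments/20070501120000

# 3. Fetch WITHOUT parsing; otherwise the segment is marked as parsed
#    and a later `bin/nutch parse` fails with "Segment already parsed!":
bin/nutch fetch "$SEGMENT" -noParsing -threads 10

# 4. Parse as a separate step; this is where outlinks are extracted:
bin/nutch parse "$SEGMENT"

# 5. Update the crawldb with the parsed outlinks so the next `generate`
#    call can pick them up for the following iteration:
bin/nutch updatedb crawl/crawldb "$SEGMENT"
```

Repeating steps 2-5 grows the crawl one hop at a time, which matches the behaviour described above: without the parse step the outlinks never make it into the crawldb, so a second `generate` has nothing new to offer.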
