Hi,

Say I injected 5 URLs into the db, then generated segments for them, and then ran fetch to get the pages. Where in this procedure can I define the depth of links to follow (1, 2, etc.), and if I do not define a depth, how will fetch behave? What would be the results of the following:

nutch generate db segments -topN 5

and

nutch generate db segments
In the above, will the first one have only the 5 URL pages in the segment to be fetched (depth level 0, just the pages themselves), and will the second one have more than 5 URL pages to be fetched (going beyond depth level 0)?

Thank you very much,

Jenny

eyal edri <[EMAIL PROTECTED]> wrote:

Hi,

You have two options:

1. Use the crawl command (mainly for intranet use) with the -depth arg.
2. If you're using the broken-down commands (inject --> generate --> fetch ...), write a small bash script to do that (available on the nutch wiki).

Hope this helps.

eyal.

On 9/5/07, Jenny LIU wrote:
>
> When I do fetch, nutch only gives me depth level 0, which is the
> website home page. How can I get nutch to fetch deeper than that, so
> that it follows the links in the home page and fetches those pages also?
> Any ideas please?
>
> Thanks a lot,
>
> Jenny
>
> Carl Cerecke wrote:
> I have solved this problem by opening the index using
> org.apache.lucene.index.IndexReader to read all the Documents contained
> therein and creating a map from url to segment and document id. I can
> then use SegmentReader to get the contents for that url.
>
> Ugly, but it works.
> Carl.
>
> Carl Cerecke wrote:
> > This looks like it should work, but how can I get lucene-search to do an
> > exact match for the URL?
> >
> > I've tried:
> > bin/nutch org.apache.nutch.searcher.NutchBean url:
> > but I can't get it to work accurately no matter how I mangle and quote
> > what I put in.
> >
> > I've tried Luke also, but I can't get that to exactly match a url
> > either. Despite the fact that when searching for url:foo I can see,
> > among the matches, http://www.foo.co.nz, I don't seem to be able to
> > specifically match that (and only that) url in the general case.
> >
> > Perhaps it is because the url is parsed into bits and not indexed as a
> > whole string, including punctuation? This is despite the fact that the
> > punctuation seems very much preserved intact in the index file
> > crawl/indexes/part-00000/_mq0.fdt
> >
> > To work around this, I notice that all indexed documents have a document
> > ID. If I could map the url to the document ID, and from there get the
> > document, then that would be suitable. Any ideas?
> >
> > Cheers,
> > Carl.
> >
> > Robeyns Bart wrote:
> >> The segment is recorded as a field in the Lucene index. One easy way
> >> to do it would be to:
> >> - do a Lucene search for the url,
> >> - read the "segment" field from the resulting Lucene Document, and
> >> - call SegmentReader with this value as the segment argument.
> >>
> >> Bart Robeyns
> >> Panoptic
> >>
> >> -----Original Message-----
> >> From: Carl Cerecke [mailto:[EMAIL PROTECTED]
> >> Sent: Thu 8/30/2007 6:30
> >> To: [email protected]
> >> Subject: Getting page information given the URL
> >>
> >> Hi,
> >>
> >> How do I get the page information from whichever segment it is in,
> >> given a URL?
> >>
> >> I'm basically looking for a class to use from the command line which,
> >> given an index and a url, returns the information for that url from
> >> whichever segment it is in. Similar to SegmentReader -get, but without
> >> having to specify the segment.
> >>
> >> This seems like it should be relatively simple to do, but it has
> >> evaded me thus far...
> >>
> >> Is the best approach to merge all the segments (hundreds of them) into
> >> one big segment? Would this work? What would the performance be like
> >> for this approach?
> >>
> >> Cheers,
> >> Carl.
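For reference, here is a minimal sketch of the kind of script Eyal mentions, assuming the Nutch 0.8/0.9-style command-line tools, a crawldb named "db" and a segments directory named "segments" as in Jenny's commands, and seed URLs in a directory named "urls". These names and the exact arguments are assumptions and may differ between Nutch versions:

    #!/bin/bash
    # Sketch only: a depth-controlled crawl using the broken-down commands.
    # "db", "segments" and "urls" are assumed names; adjust to your layout.
    DEPTH=3    # how many link levels to follow beyond the seed pages
    TOPN=5     # cap on pages generated per round (omit -topN for no cap)

    bin/nutch inject db urls

    for ((i = 1; i <= DEPTH; i++)); do
        # each generate/fetch/updatedb round goes one link level deeper
        bin/nutch generate db segments -topN $TOPN
        segment=$(ls -d segments/* | tail -1)
        bin/nutch fetch "$segment"
        bin/nutch updatedb db "$segment"
    done

As far as the two generate commands go: on the first round they should behave the same, since only the 5 injected URLs are in the db. The difference appears after updatedb has added the outlinks of fetched pages, at which point "generate db segments" selects every URL currently due for fetching, while "-topN 5" caps each round at the 5 highest-scoring ones.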
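On the URL-lookup sub-thread: once the segment containing a given URL is known, whether found via Bart's Lucene-search suggestion or via Carl's url-to-segment map, the stored data for that URL can be pulled from the command line with SegmentReader. A hypothetical invocation, with a made-up segment name and URL standing in for real values (in Nutch 0.8+ the class is also exposed as the "readseg" command; option names may vary by version):

    # placeholder segment directory and URL -- substitute your own
    bin/nutch readseg -get crawl/segments/20070830123456 http://www.foo.co.nz/

This only covers the last step; working out which segment the URL lives in still needs the index lookup described earlier in the thread.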
