Hi,

Say I injected 5 URLs into the db, then generated segments for them, and then ran fetch to get the pages. Where in this procedure can I define the depth of links to follow (1, 2, etc.), and if I do not define a depth, how will fetch behave? What would be the results of the following:

nutch generate db segments -topN 5

and

nutch generate db segments
In the above, will the first one have only the 5 URL pages in the segment to be fetched (depth level 0, just the pages themselves), and will the second one have more than 5 URL pages to be fetched (going beyond depth level 0)?

Thank you very much,

Jenny

eyal edri <[EMAIL PROTECTED]> wrote:

Hi,

You have two options:

1. Use the crawl command (mainly for intranet use) with the -depth arg.
2. If you're using the broken-down commands (inject --> generate --> fetch ...), write a small bash script to do that (available on the nutch wiki).

Hope this helps.

eyal.

On 9/5/07, Jenny LIU wrote:
>
> When I do fetch, nutch only gives me depth level 0, which is the
> website home page. How can I get nutch to fetch deeper than that, so
> that it follows the links in the home page and fetches those pages also?
> Any ideas please?
>
> Thanks a lot,
>
> Jenny
>
> Carl Cerecke wrote:
> I have solved this problem by opening the index using
> org.apache.lucene.index.IndexReader to read all the Documents contained
> therein and creating a map from url to segment and document id. I can
> then use SegmentReader to get the contents for that url.
>
> Ugly, but it works.
> Carl.
>
> Carl Cerecke wrote:
> > This looks like it should work, but how can I get lucene-search to do an
> > exact match for the URL?
> >
> > I've tried:
> > bin/nutch org.apache.nutch.searcher.NutchBean url:
> > but I can't get it to work accurately no matter how I mangle and quote
> > what I put in.
> >
> > I've tried Luke also, but I can't get that to exactly match a url
> > either. Despite the fact that when searching for url:foo I can see,
> > among the matches, http://www.foo.co.nz, I don't seem to be able to
> > specifically match that (and only that) url in the general case.
> >
> > Perhaps it is because the url is parsed into bits and not indexed as a
> > whole string, including punctuation? This is despite the fact that the
> > punctuation seems very much preserved intact in the index file
> > crawl/indexes/part-00000/_mq0.fdt
> >
> > To work around this, I notice that all indexed documents have a document
> > ID. If I could map the url to the document ID, and from there get the
> > document, then that would be suitable. Any ideas?
> >
> > Cheers,
> > Carl.
> >
> > Robeyns Bart wrote:
> >> The segment is recorded as a field in the Lucene index. One easy way
> >> to do it would be to:
> >> - do a Lucene search for the url,
> >> - read the "segment" field from the resulting Lucene Document, and
> >> - call SegmentReader with this value as the segment argument.
> >>
> >> Bart Robeyns
> >> Panoptic
> >>
> >> -----Original Message-----
> >> From: Carl Cerecke [mailto:[EMAIL PROTECTED]
> >> Sent: Thu 8/30/2007 6:30
> >> To: [email protected]
> >> Subject: Getting page information given the URL
> >>
> >> Hi,
> >>
> >> How do I get the page information from whichever segment it is in,
> >> given a URL?
> >>
> >> I'm basically looking for a class to use from the command line which,
> >> given an index and a url, returns the information for that url from
> >> whichever segment it is in. Similar to SegmentReader -get, but without
> >> having to specify the segment.
> >>
> >> This seems like it should be relatively simple to do, but it has
> >> evaded me thus far...
> >>
> >> Is the best approach to merge all the segments (hundreds of them) into
> >> one big segment? Would this work? What would the performance be like
> >> for this approach?
> >>
> >> Cheers,
> >> Carl.
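For reference, here is a minimal sketch of the kind of script Eyal mentions, assuming the Nutch 0.8/0.9-style command-line tools, a crawldb named "db" and a segments directory named "segments" as in Jenny's commands, and seed URLs in a directory named "urls". These names and the exact arguments are assumptions and may differ between Nutch versions:

    #!/bin/bash
    # Sketch only: a depth-controlled crawl using the broken-down commands.
    # "db", "segments" and "urls" are assumed names; adjust to your layout.
    DEPTH=3    # how many link levels to follow beyond the seed pages
    TOPN=5     # cap on pages generated per round (omit -topN for no cap)

    bin/nutch inject db urls

    for ((i = 1; i <= DEPTH; i++)); do
        # each generate/fetch/updatedb round goes one link level deeper
        bin/nutch generate db segments -topN $TOPN
        segment=$(ls -d segments/* | tail -1)
        bin/nutch fetch "$segment"
        bin/nutch updatedb db "$segment"
    done

As far as the two generate commands go: on the first round they should behave the same, since only the 5 injected URLs are in the db. The difference appears after updatedb has added the outlinks of fetched pages, at which point "generate db segments" selects every URL currently due for fetching, while "-topN 5" caps each round at the 5 highest-scoring ones.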
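On the URL-lookup sub-thread: once the segment containing a given URL is known, whether found via Bart's Lucene-search suggestion or via Carl's url-to-segment map, the stored data for that URL can be pulled from the command line with SegmentReader. A hypothetical invocation, with a made-up segment name and URL standing in for real values (in Nutch 0.8+ the class is also exposed as the "readseg" command; option names may vary by version):

    # placeholder segment directory and URL -- substitute your own
    bin/nutch readseg -get crawl/segments/20070830123456 http://www.foo.co.nz/

This only covers the last step; working out which segment the URL lives in still needs the index lookup described earlier in the thread.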
