When I run a fetch, Nutch only gives me depth level 0, which is the website home page. How can I get Nutch to fetch deeper than that, i.e. follow the links on the home page and fetch those pages as well? Any ideas please?

Thanks a lot,
Jenny
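The depth is controlled by how many generate/fetch/updatedb rounds you run, or by the -depth flag of the one-shot crawl command. A sketch, assuming a Nutch 0.8/0.9-style install with a urls/ seed directory (adjust paths and -topN to your setup):

```shell
# One-shot crawl: -depth controls how many link levels beyond the
# seed pages are fetched (3 = home page plus two levels of links).
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

# Equivalent manual cycle: each generate/fetch/updatedb round goes
# one level deeper into the link graph.
bin/nutch inject crawl/crawldb urls
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
done
```

With a single fetch round and no updatedb, only the injected seed URLs are fetched, which matches the "depth 0 only" behaviour described above.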
Carl Cerecke <[EMAIL PROTECTED]> wrote:

I have solved this problem by opening the index using
org.apache.lucene.index.IndexReader to read all the Documents contained
therein and creating a map from url to segment and document id. I can
then use SegmentReader to get the contents for that url. Ugly, but it
works.

Carl.

Carl Cerecke wrote:
> This looks like it should work, but how can I get lucene-search to do
> an exact match for the URL?
>
> I've tried:
>   bin/nutch org.apache.nutch.searcher.NutchBean url:
> but I can't get it to work accurately no matter how I mangle and quote
> what I put in.
>
> I've tried Luke also, but I can't get that to exactly match a url
> either. Despite the fact that when searching for url:foo I can see,
> among the matches, http://www.foo.co.nz, I don't seem to be able to
> specifically match that (and only that) url in the general case.
>
> Perhaps it is because the url is parsed into bits and not indexed as a
> whole string, including punctuation? This is despite the fact that the
> punctuation seems very much preserved intact in the index file
> crawl/indexes/part-00000/_mq0.fdt
>
> To work around this, I notice that all documents indexed have a
> document ID. If I could map the url to the document ID, and from there
> get the document, then that would be suitable. Any ideas?
>
> Cheers,
> Carl.
>
> Robeyns Bart wrote:
>> The segment is recorded as a field in the Lucene index. One easy way
>> to do it would be to:
>> - do a Lucene search for the url,
>> - read the "segment" field from the resulting Lucene Document, and
>> - call SegmentReader with this value as the segment argument.
>>
>> Bart Robeyns
>> Panoptic
>>
>> -----Original Message-----
>> From: Carl Cerecke [mailto:[EMAIL PROTECTED]
>> Sent: Thu 8/30/2007 6:30
>> To: [email protected]
>> Subject: Getting page information given the URL
>>
>> Hi,
>>
>> How do I get the page information from whichever segment it is in,
>> given a URL?
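Carl's workaround can be sketched roughly as below, using Lucene 1.9/2.x-era APIs of the kind Nutch shipped with at the time. The index path, the "url" and "segment" field names, and the class name UrlToSegment are assumptions for illustration; treat this as a sketch, not tested code:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

// Walk every document in the index and build url -> segment and
// url -> doc-id maps, sidestepping the tokenized "url" field that
// defeats exact-match queries.
public class UrlToSegment {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("crawl/indexes/part-00000");
        Map urlToSegment = new HashMap();  // url -> segment name
        Map urlToDocId = new HashMap();    // url -> Lucene doc id
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue;  // skip deleted docs
            Document doc = reader.document(i);
            urlToSegment.put(doc.get("url"), doc.get("segment"));
            urlToDocId.put(doc.get("url"), new Integer(i));
        }
        reader.close();

        String url = args[0];
        System.out.println(url + " -> segment " + urlToSegment.get(url)
                + ", doc " + urlToDocId.get(url));
    }
}
```

The segment name looked up this way can then be handed to SegmentReader -get as Bart suggested, with the url as the key.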
>>
>> I'm basically looking for a class to use from the command line which,
>> given an index and a url, returns me the information for that url
>> from whichever segment it is in. Similar to SegmentReader -get, but
>> without having to specify the segment.
>>
>> This seems like it should be relatively simple to do, but it has
>> evaded me thus far...
>>
>> Is the best approach to merge all the segments (hundreds of them)
>> into one big segment? Would this work? What would the performance be
>> like for this approach?
>>
>> Cheers,
>> Carl.
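On the merge question: Nutch 0.8/0.9 ships a segment merger, so one way to avoid specifying the segment is to merge them all and read from the merged one. A hedged sketch (tool names and argument order as in Nutch 0.9; run bin/nutch with no arguments to confirm the exact usage in your version, and the paths here are placeholders):

```shell
# Merge all segments under crawl/segments into one segment in crawl/MERGED
bin/nutch mergesegs crawl/MERGED -dir crawl/segments

# Then dump the entry for a single URL from the merged segment
bin/nutch readseg -get crawl/MERGED/* http://www.foo.co.nz/
```

Merging hundreds of segments is a heavy MapReduce job and must be re-run after each crawl round, so for repeated lookups the in-memory url-to-segment map described above is likely the lighter option.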
