No, I think you're right that indexing is done automatically after intranet crawls. Just try "bin/nutch index yourSegment"; if it says that 'index.done exists already', then, well, you get the point.

I don't know what platform you're using, but try doing a "grep -r <some text in your crawled site> *". The grep command should match in both your segment data and the binary index that has been built. I have run into a similar problem, where the web search front end does not work but a manual search using the IndexSearcher class does.

Also, try opening your index in Luke if you haven't already. It's a very handy tool for validating and test-searching your data.
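If it helps, the two checks above (is there an index.done marker, and does grep find the text) could be scripted roughly like this. It is only a sketch: the function names check_segments and find_text are made up for illustration, and the layout <crawl_dir>/segments/<timestamp>/index.done is my assumption based on the message Nutch prints, so adjust it to what your version actually writes.

```shell
#!/bin/sh

# Hypothetical helper: report, for each segment under a crawl directory,
# whether an index.done marker is present.
# ASSUMPTION: segments live in <crawl_dir>/segments/<timestamp>/ and the
# marker file is called index.done; verify this against your Nutch version.
check_segments() {
    crawl_dir="$1"
    for seg in "$crawl_dir"/segments/*/; do
        # Skip the unexpanded pattern when there are no segments at all.
        [ -d "$seg" ] || continue
        name=$(basename "$seg")
        if [ -e "$seg/index.done" ]; then
            echo "$name indexed"
        else
            echo "$name not-indexed"
        fi
    done
}

# Hypothetical helper: list every file under a directory that contains
# the given text. -r recurses, -l prints only file names, -F treats the
# text as a fixed string rather than a regular expression.
find_text() {
    grep -rlF -- "$2" "$1"
}
```

With your crawl directory that would be something like "check_segments crawl-s" followed by "find_text crawl-s 'some text from the page'". If the second command lists nothing, the page text most likely never made it into the segment in a form grep can see.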
Good luck,
Fredrik

On 7/23/05, Feng (Michael) Ji <[EMAIL PROTECTED]> wrote:
> hi Fredrik:
>
> Actually, I use nutch/crawl command as following:
> "
> bin/nutch crawl urls -dir crawl-s -depth 1 >& crawl-s.log
> "
> I guess I don't need to do index explicitly after crawl. Is it right?
>
> My sample crawling doesn't go deeply and only stop at the home page of the URL.
>
> I guess the -depth is defined for crawling the website which is pointed out from initial page, is it right?
>
> One thing I found, the result in /segments/ will have the same number of sub-dir (which are all time stamped number) as the -depth parameter,
>
> thanks a lot,
>
> Michael,
>
> --- Fredrik Andersson <[EMAIL PROTECTED]> wrote:
>
> > Hi Michael.
> >
> > Have you indexed the crawl/segment? Easy to forget sometimes : ) Also, check the crawler-tools.xml or whatever it's called, so that ASP pages aren't blocked or anything. The Nutch crawler doesn't by default handle parameters (committees.asp?viewPerson=Ji), I guess that could be an issue as well. No errors or funny stuff in the logs?
> >
> > Fredrik
> >
> > On 7/23/05, Feng (Michael) Ji <[EMAIL PROTECTED]> wrote:
> > > Hi there:
> > >
> > > I have a question about the crawling depth VS search result. I attached part of my log information;
> > >
> > > "
> > > 050722 181508 fetching http://www.committemuse.com/content/committees.asp
> > > :
> > > :
> > > 050722 181508 fetching
> > > :
> > > 050722 181508 status: segment 20050722181440, 100 pages, 4 errors, 1952888 bytes, 26204 ms
> > > "
> > >
> > > And I see segment in my tomcat box.
> > >
> > > But when I do search the specific word in that page, it return 0.
> > >
> > > Is that because the page is written in "asp"?
> > >
> > > thanks,
> > >
> > > Michael,

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
