Hi Fredrik,

After crawling with Nutch, I copied the segments directory to the Tomcat root.
I wonder if I need to do the same thing for the index and db directories.

Thanks,
Michael

--- Fredrik Andersson <[EMAIL PROTECTED]> wrote:

> No, I think you're right that indexing is done automatically after
> intranet crawls. Just try "bin/nutch index yourSegment", if it says
> that 'index.done exists already', then well.. you get the point. I
> don't know what platform you're using, but try doing a
> "grep -r <some text in your crawled site> *". The grep command
> should match on both your segment data and the binary index that
> have been built.
> I have run in to a similar problem, where the Websearch thingie does
> not work, but a manual search using the IndexSearcher class does work.
> Also, try opening your index from the LUKE program if you haven't
> already. It's a very handy tool for validating and test-searching
> your data.
>
> Good luck,
> Fredrik
>
> On 7/23/05, Feng (Michael) Ji <[EMAIL PROTECTED]> wrote:
> > hi Fredrik:
> >
> > Actually, I use nutch/crawl command as following:
> > "
> > bin/nutch crawl urls -dir crawl-s -depth 1 >& crawl-s.log
> > "
> > I guess I don't need to do index explicitly after crawl. Is it
> > right?
> >
> > My sample crawling doesn't go deeply and only stop at the home
> > page of the URL.
> >
> > I guess the -depth is defined for crawling the website which is
> > pointed out from initial page, is it right?
> >
> > One thing I found, the result in /segments/ will have the same
> > number of sub-dir (which are all time stamped number) as the
> > -depth parameter,
> >
> > thanks a lot,
> >
> > Michael,
> >
> > --- Fredrik Andersson <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Michael.
> > >
> > > Have you indexed the crawl/segment? Easy to forget sometimes : )
> > > Also, check the crawler-tools.xml or whatever it's called, so
> > > that ASP pages aren't blocked or anything. The Nutch crawler
> > > doesn't by default handle parameters
> > > (committees.asp?viewPerson=Ji), I guess that could be an issue
> > > as well. No errors or funny stuff in the logs?
> > >
> > > Fredrik
> > >
> > > On 7/23/05, Feng (Michael) Ji <[EMAIL PROTECTED]> wrote:
> > > > Hi there:
> > > >
> > > > I have a question about the crawling depth VS search result.
> > > > I attached part of my log information;
> > > >
> > > > "
> > > > 050722 181508 fetching http://www.committemuse.com/content/committees.asp
> > > > :
> > > > :
> > > > 050722 181508 fetching
> > > > :
> > > > 050722 181508 status: segment 20050722181440, 100 pages, 4 errors, 1952888 bytes, 26204 ms
> > > > "
> > > >
> > > > And I see segment in my tomcat box.
> > > >
> > > > But when I do search the specific word in that page, it
> > > > return 0.
> > > >
> > > > Is that because the page is written in "asp"?
> > > >
> > > > thanks,
> > > >
> > > > Michael,

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
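
For anyone following the thread on a platform without grep: the recursive search Fredrik suggests (grep -r <some text> *) can be approximated with a short Python script. This is only a sketch, not part of Nutch. The function name find_files_containing, the directory name crawl-s (taken from the crawl command quoted above) and the search text are illustrative assumptions. It compares raw bytes, so text that is stored compressed or in another encoding will not match, and a miss does not prove the page was never fetched.

```python
import os


def find_files_containing(root, needle):
    """Return the sorted paths under root whose raw bytes contain needle."""
    needle_bytes = needle.encode("utf-8")
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Read as bytes: segment and index files are binary.
                # Each file is loaded whole, which is fine for a small crawl.
                with open(path, "rb") as handle:
                    if needle_bytes in handle.read():
                        matches.append(path)
            except OSError:
                # Skip files that cannot be read and keep searching.
                continue
    return sorted(matches)


if __name__ == "__main__":
    # "crawl-s" is the -dir value from the thread; the text is a placeholder.
    for hit in find_files_containing("crawl-s", "some text in your crawled site"):
        print(hit)
```

If matches show up under the segments directory but not under the index directory, that would suggest the pages were fetched but the text never reached the index.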
