No, you don't need to; I think you're right that indexing is done
automatically after intranet crawls. Just try "bin/nutch index
yourSegment". If it says that 'index.done exists already', the segment
has already been indexed. I don't know what platform you're using, but
try running "grep -r <some text in your crawled site> *". The grep
command should match in both your segment data and the binary index
that has been built.

I have run into a similar problem, where the web search front end does
not work, but a manual search using the IndexSearcher class does.
Also, try opening your index in Luke if you haven't already. It's a
very handy tool for validating and test-searching your data.
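
If you want both checks in one go, here is a rough sketch. The function
name check_segment is made up for illustration, and I'm assuming the
index.done marker sits directly inside the segment directory, so adjust
the path if your layout differs.

```shell
#!/bin/sh
# check_segment DIR TEXT  (hypothetical helper, not part of Nutch)
# Reports whether DIR contains an index.done marker, then lists the
# files under DIR that contain TEXT.
check_segment() {
    segment_dir=$1
    search_text=$2

    # Assumption: the marker lives directly in the segment directory.
    if [ -e "$segment_dir/index.done" ]; then
        echo "indexed: $segment_dir/index.done exists"
    else
        echo "not indexed: $segment_dir/index.done is missing"
    fi

    # -r recurse, -l print matching file names only, -F literal text
    grep -rlF -- "$search_text" "$segment_dir" \
        || echo "no match for '$search_text' under $segment_dir"
}

# Example call (path and search text are placeholders):
#   check_segment crawl-s/segments/20050722181440 "some text"
```

If the text shows up in the segment data but the web search still
returns nothing, the problem is more likely in the search setup than
in the crawl.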

Good luck,
Fredrik

On 7/23/05, Feng (Michael) Ji <[EMAIL PROTECTED]> wrote:
> hi Fredrik:
> 
> Actually, I use the nutch crawl command as follows:
> "
> bin/nutch crawl urls -dir crawl-s -depth 1 >& crawl-s.log
> "
> I guess I don't need to index explicitly after a
> crawl. Is that right?
> 
> My sample crawl doesn't go deep; it stops at
> the home page of the URL.
> 
> I guess -depth is meant for crawling the pages that are
> linked from the initial page, is that right?
> 
> One thing I found: /segments/ ends up with the same
> number of sub-directories (all named with timestamps)
> as the value of the -depth parameter.
> 
> thanks a lot,
> 
> Michael,
> 
> --- Fredrik Andersson <[EMAIL PROTECTED]>
> wrote:
> 
> > Hi Michael.
> > 
> > Have you indexed the crawl/segment? Easy to forget sometimes : ) Also,
> > check the crawler-tools.xml or whatever it's called, so that ASP pages
> > aren't blocked or anything. The Nutch crawler doesn't by default
> > handle parameters (committees.asp?viewPerson=Ji), I guess that could
> > be an issue as well. No errors or funny stuff in the logs?
> > 
> > Fredrik
> > 
> > On 7/23/05, Feng (Michael) Ji <[EMAIL PROTECTED]>
> > wrote:
> > > Hi there:
> > > 
> > > I have a question about crawling depth vs. search
> > > results. I have attached part of my log output:
> > > 
> > > "
> > > 050722 181508 fetching http://www.committemuse.com/content/committees.asp
> > > :
> > > :
> > > 050722 181508 fetching
> > > :
> > > 050722 181508 status: segment 20050722181440, 100 pages, 4 errors, 1952888 bytes, 26204 ms
> > > "
> > > 
> > > And I can see the segment on my Tomcat box.
> > > 
> > > But when I search for a specific word from that page,
> > > it returns 0 results.
> > > 
> > > Is that because the page is written in "asp"?
> > > 
> > > thanks,
> > > 
> > > Michael,
> > > 
> > 
> 


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers