It's not needed; you use the bin/nutch script to generate the initial crawl.
Details here: http://lucene.apache.org/nutch/tutorial8.html#Intranet+Crawling
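For an intranet crawl that comes down to a single command along these
lines (the urls/ seed directory, the crawl output directory and the
threads/depth/topN values below are just example settings, not
requirements):

  bin/nutch crawl urls -dir crawl -threads 2 -depth 3 -topN 50

where urls/ holds your seed list files and crawl/ ends up holding the
crawldb, linkdb, segments and index.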
Fred Tyre wrote:
> First of all, thanks for the recrawl script.
> I believe it will save me a few headaches.
>
> Secondly, is there a reason that there isn't a crawl script posted
> on the FAQ?
>
> As far as I can tell, you could take your recrawl script and add in
> the following line after you set up the crawl subdirectories:
> $FT_NUTCH_BIN/nutch crawl urls -dir $crawl_dir -threads 2 -depth 3 -topN 50
>
> Obviously, the threads, depth and topN could be parameters as well.
>
> Thanks again.
>
> -----Original Message-----
> From: Matthew Holt [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 08, 2006 2:00 PM
> To: [email protected]; [email protected]
> Subject: Re: [Fwd: Re: 0.8 Recrawl script updated]
>
> Since it wasn't really clear whether my script approached the problem
> of deleting segments correctly, I refactored it so it generates the
> new number of segments, merges them into one, then deletes the "new"
> segments. Not as efficient disk-space-wise, but it still removes a
> large number of the segments that are not being referenced by
> anything because they have not been indexed yet.
>
> I reupdated the wiki. Unless there is any more clarification
> regarding the issue, hopefully I won't have to bombard your inbox
> with any more emails regarding this.
>
> Matt
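Side note: the merge-then-delete step Matt describes above boils down
to something like this rough sketch (it assumes the default crawl/
layout; the script on the wiki, linked at the end of the thread, is
the real thing):

  # merge all per-round segments under crawl/segments into one segment
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments

  # drop the now-merged originals and keep only the merged segment
  rm -rf crawl/segments/*
  mv crawl/MERGEDsegments/* crawl/segments/
  rmdir crawl/MERGEDsegments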
> Lukas Vlcek wrote:
>> Hi again,
>>
>> I just found a related discussion here:
>> http://www.nabble.com/NullPointException-tf2045994r1.html
>>
>> I think these guys are discussing a similar problem, and if I
>> understood the conclusion correctly, the only solution right now is
>> to write some code and test which segments are used in the index and
>> which are not.
>>
>> Regards,
>> Lukas
>>
>> On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>>> Matthew,
>>>
>>> In fact I didn't realize you are doing the merge stuff (sorry for
>>> that), but frankly I don't know how exactly merging works, whether
>>> this strategy would hold up over the long term, and whether it is a
>>> universal approach for all the variability of cases which may occur
>>> during crawling (-topN, threads frozen, pages unavailable, crawling
>>> dies, ... etc.); maybe it is the correct path. I would appreciate it
>>> if anybody could answer this question precisely.
>>>
>>> Thanks,
>>> Lukas
>>>
>>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>>> If anyone doesn't mind taking a look...
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Matthew Holt <[EMAIL PROTECTED]>
>>>> To: [email protected]
>>>> Date: Fri, 04 Aug 2006 10:07:57 -0400
>>>> Subject: Re: 0.8 Recrawl script updated
>>>>
>>>> Lukas,
>>>> Thanks for your e-mail. I assumed I could drop the $depth number of
>>>> oldest segments because I first merged them all into one segment
>>>> (which I don't drop). Am I incorrect in my assumption, and can this
>>>> cause problems in the future? If so, I'll go back to the original
>>>> version of my script, which kept all the segments without merging.
>>>> However, it seemed like that would become a problem after enough
>>>> recrawls due to the large number of segments being kept.
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> Lukas Vlcek wrote:
>>>>> Hi Matthew,
>>>>>
>>>>> I am curious about one thing. How do you know you can just drop
>>>>> the $depth number of oldest segments in the end? I haven't studied
>>>>> the nutch code regarding this topic yet, but I thought that a
>>>>> segment can be dropped only once you are sure that all its content
>>>>> has already been crawled into some newer segments (which should be
>>>>> checked somehow via some function/script, which hasn't been
>>>>> implemented yet to my knowledge). Also, I don't think this
>>>>> question has been discussed on the dev/user lists in detail yet,
>>>>> so I just wanted to ask for your opinion. The situation could get
>>>>> even more complicated if people add the -topN parameter to the
>>>>> script (which can happen because some might prefer crawling in ten
>>>>> smaller bunches over two huge crawls, for various technical
>>>>> reasons).
>>>>>
>>>>> Anyway, never mind if you don't want to bother with my silly
>>>>> question :-)
>>>>>
>>>>> Regards,
>>>>> Lukas
>>>>>
>>>>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>>>>> Last email regarding this script. I found a bug in it that is
>>>>>> sporadic (I think it only affected certain setups). However,
>>>>>> since it would be a problem sometimes, I refactored the script.
>>>>>> I'd suggest you redownload the script if you are using it.
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>> Matthew Holt wrote:
>>>>>>> I'm currently pretty busy at work. If I have time I'll do it
>>>>>>> later.
>>>>>>>
>>>>>>> The version 0.8 recrawl script has a working version online now.
>>>>>>> I temporarily modified it on the website yesterday when I ran
>>>>>>> into some problems, but I have since tested it further and the
>>>>>>> actual working code is up now. So if you got it off the web site
>>>>>>> any time yesterday, I would redownload the script.
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>> Lourival JĂșnior wrote:
>>>>>>>> Hi Matthew!
>>>>>>>>
>>>>>>>> Could you update the script to version 0.7.2 with the same
>>>>>>>> functionality? I wrote a script that does this, but it doesn't
>>>>>>>> work very well...
>>>>>>>>
>>>>>>>> Regards!
>>>>>>>>
>>>>>>>> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>>>>>>>> Just letting everyone know that I updated the recrawl script
>>>>>>>>> on the Wiki. It now merges the created segments, then deletes
>>>>>>>>> the old segs to prevent a lot of unneeded data
>>>>>>>>> remaining/growing on the hard drive.
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
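For anyone finding this thread later, the overall recrawl flow under
discussion comes down to roughly the following. This is only a sketch;
the paths, the depth value and the reindexing step at the end are my
assumptions, and the script maintained on the wiki page above is the
authoritative version.

  #!/bin/bash
  # Sketch of a 0.8-style recrawl: run $depth generate/fetch/updatedb
  # rounds, merge the resulting segments into one, then reindex.
  crawl_dir=crawl
  depth=3

  for ((i = 0; i < depth; i++)); do
    bin/nutch generate $crawl_dir/crawldb $crawl_dir/segments
    # fetch the segment that generate just created (the newest one)
    segment=`ls -d $crawl_dir/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb $crawl_dir/crawldb $segment
  done

  # merge everything into a single segment and drop the unmerged ones
  bin/nutch mergesegs $crawl_dir/MERGEDsegments -dir $crawl_dir/segments
  rm -rf $crawl_dir/segments/*
  mv $crawl_dir/MERGEDsegments/* $crawl_dir/segments/
  rmdir $crawl_dir/MERGEDsegments

  # rebuild the link database and the index from the merged segment
  bin/nutch invertlinks $crawl_dir/linkdb -dir $crawl_dir/segments
  rm -rf $crawl_dir/index   # the indexer will not overwrite an existing index
  bin/nutch index $crawl_dir/index $crawl_dir/crawldb \
      $crawl_dir/linkdb $crawl_dir/segments/*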
