Hi Matthew,

Thanks for your work on this! And I do apologize if my stupid questions caused you any discomfort. Anyway, I have already started testing the former version of your script, so I will look closer at your updated version as well and will keep you posted.
As for the segment merging, the more I search and read on the web about it, the more I think it should work as you expected. On the other hand, I know I can't be sure until I hack the source code (hacking Nutch is still a pain for me).

Thanks and regards!
Lukas

On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> It's not needed.. you use the bin/nutch script to generate the initial
> crawl..
>
> details here:
> http://lucene.apache.org/nutch/tutorial8.html#Intranet+Crawling
>
> Fred Tyre wrote:
> > First of all, thanks for the recrawl script.
> > I believe it will save me a few headaches.
> >
> > Secondly, is there a reason that there isn't a crawl script posted on the
> > FAQ?
> >
> > As far as I can tell, you could take your recrawl script and add in the
> > following line after you setup the crawl subdirectories.
> > $FT_NUTCH_BIN/nutch crawl urls -dir $crawl_dir -threads 2 -depth 3 -topN 50
> >
> > Obviously, the threads, depth and topN could be parameters as well.
> >
> > Thanks again.
> >
> > -----Original Message-----
> > From: Matthew Holt [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, August 08, 2006 2:00 PM
> > To: [email protected]; [email protected]
> > Subject: Re: [Fwd: Re: 0.8 Recrawl script updated]
> >
> >
> > Since it wasn't really clear whether my script approached the problem of
> > deleting segments correctly, I refactored it so it generates the new
> > number of segments, merges them into one, then deletes the "new"
> > segments. Not as efficient disk space wise, but still removes a large
> > number of the segments that are not being referenced by anything due to
> > not being indexed yet.
> >
> > I reupdated the wiki. Unless there is any more clarification regarding
> > the issue, hopefully I won't have to bombard your inbox with any more
> > emails regarding this.
> >
> > Matt
> >
> > Lukas Vlcek wrote:
> >
> >> Hi again,
> >>
> >> I just found related discussion here:
> >> http://www.nabble.com/NullPointException-tf2045994r1.html
> >>
> >> I think these guys are discussing similar problem and if I understood
> >> the conclusion correctly then the only solution right now is to write
> >> some code and test which segments are used in index and which are not.
> >>
> >> Regards,
> >> Lukas
> >>
> >> On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> >>
> >>> Matthew,
> >>>
> >>> In fact I didn't realize you are doing merge stuff (sorry for that)
> >>> but frankly I don't know how exactly merging works and if this
> >>> strategy would work in the long time perspective and whether it is
> >>> universal approach in all variability of cases which may occur during
> >>> crawling (-topN, threads frozen, pages unavailable, crawling dies, ...
> >>> etc), may be it is correct path. I would appreciate if anybody can
> >>> answer this question precisely.
> >>>
> >>> Thanks,
> >>> Lukas
> >>>
> >>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> If anyone doesn't mind taking a look...
> >>>>
> >>>>
> >>>>
> >>>> ---------- Forwarded message ----------
> >>>> From: Matthew Holt <[EMAIL PROTECTED]>
> >>>> To: [email protected]
> >>>> Date: Fri, 04 Aug 2006 10:07:57 -0400
> >>>> Subject: Re: 0.8 Recrawl script updated
> >>>> Lukas,
> >>>> Thanks for your e-mail. I assumed I could drop the $depth number of
> >>>> oldest segments because I first merged them all into one segment (which
> >>>> I don't drop). Am I incorrect in my assumption and can this cause
> >>>> problems in the future? If so, then I'll go back to the original version
> >>>> of my script when I kept all the segments without merging. However, it
> >>>> just seemed like if that is the case, it will be a problem after enough
> >>>> number of recrawls due to the large amount of segments being kept.
> >>>>
> >>>> Thanks,
> >>>> Matt
> >>>>
> >>>> Lukas Vlcek wrote:
> >>>>
> >>>>> Hi Matthew,
> >>>>>
> >>>>> I am curious about one thing. How do you know you can just drop $depth
> >>>>> number of the oldest segments in the end? I haven't studied nutch
> >>>>> code regarding this topic yet but I thought that segment can be
> >>>>> dropped once you are sure that all its content is already crawled in
> >>>>> some newer segments (which should be checked somehow via some
> >>>>> function/script - which hasn't been yet implemented to my knowledge).
> >>>>>
> >>>>> Also I don't think this question has been discussed on dev/user lists
> >>>>> in detail yet so I just wanted to ask you about your opinion. The
> >>>>> situation could get even more complicated if people add -topN
> >>>>> parameter into script (which can happen because some might prefer
> >>>>> crawling in ten smaller bunches over two huge crawls due to various
> >>>>> technical reasons).
> >>>>>
> >>>>> Anyway, never mind if you don't want to bother about my silly question
> >>>>> :-)
> >>>>>
> >>>>> Regards,
> >>>>> Lukas
> >>>>>
> >>>>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> Last email regarding this script. I found a bug in it that is sporadic
> >>>>>> (I think it only affected different setups). However, since it would be
> >>>>>> a problem sometimes, I refactored the script. I'd suggest you redownload
> >>>>>> the script if you are using it.
> >>>>>>
> >>>>>> Matt
> >>>>>>
> >>>>>> Matthew Holt wrote:
> >>>>>>
> >>>>>>> I'm currently pretty busy at work. If I have I'll do it later.
> >>>>>>>
> >>>>>>> The version 0.8 recrawl script has a working version online now.
> >>>>>>> I temporarily modified it on the website yesterday when I ran into some
> >>>>>>> problems, but I further tested it and the actual working code is
> >>>>>>> modified now. So if you got it off the web site any time yesterday, I
> >>>>>>> would redownload the script.
> >>>>>>>
> >>>>>>> Matt
> >>>>>>>
> >>>>>>> Lourival Júnior wrote:
> >>>>>>>
> >>>>>>>> Hi Matthew!
> >>>>>>>>
> >>>>>>>> Could you update the script to the version 0.7.2 with the same
> >>>>>>>> functionalities? I wrote a script that does this, but it doesn't work very
> >>>>>>>> well...
> >>>>>>>>
> >>>>>>>> Regards!
> >>>>>>>>
> >>>>>>>> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >>>>>>>>
> >>>>>>>>> Just letting everyone know that I updated the recrawl script on the
> >>>>>>>>> Wiki. It now merges the created segments then deletes the old segs to
> >>>>>>>>> prevent a lot of unneeded data remaining/growing on the hard drive.
> >>>>>>>>>
> >>>>>>>>> Matt
> >>>>>>>>>
> >>>>>>>>> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
