Matthew,

Looking over your recrawl script, it seems like you are merging *all*
segments together, including any old segments. It seems to me that you
could merge only the new segments. Could you explain a little of the
reasoning behind this?

Thanks,
Jacob Brunson
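For reference, a minimal sketch of what Jacob is suggesting (merging only the segments produced by the current recrawl, instead of every segment on disk). This is not the wiki script itself: the paths, the `$depth` value, and the stand-in segment names are illustrative, and it relies on the assumption that Nutch 0.8 names segment directories by creation timestamp, so the newest `$depth` directories are this run's output.

```shell
#!/bin/sh
# Sketch: merge only this run's segments, not the old ones.
# Assumes segment dirs are named by timestamp (as Nutch 0.8 does),
# so the $depth newest directories belong to the current recrawl.
crawl_dir=./crawl_demo
depth=3

# Stand-in segment dirs; a real run gets these from bin/nutch generate/fetch.
mkdir -p "$crawl_dir/segments/20060801000000" \
         "$crawl_dir/segments/20060808000001" \
         "$crawl_dir/segments/20060808000002" \
         "$crawl_dir/segments/20060808000003"

# Timestamped names sort chronologically, so take the $depth newest.
new_segs=$(ls -d "$crawl_dir"/segments/* | sort | tail -n "$depth")

# The old 20060801... segment is left out of the merge entirely.
echo "would run: bin/nutch mergesegs $crawl_dir/MERGEDsegments $new_segs"
```

The dry-run `echo` on the last line stands in for the real `bin/nutch mergesegs` call so the sketch can be run without a Nutch install.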
On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> Lukas,
>   Not stupid at all. I was already experiencing some issues with the
> script due to Tomcat not releasing its lock on some of the directories I
> was trying to delete. This isn't the most efficient solution, but I
> believe it to be the most stable.
>
> Thanks for bringing the issues to my attention, and keep me posted on the
> change. The job at which I'm using Nutch is about to end here on Friday,
> but I will try to join the nutch-user mailing list using my personal
> email address and keep up with development. If you don't hear from me
> and need me for any reason, my personal email address is mholtATelonDOTedu
>
> Take care,
> Matt
>
> Lukas Vlcek wrote:
> > Hi Matthew,
> >
> > Thanks for your work on this! And I do apologize if my stupid
> > questions caused your discomfort. Anyway, I have already started testing
> > the former version of your script, so I will look closer at your updated
> > version as well and will keep you posted.
> >
> > As for the segment merging, the more I search/read on the web about it,
> > the more I think it should work as you expected; on the other hand, I
> > know I can't be sure until I hack the source code (hacking Nutch is
> > still a pain for me).
> >
> > Thanks and regards!
> > Lukas
> >
> > On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> It's not needed... you use the bin/nutch script to generate the initial
> >> crawl.
> >>
> >> Details here:
> >> http://lucene.apache.org/nutch/tutorial8.html#Intranet+Crawling
> >>
> >> Fred Tyre wrote:
> >> > First of all, thanks for the recrawl script.
> >> > I believe it will save me a few headaches.
> >> >
> >> > Secondly, is there a reason that there isn't a crawl script posted
> >> > on the FAQ?
> >> >
> >> > As far as I can tell, you could take your recrawl script and add in
> >> > the following line after you set up the crawl subdirectories.
> >> > $FT_NUTCH_BIN/nutch crawl urls -dir $crawl_dir -threads 2 -depth 3 -topN 50
> >> >
> >> > Obviously, the threads, depth, and topN could be parameters as well.
> >> >
> >> > Thanks again.
> >> >
> >> > -----Original Message-----
> >> > From: Matthew Holt [mailto:[EMAIL PROTECTED]
> >> > Sent: Tuesday, August 08, 2006 2:00 PM
> >> > To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org
> >> > Subject: Re: [Fwd: Re: 0.8 Recrawl script updated]
> >> >
> >> > Since it wasn't really clear whether my script approached the problem
> >> > of deleting segments correctly, I refactored it so it generates the
> >> > new segments, merges them into one, then deletes the "new" segments.
> >> > Not as efficient disk-space-wise, but it still removes a large number
> >> > of the segments that are not being referenced by anything, due to not
> >> > being indexed yet.
> >> >
> >> > I re-updated the wiki. Unless there is any more clarification regarding
> >> > the issue, hopefully I won't have to bombard your inbox with any more
> >> > emails regarding this.
> >> >
> >> > Matt
> >> >
> >> > Lukas Vlcek wrote:
> >> >> Hi again,
> >> >>
> >> >> I just found a related discussion here:
> >> >> http://www.nabble.com/NullPointException-tf2045994r1.html
> >> >>
> >> >> I think these guys are discussing a similar problem, and if I
> >> >> understood the conclusion correctly, then the only solution right now
> >> >> is to write some code and test which segments are used in the index
> >> >> and which are not.
> >> >>
> >> >> Regards,
> >> >> Lukas
> >> >>
> >> >> On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> >> >>> Matthew,
> >> >>>
> >> >>> In fact I didn't realize you were doing the merge stuff (sorry for
> >> >>> that), but frankly I don't know exactly how merging works, whether
> >> >>> this strategy would work in the long-term perspective, and whether
> >> >>> it is a universal approach for all the variability of cases which
> >> >>> may occur during crawling (-topN, threads frozen, pages unavailable,
> >> >>> crawling dies, ... etc.). Maybe it is the correct path. I would
> >> >>> appreciate it if anybody could answer this question precisely.
> >> >>>
> >> >>> Thanks,
> >> >>> Lukas
> >> >>>
> >> >>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> >>>> If anyone doesn't mind taking a look...
> >> >>>>
> >> >>>> ---------- Forwarded message ----------
> >> >>>> From: Matthew Holt <[EMAIL PROTECTED]>
> >> >>>> To: nutch-user@lucene.apache.org
> >> >>>> Date: Fri, 04 Aug 2006 10:07:57 -0400
> >> >>>> Subject: Re: 0.8 Recrawl script updated
> >> >>>>
> >> >>>> Lukas,
> >> >>>>   Thanks for your e-mail. I assumed I could drop the $depth number
> >> >>>> of oldest segments because I first merged them all into one segment
> >> >>>> (which I don't drop). Am I incorrect in my assumption, and can this
> >> >>>> cause problems in the future? If so, then I'll go back to the
> >> >>>> original version of my script, where I kept all the segments
> >> >>>> without merging. However, it just seemed like, if that is the case,
> >> >>>> it will be a problem after a large enough number of recrawls due to
> >> >>>> the large number of segments being kept.
> >> >>>>
> >> >>>> Thanks,
> >> >>>> Matt
> >> >>>>
> >> >>>> Lukas Vlcek wrote:
> >> >>>>> Hi Matthew,
> >> >>>>>
> >> >>>>> I am curious about one thing.
> >> >>>>> How do you know you can just drop the $depth number of the oldest
> >> >>>>> segments in the end? I haven't studied the Nutch code regarding
> >> >>>>> this topic yet, but I thought that a segment can be dropped once
> >> >>>>> you are sure that all its content has already been crawled in some
> >> >>>>> newer segments (which should be checked somehow via some
> >> >>>>> function/script, which hasn't yet been implemented, to my
> >> >>>>> knowledge).
> >> >>>>>
> >> >>>>> Also, I don't think this question has been discussed on the
> >> >>>>> dev/user lists in detail yet, so I just wanted to ask you for your
> >> >>>>> opinion. The situation could get even more complicated if people
> >> >>>>> add the -topN parameter to the script (which can happen, because
> >> >>>>> some might prefer crawling in ten smaller bunches over two huge
> >> >>>>> crawls, for various technical reasons).
> >> >>>>>
> >> >>>>> Anyway, never mind if you don't want to bother with my silly
> >> >>>>> question :-)
> >> >>>>>
> >> >>>>> Regards,
> >> >>>>> Lukas
> >> >>>>>
> >> >>>>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> >>>>>> Last email regarding this script. I found a bug in it that is
> >> >>>>>> sporadic (I think it only affected certain setups). However,
> >> >>>>>> since it would be a problem sometimes, I refactored the script.
> >> >>>>>> I'd suggest you redownload the script if you are using it.
> >> >>>>>>
> >> >>>>>> Matt
> >> >>>>>>
> >> >>>>>> Matthew Holt wrote:
> >> >>>>>>> I'm currently pretty busy at work. If I have time I'll do it
> >> >>>>>>> later.
> >> >>>>>>>
> >> >>>>>>> The version 0.8 recrawl script has a working version online now.
I > >> >>> > >> >>>>>>> temporarily modified it on the website yesterday when I ran > >> >>>>>>> > >> >>> into some > >> >>> > >> >>>>>>> problems, but I further tested it and the actual working code is > >> >>>>>>> modified now. So if you got it off the web site any time > >> >>>>>>> > >> >>> yesterday, I > >> >>> > >> >>>>>>> would redownload the script. > >> >>>>>>> > >> >>>>>>> Matt > >> >>>>>>> > >> >>>>>>> Lourival JĂșnior wrote: > >> >>>>>>> > >> >>>>>>>> Hi Matthew! > >> >>>>>>>> > >> >>>>>>>> Could you update the script to the version 0.7.2 with the same > >> >>>>>>>> functionalities? I write a scritp that do this, but it don't > >> >>>>>>>> > >> >>> work > >> >>> > >> >>>>>> very > >> >>>>>> > >> >>>>>>>> well... > >> >>>>>>>> > >> >>>>>>>> Regards! > >> >>>>>>>> > >> >>>>>>>> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote: > >> >>>>>>>> > >> >>>>>>>>> Just letting everyone know that I updated the recrawl script > >> >>>>>>>>> > >> >>> on the > >> >>> > >> >>>>>>>>> Wiki. It now merges the created segments them deletes the old > >> >>>>>>>>> > >> >>>>>> segs to > >> >>>>>> > >> >>>>>>>>> prevent a lot of unneeded data remaining/growing on the hard > >> >>>>>>>>> > >> >>> drive. > >> >>> > >> >>>>>>>>> Matt > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> > >> > > >> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530b > >> > >> > b6fcdfb282fd27a207fc0aff03 > >> > > >> >>>>>>>>> > >> >>>>>>>> > >> >>>>>>>> > >> >>>> > >> >>>> > >> >>>> > >> > > >> > > >> > > >> > > > -- http://JacobBrunson.com ------------------------------------------------------------------------- Using Tomcat but need to do more? Need to support web services, security? 
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
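Pulling the thread together, below is a dry-run sketch of the recrawl flow Matthew describes (run $depth generate/fetch/updatedb cycles, merge the resulting segments into one, then delete the per-cycle segments), with threads, depth, and topN as parameters per Fred's suggestion. This is not the wiki script: the directory names are illustrative, and the `bin/nutch` invocations are only echoed by a stand-in function so the sketch runs without a Nutch install.

```shell
#!/bin/sh
# Dry-run sketch of the recrawl flow discussed in this thread.
# Defaults mirror Fred's suggested values; all paths are illustrative.
threads=${1:-2}
depth=${2:-3}
topn=${3:-50}
crawl_dir=${4:-crawl}

# Stand-in for the real binary: echoes the command instead of running it.
nutch() { echo "bin/nutch $*"; }

i=0
while [ "$i" -lt "$depth" ]; do
  nutch generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$topn"
  # A real run would pick up the newest timestamped segment dir here.
  segment="$crawl_dir/segments/seg_$i"
  nutch fetch "$segment" -threads "$threads"
  nutch updatedb "$crawl_dir/crawldb" "$segment"
  i=$((i + 1))
done

# Merge this run's segments into one; afterwards the per-cycle segments
# can be deleted, which is the disk-space trade-off Matthew mentions.
nutch mergesegs "$crawl_dir/MERGEDsegments" "$crawl_dir"/segments/seg_*
```

Because only the merged segment is kept, disk usage spikes during the merge but stays bounded across repeated recrawls, which is the trade-off the thread converges on.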