Since it wasn't clear whether my script handled segment deletion correctly, I refactored it: it now generates the new segments, merges them into one, and then deletes the individual "new" segments. This is not as efficient in terms of disk space, but it still removes a large number of segments that nothing references because they have not been indexed yet.
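Roughly, the merge-then-delete step looks like the sketch below. This is not the script from the wiki: NUTCH, merge_new_segments, and the temporary directory name are made up for illustration, and it assumes mergesegs accepts an output directory followed by the segments to merge.

```shell
#!/usr/bin/env bash
# Sketch only: merge freshly fetched segments into one, then delete the originals.
# NUTCH, merge_new_segments and the .merge_tmp suffix are hypothetical names.

NUTCH="${NUTCH:-bin/nutch}"   # assumed location of the Nutch launcher

# usage: merge_new_segments <segments_dir> <new_segment>...
merge_new_segments() {
  local segments_dir=$1; shift
  local merge_dir="${segments_dir%/}.merge_tmp"   # temporary merge output

  [ "$#" -gt 0 ] || { echo "no new segments given" >&2; return 1; }

  rm -rf "$merge_dir"
  # Assumed form: mergesegs <output_dir> <seg1> <seg2> ...
  "$NUTCH" mergesegs "$merge_dir" "$@" || { rm -rf "$merge_dir"; return 1; }

  # Refuse to delete anything unless the merge actually produced output.
  if [ -z "$(ls -A "$merge_dir" 2>/dev/null)" ]; then
    echo "merge produced no output; keeping original segments" >&2
    return 1
  fi

  # Delete only the segments that were just merged, then move the merged one in.
  rm -rf "$@"
  mv "$merge_dir"/* "$segments_dir"/
  rmdir "$merge_dir"
}
```

Only the segments passed in are removed, so older segments in the same directory are left alone, and nothing is deleted if the merge fails.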
I re-updated the wiki. Unless the issue needs further clarification, hopefully I won't have to bombard your inbox with any more emails about this.

Matt

Lukas Vlcek wrote:
> Hi again,
>
> I just found a related discussion here:
> http://www.nabble.com/NullPointException-tf2045994r1.html
>
> I think these guys are discussing a similar problem, and if I understood
> the conclusion correctly, the only solution right now is to write
> some code and test which segments are used in the index and which are not.
>
> Regards,
> Lukas
>
> On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>> Matthew,
>>
>> In fact I didn't realize you are doing merge stuff (sorry for that),
>> but frankly I don't know exactly how merging works, whether this
>> strategy would work in the long term, or whether it is a universal
>> approach for all the cases which may occur during crawling (-topN,
>> threads frozen, pages unavailable, crawling dies, etc.). Maybe it is
>> the correct path. I would appreciate it if anybody could answer this
>> question precisely.
>>
>> Thanks,
>> Lukas
>>
>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> > If anyone doesn't mind taking a look...
>> >
>> > ---------- Forwarded message ----------
>> > From: Matthew Holt <[EMAIL PROTECTED]>
>> > To: [email protected]
>> > Date: Fri, 04 Aug 2006 10:07:57 -0400
>> > Subject: Re: 0.8 Recrawl script updated
>> > Lukas,
>> > Thanks for your e-mail. I assumed I could drop the $depth oldest
>> > segments because I first merged them all into one segment (which
>> > I don't drop). Am I incorrect in my assumption, and can this cause
>> > problems in the future? If so, then I'll go back to the original
>> > version of my script, where I kept all the segments without merging.
>> > However, if that is the case, it seemed like it would become a
>> > problem after enough recrawls due to the large number of segments
>> > being kept.
>> >
>> > Thanks,
>> > Matt
>> >
>> > Lukas Vlcek wrote:
>> > > Hi Matthew,
>> > >
>> > > I am curious about one thing. How do you know you can just drop the
>> > > $depth oldest segments at the end? I haven't studied the Nutch
>> > > code regarding this topic yet, but I thought that a segment can be
>> > > dropped only once you are sure that all its content has already been
>> > > crawled in some newer segments (which should be checked somehow via
>> > > some function/script, which hasn't been implemented yet to my
>> > > knowledge).
>> > >
>> > > Also, I don't think this question has been discussed on the dev/user
>> > > lists in detail yet, so I just wanted to ask your opinion. The
>> > > situation could get even more complicated if people add the -topN
>> > > parameter to the script (which can happen because some might prefer
>> > > crawling in ten smaller batches over two huge crawls for various
>> > > technical reasons).
>> > >
>> > > Anyway, never mind if you don't want to bother with my silly
>> > > question :-)
>> > >
>> > > Regards,
>> > > Lukas
>> > >
>> > > On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> > >> Last email regarding this script. I found a bug in it that is
>> > >> sporadic (I think it only affected different setups). However, since
>> > >> it would be a problem sometimes, I refactored the script. I'd
>> > >> suggest you redownload the script if you are using it.
>> > >>
>> > >> Matt
>> > >>
>> > >> Matthew Holt wrote:
>> > >> > I'm currently pretty busy at work. If I have time, I'll do it later.
>> > >> >
>> > >> > The version 0.8 recrawl script has a working version online now. I
>> > >> > temporarily modified it on the website yesterday when I ran into
>> > >> > some problems, but I tested it further, and the modified version
>> > >> > now online is the actual working code. So if you got it off the
>> > >> > web site any time yesterday, I would redownload the script.
>> > >> >
>> > >> > Matt
>> > >> >
>> > >> > Lourival Júnior wrote:
>> > >> >> Hi Matthew!
>> > >> >>
>> > >> >> Could you update the script for version 0.7.2 with the same
>> > >> >> functionality? I wrote a script that does this, but it doesn't
>> > >> >> work very well...
>> > >> >>
>> > >> >> Regards!
>> > >> >>
>> > >> >> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> > >> >>>
>> > >> >>> Just letting everyone know that I updated the recrawl script on
>> > >> >>> the Wiki. It now merges the created segments, then deletes the
>> > >> >>> old segs to prevent a lot of unneeded data remaining/growing on
>> > >> >>> the hard drive.
>> > >> >>>
>> > >> >>> Matt
>> > >> >>>
>> > >> >>> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
