Re: [Nutch-general] [Fwd: Re: 0.8 Recrawl script updated]

Matthew Holt Tue, 08 Aug 2006 12:00:17 -0700

Since it wasn't really clear whether my script approached the problem of
deleting segments correctly, I refactored it so it generates the new
segments, merges them into one, then deletes the "new" segments. It is not
as efficient in terms of disk space, but it still removes a large number
of the segments that nothing references because they have not been
indexed yet.
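
The generate / merge / delete order described above can be sketched roughly as follows. This is only an illustration of the bookkeeping, not the script on the wiki: the function names (list_segments, merge_and_prune, fake_merge) are hypothetical, and the merge step is passed in as a callback instead of calling any real Nutch command, so fake_merge is just a stand-in that copies files.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional, Set


def list_segments(segments_dir: Path) -> List[Path]:
    """Return the segment directories sorted by name."""
    return sorted(p for p in segments_dir.iterdir() if p.is_dir())


def merge_and_prune(
    segments_dir: Path,
    before: Set[str],
    merge: Callable[[List[Path]], Path],
) -> Optional[Path]:
    """Merge the segments created since `before` was recorded, then delete them.

    `before` holds the segment names that existed before the recrawl started.
    `merge` is a hypothetical callback that writes one merged segment and
    returns its path. Older segments are never touched.
    """
    new_segments = [p for p in list_segments(segments_dir) if p.name not in before]
    if not new_segments:
        return None
    merged = merge(new_segments)
    # Delete the "new" segments only after the merge has returned.
    for seg in new_segments:
        if seg != merged:
            shutil.rmtree(seg)
    return merged


def fake_merge(new_segments: List[Path]) -> Path:
    """Stand-in for the real merge: copy every file into one directory."""
    merged = new_segments[0].parent / "merged"
    merged.mkdir()
    for seg in new_segments:
        for f in seg.iterdir():
            shutil.copy(f, merged / f"{seg.name}-{f.name}")
    return merged
```

Recording the segment names before the recrawl is what keeps the delete step from touching segments an existing index may still reference; only the segments created in this run are removed, and only once the merged copy exists.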


I updated the wiki again. Unless any more clarification is needed on
the issue, hopefully I won't have to bombard your inbox with any more
emails about this.

Matt

Lukas Vlcek wrote:
> Hi again,
>
> I just found related discussion here:
> http://www.nabble.com/NullPointException-tf2045994r1.html
>
> I think these guys are discussing a similar problem, and if I understood
> the conclusion correctly then the only solution right now is to write
> some code and test which segments are used in the index and which are not.
>
> Regards,
> Lukas
>
> On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>> Matthew,
>>
>> In fact I didn't realize you were doing the merge step (sorry for that),
>> but frankly I don't know exactly how merging works, whether this
>> strategy would work in the long term, or whether it is a universal
>> approach across all the cases which may occur during crawling (-topN,
>> frozen threads, unavailable pages, a crawl dying, etc.). Maybe it is
>> the correct path. I would appreciate it if anybody could answer this
>> question precisely.
>>
>> Thanks,
>> Lukas
>>
>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> > If anyone doesn't mind taking a look...
>> >
>> >
>> >
>> > ---------- Forwarded message ----------
>> > From: Matthew Holt <[EMAIL PROTECTED]>
>> > To: [email protected]
>> > Date: Fri, 04 Aug 2006 10:07:57 -0400
>> > Subject: Re: 0.8 Recrawl script updated
>> > Lukas,
>> >    Thanks for your e-mail. I assumed I could drop the $depth number of
>> > oldest segments because I first merged them all into one segment (which
>> > I don't drop). Am I incorrect in my assumption and can this cause
>> > problems in the future? If so, then I'll go back to the original version
>> > of my script, where I kept all the segments without merging. However, it
>> > just seemed like if that is the case, it will be a problem after enough
>> > recrawls due to the large number of segments being kept.
>> >
>> >  Thanks,
>> >   Matt
>> >
>> > Lukas Vlcek wrote:
>> > > Hi Matthew,
>> > >
>> > > I am curious about one thing. How do you know you can just drop $depth
>> > > number of the oldest segments in the end? I haven't studied the nutch
>> > > code regarding this topic yet but I thought that a segment can be
>> > > dropped once you are sure that all its content is already crawled in
>> > > some newer segments (which should be checked somehow via some
>> > > function/script - which hasn't been implemented yet to my knowledge).
>> > >
>> > > Also I don't think this question has been discussed on the dev/user
>> > > lists in detail yet so I just wanted to ask you for your opinion. The
>> > > situation could get even more complicated if people add the -topN
>> > > parameter to the script (which can happen because some might prefer
>> > > crawling in ten smaller batches over two huge crawls for various
>> > > technical reasons).
>> > >
>> > > Anyway, never mind if you don't want to bother with my silly question
>> > > :-)
>> > >
>> > > Regards,
>> > > Lukas
>> > >
>> > > On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> > >> Last email regarding this script. I found a bug in it that is sporadic
>> > >> (I think it only affected certain setups). However, since it would be
>> > >> a problem sometimes, I refactored the script. I'd suggest you redownload
>> > >> the script if you are using it.
>> > >>
>> > >> Matt
>> > >>
>> > >> Matthew Holt wrote:
>> > >> > I'm currently pretty busy at work. If I have time I'll do it later.
>> > >> >
>> > >> > The version 0.8 recrawl script has a working version online now. I
>> > >> > temporarily modified it on the website yesterday when I ran into some
>> > >> > problems, but I further tested it and the actual working code is
>> > >> > modified now. So if you got it off the web site any time yesterday, I
>> > >> > would redownload the script.
>> > >> >
>> > >> > Matt
>> > >> >
>> > >> > Lourival Júnior wrote:
>> > >> >> Hi Matthew!
>> > >> >>
>> > >> >> Could you update the script for version 0.7.2 with the same
>> > >> >> functionality? I wrote a script that does this, but it doesn't work
>> > >> >> very well...
>> > >> >>
>> > >> >> Regards!
>> > >> >>
>> > >> >> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> > >> >>>
>> > >> >>> Just letting everyone know that I updated the recrawl script on the
>> > >> >>> Wiki. It now merges the created segments, then deletes the old
>> > >> >>> segs to prevent a lot of unneeded data remaining/growing on the
>> > >> >>> hard drive.
>> > >> >>>   Matt
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
>> > >> >>>
>> > >> >>>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >
>> > >>
>> > >
>> >
>> >
>> >
>> >
>>
>
