Lukas,
  Not stupid at all. I was already experiencing some issues with the 
script due to Tomcat not releasing its lock on some of the directories I 
was trying to delete. This isn't the most efficient solution, but I 
believe it to be the most stable.
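
For the archives, the core of the approach is roughly this (a sketch 
against the Nutch 0.8 command set; the crawl layout, loop, and parameter 
values are illustrative, so check the wiki script for the real thing):

   depth=3
   # fetch $depth new segments into an existing 0.8-style crawl dir
   for ((i=0; i < depth; i++)); do
     bin/nutch generate crawl/crawldb crawl/segments
     segment=`ls -d crawl/segments/* | tail -1`
     bin/nutch fetch $segment
     bin/nutch updatedb crawl/crawldb $segment
   done

   # merge everything into one segment, then drop the per-pass segments
   # that the merge has made redundant
   bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
   rm -rf crawl/segments/*
   mv crawl/MERGEDsegments/* crawl/segments/
   rmdir crawl/MERGEDsegments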

Thanks for bringing the issues to my attention, and keep me posted on 
the change. The job where I'm using Nutch ends this Friday, but I will 
try to join the nutch-user mailing list using my personal email address 
and keep up with development. If you don't hear from me and need me for 
any reason, my personal email address is mholtATelonDOTedu

Take care,
  Matt

Lukas Vlcek wrote:
> Hi Matthew,
>
> Thanks for your work on this! And I do apologize if my stupid
> questions caused you any discomfort. Anyway, I have already started
> testing the previous version of your script, so I will look closer at
> your updated version as well and will keep you posted.
>
> As for the segment merging, the more I search and read on the web
> about it, the more I think it should work as you expected. On the
> other hand, I know I can't be sure until I hack the source code
> (hacking Nutch is still a pain for me).
>
> Thanks and regards!
> Lukas
>
> On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> It's not needed; you use the bin/nutch script to generate the initial
>> crawl.
>>
>> details here:
>> http://lucene.apache.org/nutch/tutorial8.html#Intranet+Crawling
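>>
>> For example, something like this (values illustrative, mirroring the
>> tutorial):
>>
>>    bin/nutch crawl urls -dir crawl -depth 3 -topN 50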
>>
>> Fred Tyre wrote:
>> > First of all, thanks for the recrawl script.
>> > I believe it will save me a few headaches.
>> >
>> > Secondly, is there a reason that there isn't a crawl script posted on
>> > the FAQ?
>> >
>> > As far as I can tell, you could take your recrawl script and add in
>> > the following line after you set up the crawl subdirectories:
>> >    $FT_NUTCH_BIN/nutch crawl urls -dir $crawl_dir -threads 2 -depth 3 -topN 50
>> >
>> > Obviously, the threads, depth and topN could be parameters as well.
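>> >
>> > For instance (a sketch with hypothetical variable names):
>> >
>> >    $FT_NUTCH_BIN/nutch crawl urls -dir $crawl_dir -threads $threads \
>> >        -depth $depth -topN $topN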
>> >
>> > Thanks again.
>> >
>> > -----Original Message-----
>> > From: Matthew Holt [mailto:[EMAIL PROTECTED]]
>> > Sent: Tuesday, August 08, 2006 2:00 PM
>> > To: [email protected]; [email protected]
>> > Subject: Re: [Fwd: Re: 0.8 Recrawl script updated]
>> >
>> >
>> > Since it wasn't really clear whether my script approached the problem
>> > of deleting segments correctly, I refactored it so that it generates
>> > the new segments, merges them into one, and then deletes the "new"
>> > segments. Not as efficient disk-space-wise, but it still removes a
>> > large number of the segments that are not referenced by anything
>> > because they have not been indexed yet.
>> >
>> > I updated the wiki again. Unless the issue needs any more
>> > clarification, hopefully I won't have to bombard your inbox with any
>> > more emails about it.
>> >
>> > Matt
>> >
>> > Lukas Vlcek wrote:
>> >
>> >> Hi again,
>> >>
>> >> I just found a related discussion here:
>> >> http://www.nabble.com/NullPointException-tf2045994r1.html
>> >>
>> >> I think these guys are discussing a similar problem, and if I
>> >> understood the conclusion correctly, the only solution right now is
>> >> to write some code to test which segments are used in the index and
>> >> which are not.
>> >>
>> >> Regards,
>> >> Lukas
>> >>
>> >> On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>> >>
>> >>> Matthew,
>> >>>
>> >>> In fact I didn't realize you were doing the merge step (sorry for
>> >>> that), but frankly I don't know exactly how merging works, whether
>> >>> this strategy would work over the long term, and whether it is a
>> >>> universal approach for all the cases that may occur during crawling
>> >>> (-topN, frozen threads, unavailable pages, a crawl that dies, etc.);
>> >>> maybe it is the correct path. I would appreciate it if anybody could
>> >>> answer this question precisely.
>> >>>
>> >>> Thanks,
>> >>> Lukas
>> >>>
>> >>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>>> If anyone doesn't mind taking a look...
>> >>>>
>> >>>>
>> >>>>
>> >>>> ---------- Forwarded message ----------
>> >>>> From: Matthew Holt <[EMAIL PROTECTED]>
>> >>>> To: [email protected]
>> >>>> Date: Fri, 04 Aug 2006 10:07:57 -0400
>> >>>> Subject: Re: 0.8 Recrawl script updated
>> >>>> Lukas,
>> >>>>    Thanks for your e-mail. I assumed I could drop the $depth
>> >>>> oldest segments because I first merged them all into one segment
>> >>>> (which I don't drop). Am I incorrect in my assumption, and can
>> >>>> this cause problems in the future? If so, I'll go back to the
>> >>>> original version of my script, which kept all the segments without
>> >>>> merging. However, it just seemed like, if that is the case, it
>> >>>> will become a problem after enough recrawls due to the large
>> >>>> number of segments being kept.
>> >>>>
>> >>>>  Thanks,
>> >>>>   Matt
>> >>>>
>> >>>> Lukas Vlcek wrote:
>> >>>>
>> >>>>> Hi Matthew,
>> >>>>>
>> >>>>> I am curious about one thing. How do you know you can just drop
>> >>>>> the $depth oldest segments at the end? I haven't studied the
>> >>>>> Nutch code on this topic yet, but I thought a segment could be
>> >>>>> dropped only once you are sure that all its content has already
>> >>>>> been crawled into some newer segment (which should be checked
>> >>>>> somehow via some function/script, which to my knowledge hasn't
>> >>>>> been implemented yet).
>> >>>>>
>> >>>>> Also, I don't think this question has been discussed on the
>> >>>>> dev/user lists in detail yet, so I just wanted to ask your
>> >>>>> opinion. The situation could get even more complicated if people
>> >>>>> add the -topN parameter to the script (which can happen, because
>> >>>>> some might prefer crawling in ten smaller bunches over two huge
>> >>>>> crawls for various technical reasons).
>> >>>>>
>> >>>>> Anyway, never mind if you don't want to bother with my silly
>> >>>>> question :-)
>> >>>>>
>> >>>>> Regards,
>> >>>>> Lukas
>> >>>>>
>> >>>>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> >>>>>
>> >>>>>> Last email regarding this script. I found a sporadic bug in it
>> >>>>>> (I think it only affected certain setups). However, since it
>> >>>>>> would be a problem sometimes, I refactored the script. I'd
>> >>>>>> suggest you redownload the script if you are using it.
>> >>>>>>
>> >>>>>> Matt
>> >>>>>>
>> >>>>>> Matthew Holt wrote:
>> >>>>>>
>> >>>>>>> I'm currently pretty busy at work. If I have time, I'll do it
>> >>>>>>> later.
>> >>>>>>>
>> >>>>>>> The version 0.8 recrawl script has a working version online now.
>> >>>>>>> I temporarily modified it on the website yesterday when I ran
>> >>>>>>> into some problems, but I have tested it further and the actual
>> >>>>>>> working code is up now. So if you got it off the web site any
>> >>>>>>> time yesterday, I would redownload the script.
>> >>>>>>>
>> >>>>>>> Matt
>> >>>>>>>
>> >>>>>>> Lourival JĂșnior wrote:
>> >>>>>>>
>> >>>>>>>> Hi Matthew!
>> >>>>>>>>
>> >>>>>>>> Could you update the script to version 0.7.2 with the same
>> >>>>>>>> functionality? I wrote a script that does this, but it doesn't
>> >>>>>>>> work very well...
>> >>>>>>>>
>> >>>>>>>> Regards!
>> >>>>>>>>
>> >>>>>>>> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>> >>>>>>>>
>> >>>>>>>>> Just letting everyone know that I updated the recrawl script
>> >>>>>>>>> on the Wiki. It now merges the created segments, then deletes
>> >>>>>>>>> the old segments to prevent a lot of unneeded data from
>> >>>>>>>>> remaining and growing on the hard drive.
>> >>>>>>>>>
>> >>>>>>>>>   Matt
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>
>> >>>>
>> >>>>
>> >
>> >
>> >
>>
>
