Matthew,

Looking over your recrawl script, it seems like you are merging *all*
segments together, including any old segments. It seems to me that you
could merge only the new segments. Could you explain a little of the
reasoning behind this?

Thanks,
Jacob Brunson
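For reference, a minimal sketch of what Jacob is suggesting (merging only the segments produced by the current recrawl, instead of every segment on disk). This is not the wiki script itself: the paths, the `$depth` value, and the stand-in segment names are illustrative, and it relies on the assumption that Nutch 0.8 names segment directories by creation timestamp, so the newest `$depth` directories are this run's output.

```shell
#!/bin/sh
# Sketch: merge only this run's segments, not the old ones.
# Assumes segment dirs are named by timestamp (as Nutch 0.8 does),
# so the $depth newest directories belong to the current recrawl.
crawl_dir=./crawl_demo
depth=3

# Stand-in segment dirs; a real run gets these from bin/nutch generate/fetch.
mkdir -p "$crawl_dir/segments/20060801000000" \
         "$crawl_dir/segments/20060808000001" \
         "$crawl_dir/segments/20060808000002" \
         "$crawl_dir/segments/20060808000003"

# Timestamped names sort chronologically, so take the $depth newest.
new_segs=$(ls -d "$crawl_dir"/segments/* | sort | tail -n "$depth")

# The old 20060801... segment is left out of the merge entirely.
echo "would run: bin/nutch mergesegs $crawl_dir/MERGEDsegments $new_segs"
```

The dry-run `echo` on the last line stands in for the real `bin/nutch mergesegs` call so the sketch can be run without a Nutch install.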
On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> Lukas,
>   Not stupid at all. I was already experiencing some issues with the
> script due to Tomcat not releasing its lock on some of the directories I
> was trying to delete. This isn't the most efficient solution, but I
> believe it to be the most stable.
>
> Thanks for bringing the issues to my attention, and keep me posted on the
> change. The job at which I'm using Nutch is about to end here on Friday,
> but I will try to join the nutch-user mailing list using my personal
> email address and keep up with development. If you don't hear from me
> and need me for any reason, my personal email address is mholtATelonDOTedu
>
> Take care,
> Matt
>
> Lukas Vlcek wrote:
> > Hi Matthew,
> >
> > Thanks for your work on this! And I do apologize if my stupid
> > questions caused your discomfort. Anyway, I have already started testing
> > the former version of your script, so I will look closer at your updated
> > version as well and will keep you posted.
> >
> > As for the segment merging, the more I search/read on the web about it,
> > the more I think it should work as you expected; on the other hand, I
> > know I can't be sure until I hack the source code (hacking Nutch is
> > still a pain for me).
> >
> > Thanks and regards!
> > Lukas
> >
> > On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> It's not needed... you use the bin/nutch script to generate the initial
> >> crawl.
> >>
> >> Details here:
> >> http://lucene.apache.org/nutch/tutorial8.html#Intranet+Crawling
> >>
> >> Fred Tyre wrote:
> >> > First of all, thanks for the recrawl script.
> >> > I believe it will save me a few headaches.
> >> >
> >> > Secondly, is there a reason that there isn't a crawl script posted
> >> > on the FAQ?
> >> >
> >> > As far as I can tell, you could take your recrawl script and add in
> >> > the following line after you set up the crawl subdirectories.
> >> > $FT_NUTCH_BIN/nutch crawl urls -dir $crawl_dir -threads 2 -depth 3 -topN 50
> >> >
> >> > Obviously, the threads, depth, and topN could be parameters as well.
> >> >
> >> > Thanks again.
> >> >
> >> > -----Original Message-----
> >> > From: Matthew Holt [mailto:[EMAIL PROTECTED]
> >> > Sent: Tuesday, August 08, 2006 2:00 PM
> >> > To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org
> >> > Subject: Re: [Fwd: Re: 0.8 Recrawl script updated]
> >> >
> >> > Since it wasn't really clear whether my script approached the problem
> >> > of deleting segments correctly, I refactored it so it generates the
> >> > new segments, merges them into one, then deletes the "new" segments.
> >> > Not as efficient disk-space-wise, but it still removes a large number
> >> > of the segments that are not being referenced by anything, due to not
> >> > being indexed yet.
> >> >
> >> > I re-updated the wiki. Unless there is any more clarification regarding
> >> > the issue, hopefully I won't have to bombard your inbox with any more
> >> > emails regarding this.
> >> >
> >> > Matt
> >> >
> >> > Lukas Vlcek wrote:
> >> >> Hi again,
> >> >>
> >> >> I just found a related discussion here:
> >> >> http://www.nabble.com/NullPointException-tf2045994r1.html
> >> >>
> >> >> I think these guys are discussing a similar problem, and if I
> >> >> understood the conclusion correctly, then the only solution right now
> >> >> is to write some code and test which segments are used in the index
> >> >> and which are not.
> >> >>
> >> >> Regards,
> >> >> Lukas
> >> >>
> >> >> On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> >> >>> Matthew,
> >> >>>
> >> >>> In fact I didn't realize you were doing the merge stuff (sorry for
> >> >>> that), but frankly I don't know exactly how merging works, whether
> >> >>> this strategy would work in the long-term perspective, and whether
> >> >>> it is a universal approach for all the variability of cases which
> >> >>> may occur during crawling (-topN, threads frozen, pages unavailable,
> >> >>> crawling dies, ... etc.). Maybe it is the correct path. I would
> >> >>> appreciate it if anybody could answer this question precisely.
> >> >>>
> >> >>> Thanks,
> >> >>> Lukas
> >> >>>
> >> >>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> >>>> If anyone doesn't mind taking a look...
> >> >>>>
> >> >>>> ---------- Forwarded message ----------
> >> >>>> From: Matthew Holt <[EMAIL PROTECTED]>
> >> >>>> To: nutch-user@lucene.apache.org
> >> >>>> Date: Fri, 04 Aug 2006 10:07:57 -0400
> >> >>>> Subject: Re: 0.8 Recrawl script updated
> >> >>>>
> >> >>>> Lukas,
> >> >>>>   Thanks for your e-mail. I assumed I could drop the $depth number
> >> >>>> of oldest segments because I first merged them all into one segment
> >> >>>> (which I don't drop). Am I incorrect in my assumption, and can this
> >> >>>> cause problems in the future? If so, then I'll go back to the
> >> >>>> original version of my script, where I kept all the segments
> >> >>>> without merging. However, it just seemed like, if that is the case,
> >> >>>> it will be a problem after a large enough number of recrawls due to
> >> >>>> the large number of segments being kept.
> >> >>>>
> >> >>>> Thanks,
> >> >>>> Matt
> >> >>>>
> >> >>>> Lukas Vlcek wrote:
> >> >>>>> Hi Matthew,
> >> >>>>>
> >> >>>>> I am curious about one thing.
> >> >>>>> How do you know you can just drop the $depth number of the oldest
> >> >>>>> segments in the end? I haven't studied the Nutch code regarding
> >> >>>>> this topic yet, but I thought that a segment can be dropped once
> >> >>>>> you are sure that all its content has already been crawled in some
> >> >>>>> newer segments (which should be checked somehow via some
> >> >>>>> function/script, which hasn't yet been implemented, to my
> >> >>>>> knowledge).
> >> >>>>>
> >> >>>>> Also, I don't think this question has been discussed on the
> >> >>>>> dev/user lists in detail yet, so I just wanted to ask you for your
> >> >>>>> opinion. The situation could get even more complicated if people
> >> >>>>> add the -topN parameter to the script (which can happen, because
> >> >>>>> some might prefer crawling in ten smaller bunches over two huge
> >> >>>>> crawls, for various technical reasons).
> >> >>>>>
> >> >>>>> Anyway, never mind if you don't want to bother with my silly
> >> >>>>> question :-)
> >> >>>>>
> >> >>>>> Regards,
> >> >>>>> Lukas
> >> >>>>>
> >> >>>>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >> >>>>>> Last email regarding this script. I found a bug in it that is
> >> >>>>>> sporadic (I think it only affected certain setups). However,
> >> >>>>>> since it would be a problem sometimes, I refactored the script.
> >> >>>>>> I'd suggest you redownload the script if you are using it.
> >> >>>>>>
> >> >>>>>> Matt
> >> >>>>>>
> >> >>>>>> Matthew Holt wrote:
> >> >>>>>>> I'm currently pretty busy at work. If I have time I'll do it
> >> >>>>>>> later.
> >> >>>>>>>
> >> >>>>>>> The version 0.8 recrawl script has a working version online now.
I > >> >>> > >> >>>>>>> temporarily modified it on the website yesterday when I ran > >> >>>>>>> > >> >>> into some > >> >>> > >> >>>>>>> problems, but I further tested it and the actual working code is > >> >>>>>>> modified now. So if you got it off the web site any time > >> >>>>>>> > >> >>> yesterday, I > >> >>> > >> >>>>>>> would redownload the script. > >> >>>>>>> > >> >>>>>>> Matt > >> >>>>>>> > >> >>>>>>> Lourival JĂșnior wrote: > >> >>>>>>> > >> >>>>>>>> Hi Matthew! > >> >>>>>>>> > >> >>>>>>>> Could you update the script to the version 0.7.2 with the same > >> >>>>>>>> functionalities? I write a scritp that do this, but it don't > >> >>>>>>>> > >> >>> work > >> >>> > >> >>>>>> very > >> >>>>>> > >> >>>>>>>> well... > >> >>>>>>>> > >> >>>>>>>> Regards! > >> >>>>>>>> > >> >>>>>>>> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote: > >> >>>>>>>> > >> >>>>>>>>> Just letting everyone know that I updated the recrawl script > >> >>>>>>>>> > >> >>> on the > >> >>> > >> >>>>>>>>> Wiki. It now merges the created segments them deletes the old > >> >>>>>>>>> > >> >>>>>> segs to > >> >>>>>> > >> >>>>>>>>> prevent a lot of unneeded data remaining/growing on the hard > >> >>>>>>>>> > >> >>> drive. > >> >>> > >> >>>>>>>>> Matt > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> > >> > > >> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530b > >> > >> > b6fcdfb282fd27a207fc0aff03 > >> > > >> >>>>>>>>> > >> >>>>>>>> > >> >>>>>>>> > >> >>>> > >> >>>> > >> >>>> > >> > > >> > > >> > > >> > > > -- http://JacobBrunson.com ------------------------------------------------------------------------- Using Tomcat but need to do more? Need to support web services, security? 
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
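Pulling the thread together, below is a dry-run sketch of the recrawl flow Matthew describes (run $depth generate/fetch/updatedb cycles, merge the resulting segments into one, then delete the per-cycle segments), with threads, depth, and topN as parameters per Fred's suggestion. This is not the wiki script: the directory names are illustrative, and the `bin/nutch` invocations are only echoed by a stand-in function so the sketch runs without a Nutch install.

```shell
#!/bin/sh
# Dry-run sketch of the recrawl flow discussed in this thread.
# Defaults mirror Fred's suggested values; all paths are illustrative.
threads=${1:-2}
depth=${2:-3}
topn=${3:-50}
crawl_dir=${4:-crawl}

# Stand-in for the real binary: echoes the command instead of running it.
nutch() { echo "bin/nutch $*"; }

i=0
while [ "$i" -lt "$depth" ]; do
  nutch generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$topn"
  # A real run would pick up the newest timestamped segment dir here.
  segment="$crawl_dir/segments/seg_$i"
  nutch fetch "$segment" -threads "$threads"
  nutch updatedb "$crawl_dir/crawldb" "$segment"
  i=$((i + 1))
done

# Merge this run's segments into one; afterwards the per-cycle segments
# can be deleted, which is the disk-space trade-off Matthew mentions.
nutch mergesegs "$crawl_dir/MERGEDsegments" "$crawl_dir"/segments/seg_*
```

Because only the merged segment is kept, disk usage spikes during the merge but stays bounded across repeated recrawls, which is the trade-off the thread converges on.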