Re: [Nutch-general] [Fwd: Re: 0.8 Recrawl script updated]

Lukas Vlcek Tue, 08 Aug 2006 14:46:34 -0700

Hi Matthew,

Thanks for your work on this! And I do apologize if my stupid
questions caused you any discomfort. Anyway, I had already started testing
the earlier version of your script, so I will look more closely at your
updated version as well and will keep you posted.


As for the segment merging, the more I search and read on the web about it,
the more I think it should work as you expected. On the other hand, I know
I can't be sure until I hack the source code (hacking Nutch is still a
pain for me).

Thanks and regards!
Lukas
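
For readers following the thread, the merge-then-delete approach that Matthew describes in the quoted message below could be sketched roughly as follows. This is not the script from the wiki, only a minimal illustration. The function name merge_new_segments, the NUTCH variable and the temporary directory name are assumptions made up for this sketch, and it assumes that "bin/nutch mergesegs <output_dir> <seg1> <seg2> ..." writes one merged segment under the output directory.

```shell
#!/bin/sh
# Sketch only: merge the segments created by one recrawl into a single
# segment, then delete the per-round segments that were merged.
# NUTCH and merge_new_segments are hypothetical names for illustration.
NUTCH="${NUTCH:-bin/nutch}"

merge_new_segments() {
    segments_dir="$1"   # e.g. crawl/segments
    shift               # remaining arguments: segments created by this recrawl

    merged_tmp="${segments_dir}_merged_tmp"

    # Assumption: mergesegs writes the merged segment under $merged_tmp.
    "$NUTCH" mergesegs "$merged_tmp" "$@" || return 1

    # Remove the "new" segments only after the merge has succeeded.
    for seg in "$@"; do
        rm -rf "$seg"
    done

    # Move the merged segment next to the segments that were left alone.
    mv "$merged_tmp"/* "$segments_dir"/ && rmdir "$merged_tmp"
}
```

Deleting only after a successful merge means a failed merge leaves the original segments in place. Whether it is safe to drop the merged inputs in every crawling scenario is exactly the open question in this thread.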

On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> It's not needed.. you use the bin/nutch script to generate the initial
> crawl..
>
> details here:
> http://lucene.apache.org/nutch/tutorial8.html#Intranet+Crawling
>
> Fred Tyre wrote:
> > First of all, thanks for the recrawl script.
> > I believe it will save me a few headaches.
> >
> > Secondly, is there a reason that there isn't a crawl script posted on the
> > FAQ?
> >
> > As far as I can tell, you could take your recrawl script and add in the
> > following line after you setup the crawl subdirectories.
> >    $FT_NUTCH_BIN/nutch crawl urls -dir $crawl_dir -threads 2 -depth 3 -topN 50
> >
> > Obviously, the threads, depth and topN could be parameters as well.
> >
> > Thanks again.
> >
> > -----Original Message-----
> > From: Matthew Holt [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, August 08, 2006 2:00 PM
> > To: [email protected]; [email protected]
> > Subject: Re: [Fwd: Re: 0.8 Recrawl script updated]
> >
> >
> > Since it wasn't really clear whether my script approached the problem of
> > deleting segments correctly, I refactored it so it generates the new
> > number of segments, merges them into one, then deletes the "new"
> > segments. Not as efficient disk space wise, but still removes a large
> > number of the segments that are not being referenced by anything due to
> > not being indexed yet.
> >
> > I updated the wiki again. Unless there is any more clarification regarding
> > the issue, hopefully I won't have to bombard your inbox with any more
> > emails regarding this.
> >
> > Matt
> >
> > Lukas Vlcek wrote:
> >
> >> Hi again,
> >>
> >> I just found related discussion here:
> >> http://www.nabble.com/NullPointException-tf2045994r1.html
> >>
> >> I think these guys are discussing a similar problem, and if I understood
> >> the conclusion correctly, then the only solution right now is to write
> >> some code and test which segments are used in the index and which are not.
> >>
> >> Regards,
> >> Lukas
> >>
> >> On 8/4/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> >>
> >>> Matthew,
> >>>
> >>> In fact I didn't realize you were doing the merge step (sorry for that),
> >>> but frankly I don't know exactly how merging works, whether this
> >>> strategy would work in the long term, or whether it is a universal
> >>> approach for all the cases that may occur during crawling (-topN,
> >>> frozen threads, unavailable pages, a crawl that dies, etc.). Maybe it
> >>> is the correct path. I would appreciate it if anybody could answer
> >>> this question precisely.
> >>>
> >>> Thanks,
> >>> Lukas
> >>>
> >>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> If anyone doesn't mind taking a look...
> >>>>
> >>>>
> >>>>
> >>>> ---------- Forwarded message ----------
> >>>> From: Matthew Holt <[EMAIL PROTECTED]>
> >>>> To: [email protected]
> >>>> Date: Fri, 04 Aug 2006 10:07:57 -0400
> >>>> Subject: Re: 0.8 Recrawl script updated
> >>>> Lukas,
> >>>>    Thanks for your e-mail. I assumed I could drop the $depth number of
> >>>> oldest segments because I first merged them all into one segment (which
> >>>> I don't drop). Am I incorrect in my assumption and can this cause
> >>>> problems in the future? If so, then I'll go back to the original version
> >>>> of my script, where I kept all the segments without merging. However, it
> >>>> just seemed that, if that is the case, it will become a problem after
> >>>> enough recrawls because of the large number of segments being kept.
> >>>>
> >>>>  Thanks,
> >>>>   Matt
> >>>>
> >>>> Lukas Vlcek wrote:
> >>>>
> >>>>> Hi Matthew,
> >>>>>
> >>>>> I am curious about one thing. How do you know you can just drop the
> >>>>> $depth oldest segments at the end? I haven't studied the Nutch
> >>>>> code regarding this topic yet, but I thought that a segment can be
> >>>>> dropped only once you are sure that all its content has already been
> >>>>> crawled in some newer segments (which should be checked somehow via
> >>>>> some function/script, which hasn't been implemented yet to my knowledge).
> >>>>> Also I don't think this question has been discussed on the dev/user lists
> >>>>> in detail yet, so I just wanted to ask for your opinion. The
> >>>>> situation could get even more complicated if people add the -topN
> >>>>> parameter to the script (which can happen because some might prefer
> >>>>> crawling in ten smaller batches over two huge crawls for various
> >>>>> technical reasons).
> >>>>>
> >>>>> Anyway, never mind if you don't want to bother with my silly question
> >>>>> :-)
> >>>>>
> >>>>> Regards,
> >>>>> Lukas
> >>>>>
> >>>>> On 8/4/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> Last email regarding this script. I found a bug in it that is sporadic
> >>>>>> (I think it only affected different setups). However, since it would be
> >>>>>> a problem sometimes, I refactored the script. I'd suggest you redownload
> >>>>>> the script if you are using it.
> >>>>>>
> >>>>>> Matt
> >>>>>>
> >>>>>> Matthew Holt wrote:
> >>>>>>
> >>>>>>> I'm currently pretty busy at work. If I have time I'll do it later.
> >>>>>>>
> >>>>>>> The version 0.8 recrawl script has a working version online now. I
> >>>>>>> temporarily modified it on the website yesterday when I ran into some
> >>>>>>> problems, but I further tested it and the actual working code is
> >>>>>>> modified now. So if you got it off the web site any time yesterday, I
> >>>>>>> would redownload the script.
> >>>>>>>
> >>>>>>> Matt
> >>>>>>>
> >>>>>>> Lourival Júnior wrote:
> >>>>>>>
> >>>>>>>> Hi Matthew!
> >>>>>>>>
> >>>>>>>> Could you update the script for version 0.7.2 with the same
> >>>>>>>> functionality? I wrote a script that does this, but it doesn't work
> >>>>>>>> very well...
> >>>>>>>>
> >>>>>>>> Regards!
> >>>>>>>>
> >>>>>>>> On 8/2/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> >>>>>>>>
> >>>>>>>>> Just letting everyone know that I updated the recrawl script on the
> >>>>>>>>> Wiki. It now merges the created segments, then deletes the old segs to
> >>>>>>>>> prevent a lot of unneeded data remaining/growing on the hard drive.
> >>>
> >>>>>>>>>   Matt
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
> >
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>
> >>>>
> >>>>
> >
> >
> >
>
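
Matthew's reply above points out that bin/nutch already covers the initial crawl, so a separate crawl script is not strictly needed. If a wrapper is wanted anyway, Fred's suggestion of making threads, depth and topN parameters might look something like the sketch below. The function name initial_crawl and the DRY_RUN switch are assumptions introduced for illustration. FT_NUTCH_BIN, crawl_dir and the default values come from the quoted command.

```shell
#!/bin/sh
# Sketch only: the initial crawl command from the thread, with
# threads, depth and topN exposed as optional positional parameters.
# initial_crawl and DRY_RUN are hypothetical names for illustration.
initial_crawl() {
    crawl_dir="$1"
    threads="${2:-2}"   # default taken from the quoted command
    depth="${3:-3}"
    topn="${4:-50}"

    # With DRY_RUN set to a non-empty value, print the command instead
    # of running it. FT_NUTCH_BIN falls back to "bin" when unset.
    ${DRY_RUN:+echo} "${FT_NUTCH_BIN:-bin}/nutch" crawl urls -dir "$crawl_dir" \
        -threads "$threads" -depth "$depth" -topN "$topn"
}
```

For example, with DRY_RUN=1 and FT_NUTCH_BIN unset, "initial_crawl mycrawl" prints "bin/nutch crawl urls -dir mycrawl -threads 2 -depth 3 -topN 50", which makes it easy to check the arguments before starting a real crawl.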

