For our index update: we copy the index directory (cp -R) over NFS from the crawler machine to the searcher machine every few hours. The copy is initiated by the crawler application itself (not scheduled by cron).
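As a rough Java sketch of that copy step (a sketch only: the timestamp naming, the destination root, and the "index.done" flag-file name are illustrative assumptions, not Nutch APIs), the important detail is that the flag file is created only after the recursive copy has fully finished:

```java
import java.io.IOException;
import java.nio.file.*;

// Crawler-side push: copy the index into a timestamp-named directory under
// an NFS-mounted root on the searcher, then drop an empty "index.done" flag
// file to mark the copy complete. All names here are illustrative.
public class IndexPush {
    static void copyRecursive(Path src, Path dst) throws IOException {
        try (var paths = Files.walk(src)) {                 // parents visited before children
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path target = dst.resolve(src.relativize(p).toString());
                if (Files.isDirectory(p)) {
                    Files.createDirectories(target);
                } else {
                    Files.copy(p, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Path.of(args[0]);      // local index directory on the crawler
        Path dstRoot = Path.of(args[1]);  // NFS mount pointing at the searcher
        Path dst = dstRoot.resolve(String.valueOf(System.currentTimeMillis()));
        copyRecursive(src, dst);
        // Only after the copy has completed: signal readiness to the searcher.
        Files.createFile(dst.resolve("index.done"));
        System.out.println(dst);
    }
}
```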
On the searcher, the destination of the copied index is a timestamp-named directory. A flag file created at the end of the copy signifies that the copy is complete. The searcher application has a thread that periodically scans this directory for a timestamped index that is newer than the current one and has a complete flag file. If one is found, the searcher switches over to the new index (simply replacing the index instance in the NutchBean used), performs some prewarming queries (although given the diversity of our 15GB index, it is rather difficult to prewarm), and after a set time removes the old index.

Thanks for the ideas in the other subthreads too. Right now we are testing the various methods to see if they work with our current setup, but feel free to drop more hints :)

Regards,
CW

On 4/4/07, cybercouf <[EMAIL PROTECTED]> wrote:
>
> Unfortunately I don't have many solutions for you, because I don't have
> such a big index yet! But it sounds like the weak point is disk access
> during the copy?
> Try to cache the index in memory? (needs a lot of RAM!)
> Or have two HDDs on your searcher, one for the current index, the other
> for the incoming index (so during the copy the other drive can still
> have good access times)?
> Or a really big and powerful RAID system with SCSI 15k disks?
>
> I'm really interested to hear more details about your automated
> configuration: what kind of copy do you use (crontab? batch?), how do
> you detect the new index on the searcher, what are your merge tactics,
> etc. (If you can share, I'm sure it will be useful to lots of Nutch
> beginners like me! Thanks.)
>
> thanks
>
>
> Chun Wei Ho wrote:
> >
> > We are running a search service on the internet using two machines. We
> > have a crawler machine which crawls the web and merges new documents
> > found into the Lucene index. We have a searcher machine which allows
> > users to perform searches on the Lucene index.
> >
> > Periodically, we would copy the newest version of the index from the
> > crawler machine over to the searcher machine (via a copy over an NFS
> > mount). The searcher would then detect the new version, close the old
> > index, open the new index and resume the search service.
> >
> > As the index has been growing in size, we have noticed that the search
> > response time on the searcher machine increases drastically when an
> > index (about 15GB) is being copied from the crawler to the searcher.
> > Both machines run Fedora Core 4 and are on a Gbps LAN.
> >
> > We've tried a number of ways to reduce the impact of the copy over NFS
> > on searching performance, such as "nice"ing the copy process, but to
> > no avail. I wonder if anyone is running a Lucene search service over a
> > similar architecture and how you are managing the updates to the
> > Lucene index.
> >
> > Thanks!
> >
> > Regards,
> > CW
>
> --
> View this message in context:
> http://www.nabble.com/Index-updates-between-machines-tf3514574.html#a9816799
> Sent from the Nutch - User mailing list archive at Nabble.com.
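The searcher-side detection described at the top of the thread (a timestamp-named directory plus a completion flag file) can be sketched roughly as below. This is a sketch under assumptions: the "index.done" flag-file name and the purely numeric directory names are illustrative, and the actual swap into the NutchBean, the prewarming queries, and the delayed deletion of the old index are left out.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.Optional;

// Searcher-side check: find the newest timestamp-named index directory that
// carries a completion flag file and is newer than the index currently in
// use. A background thread would call this periodically and, on a hit,
// swap the new index into the searcher.
public class IndexWatcher {
    static final String FLAG = "index.done";

    // Returns the newest complete index directory newer than currentTs, if any.
    static Optional<Path> findNewer(Path root, long currentTs) throws IOException {
        try (var dirs = Files.list(root)) {
            return dirs.filter(Files::isDirectory)
                       .filter(d -> d.getFileName().toString().matches("\\d+"))
                       .filter(d -> Files.exists(d.resolve(FLAG)))  // copy finished
                       .filter(d -> Long.parseLong(d.getFileName().toString()) > currentTs)
                       .max(Comparator.comparingLong(
                               (Path d) -> Long.parseLong(d.getFileName().toString())));
        }
    }
}
```

Checking for the flag file before ever reading a directory is what keeps an in-progress NFS copy from being opened as a half-written index.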
