Re: [Nutch-dev] NDFS, DistributedSearch - redundant deployment proposal

ogjunk-nutch Thu, 21 Oct 2004 10:39:44 -0700

No specific technical comments, but this is really starting to sound
like the Google FS described in one of those Google papers.


Otis


--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Hi folks,
> 
> First of all, a small clarification about the purpose of NDFS. I had
> some off-the-list conversation with Michael C., after which I came to
> the following conclusions:
> 
> * currently NDFS is just a simple version of a distributed filesystem
> (as the javadoc comments say :)). As such, it offers no special
> support for Nutch or any other search-specific task. It's just a cheap
> pure Java version of Coda or any other distributed FS.
> 
> * the primary goal for NDFS is to provide an easy way for people to
> handle large amounts of data in WebDB and segments, but only when
> doing operations like DB update, analysis, fetchlist generation, and
> fetching.
> 
> * NDFS is _NOT_ suitable for distributing the search indexes, and
> running search servers on indexes that are put on NDFS will kill the
> performance. Searching requires fast local access to the index files.
> 
> So, currently NDFS helps you only as distributed storage for segment
> data, but it does not address the needs of efficient and redundant
> deployment of search indexes over a group of search servers (each of
> them running DistributedSearch$Server, DS$Server for short).
> 
> In other words, you may want to have two separate groups of boxes:
> one group working in NDFS for DB + fetching operations (storage
> boxes), and the other group running DS$Servers (search boxes).
> 
> For high-performance operation one would always want a setup with
> multiple DS$Servers. Currently there is no straightforward way to
> ensure redundancy in a DS$Server group - if you lose one of the boxes,
> a part of your index goes offline until you bring up another box and
> put the missing segment + index on it. Also, it takes too much effort
> and manual labor to deploy the segments/indexes to the search nodes.
> 
> I think we need an efficient and automatic way to do this, and also to
> ensure redundancy in a set of search boxes - but in such a way that
> they are complete and usable as local FS-es for DS$Servers (as opposed
> to the way NDFS currently works, because it works on fixed-size blocks
> of bytes).
> 
> My idea is to operate on units that make sense to DS$Server - which
> means these units must be a part of segment data. Assuming we have the
> tools to cut&paste fragments of segment data as we wish (which we
> have, they just need to be wrapped in command-line tools), I propose
> the following scenario then:
> 
> 1. we handle the DB updates/fetching/parsing as we originally would,
> perhaps using a block-based NDFS for storage, or a SAN, or somesuch.
> 
> 2. I guess we should dedup the data before distributing it, otherwise
> it would be more difficult - but it would be nice to be able to do it
> in step 4...
> 
> 3. then we deploy the segment data to search nodes in the following
> steps:
>       - slice the segment data into units of fixed size
>        (determined from config, or obtained via IPC from individual
>        search nodes). The slicing could be done just linearly, or in
>        a smarter way (e.g. by creating slices sorted by md5 hash)
>       - send each slice to 1 or more DS$Servers (applying some
>        redundancy algo.)
>       - after the segment is deployed, send a command to a
>        selected search node to "mount" the slice.
>       - make a note of the segment locations (in a similar
>        fashion to the NDFS$NameNode), and of which one is active at
>        this moment.
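
A minimal sketch of the slicing and placement described in step 3, to make
the bookkeeping concrete. It is not taken from Nutch: the class and method
names (SlicePlanner, sliceFor, place) are hypothetical, the round-robin
placement is only one possible "redundancy algo.", and assigning pages to
slices by the MD5 of their URL is just one reading of the "smarter" slicing
(the mail speaks of slices sorted by md5 hash, which is not the same thing).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; nothing here is a Nutch API.
final class SlicePlanner {

    /** Picks a slice for a URL from the first 4 bytes of its MD5 digest. */
    static int sliceFor(String url, int sliceCount) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            int h = ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                    | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
            return Math.floorMod(h, sliceCount); // always in [0, sliceCount)
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest not available", e);
        }
    }

    /**
     * Round-robin placement: slice s is stored on nodes s, s+1, ...
     * (modulo the node count). The first node in each list is the one
     * asked to "mount" the slice; the others hold passive copies.
     */
    static List<List<String>> place(int sliceCount, List<String> nodes, int replicas) {
        if (replicas < 1 || replicas > nodes.size()) {
            throw new IllegalArgumentException("replicas must be in 1.." + nodes.size());
        }
        List<List<String>> plan = new ArrayList<>();
        for (int s = 0; s < sliceCount; s++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add(nodes.get((s + r) % nodes.size()));
            }
            plan.add(holders);
        }
        return plan;
    }
}
```

With three search nodes and two copies per slice, place(4, nodes, 2) puts
slice 0 on the first and second node, slice 1 on the second and third, and
so on, so losing any single box leaves one copy of every slice.
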
> 
> 4. now I guess there is some work to do on the search server. The
> newly received slice needs to be indexed and de-duplicated with the
> already existing older slices on the server. It would be nice to have
> some method to do this across the whole cluster of search servers
> before the slices are sent to search servers, but if not, the global
> de-duplication must take place in step 2.
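
A rough sketch of the per-server de-duplication in step 4, assuming that
duplicates are recognised by an MD5 hash of the page content. That
assumption, and the names SliceDeduper, md5Hex and keepNew, are mine and
are not taken from the Nutch code.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; nothing here is a Nutch API.
final class SliceDeduper {

    // Content hashes of every page already held by this search server.
    private final Set<String> seen = new HashSet<>();

    /** Lower-case hex MD5 of the given bytes. */
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest not available", e);
        }
    }

    /**
     * Takes a newly received slice (URL to page content) and returns the
     * URLs whose content has not been seen on this server before. Their
     * hashes are remembered, so later slices are checked against them too.
     */
    List<String> keepNew(Map<String, byte[]> slice) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, byte[]> page : slice.entrySet()) {
            if (seen.add(md5Hex(page.getValue()))) {
                kept.add(page.getKey());
            }
        }
        return kept;
    }
}
```

This only removes duplicates within one server. Duplicates that land on
different servers are not caught, which is why the global pass in step 2
would still be needed.
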
> 
> 5. selected search server "mounts" the newly received and indexed
> slice, and makes it available for searching. Optionally, the new slice
> can be merged into a single segment with other already existing
> slices.
> 
> Much of the logic from NDFS can be reused for selecting the "active" 
> slice, checking the heartbeat and so on.
> 
> Now, if one of the search servers goes down (as detected by heartbeat
> messages), the "name node" sends messages to other search nodes that
> contain segment replicas of the ones on the failed box. The missing
> segment is back online now. The "name node" notes that there are too
> few replicas of this segment, and initiates a transfer of this segment
> to one of the other search boxes (again, the same logic already exists
> in NDFS).
> 
> Additional steps are also needed to "populate" newly added blank boxes
> (e.g. when you replace a failed box, or when you want to increase the
> total number of search nodes), and this logic also is already present
> in NDFS.
> 
> Any comments or suggestions are highly appreciated...
> 
> -- 
> Best regards,
> Andrzej Bialecki
> 
> -------------------------------------------------
> Software Architect, System Integration Specialist
> CEN/ISSS EC Workshop, ECIMF project chair
> EU FP6 E-Commerce Expert/Evaluator
> -------------------------------------------------
> FreeBSD developer (http://www.freebsd.org)
> 
> 
> 
> 
> 
> -------------------------------------------------------
> This SF.net email is sponsored by: IT Product Guide on
> ITManagersJournal
> Use IT products in your business? Tell us what you think of them.
> Give us
> Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find
> out more
> http://productguide.itmanagersjournal.com/guidepromo.tmpl
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
> 



