Hi Andrzej,


A distributed file system is my dream, but I think that before scaling, Nutch needs stability. In my experience (starting from the first Nutch release) on many Linux boxes (Red Hat 7/8/9, Fedora 1/2, dual and single processors) with plenty of RAM (8GB), plenty of disk space (1.6TB, RAID5), and many Java versions, it is impossible to run updatedb against a WebDB of 100,000,000-150,000,000 URLs: it is like a lottery.
I have tried about 20-30 times without success. Sometimes updatedb tries to write data beyond the available disk blocks (see my messages on the list), aborting the ext3 journal. I think that after the demo deployed by Doug and the old core Nutch developer team, my DB was the biggest (50,000,000 pages in 6 segments and about 150,000,000 URLs in the WebDB), but that was the limit, because updatedb refused to update my DB.
Note that I'm a sysadmin, so there were no problems with system configuration: open files, sockets, JVM memory, and so on.
This is only my opinion.


Thx,

Massimo

Andrzej Bialecki wrote:

Hi folks,

First of all, a small clarification about the purpose of NDFS. I had some off-the-list conversation with Michael C., after which I came to the following conclusions:

* currently NDFS is just a simple version of a distributed filesystem (as
the javadoc comments say :)). As such, it offers no special support for
Nutch or any other search-specific task. It's just a cheap pure-Java
version of Coda or any other distributed FS.

* the primary goal for NDFS is to provide an easy way for people to
handle large amounts of data in WebDB and segments, but only when doing
operations like DB update, analysis, fetchlist generation, and fetching.

* NDFS is _NOT_ suitable for distributing the search indexes, and
running search servers on indexes that are put on NDFS will kill
performance. Searching requires fast local access to the index files.

So, currently NDFS helps you only as distributed storage for segment
data; it does not address the need for efficient and redundant deployment of search indexes over a group of search servers (each of them running DistributedSearch$Server, DS$Server for short).


In other words, you may want to have two separate groups of boxes: one group working on NDFS for DB + fetching operations (storage boxes), and the other group running DS$Servers (search boxes).

For high-performance operation one would always want a setup with multiple DS$Servers. Currently there is no straightforward way to ensure redundancy in a DS$Server group - if you lose one of the boxes, a part of your index goes offline until you bring up another box and put the missing segment + index on it. It also takes too much effort and manual labor to deploy the segments/indexes to the search nodes.

I think we need an efficient and automatic way to do this, and also to ensure redundancy in a set of search boxes - but in such a way that the deployed data remains complete and usable on the local filesystems of the DS$Servers (as opposed to the way NDFS currently works, since it operates on fixed-size blocks of bytes).

My idea is to operate on units that make sense to a DS$Server - which means these units must be a part of segment data. Assuming we have the tools to cut&paste fragments of segment data as we wish (which we have, they just need to be wrapped in command-line tools), I propose the following scenario:

1. we handle the DB updates/fetching/parsing as we originally would,
perhaps using a block-based NDFS for storage, or a SAN, or somesuch.

2. I guess we should dedup the data before distributing it, otherwise
it would be more difficult - but it would be nice to be able to do it in
step 4...

3. then we deploy the segments data to search nodes in the following steps
(a rough sketch of this follows after step 5):
     - slice the segments data into units of fixed size
       (determined from config, or obtained via IPC from individual
       search nodes). The slicing could be done just linearly, or in
       a smarter way (e.g. by creating slices sorted by md5 hash)
     - send each slice to 1 or more DS$Servers (applying some
       redundancy algorithm)
     - after the segment is deployed, send a command to a
       selected search node to "mount" the slice
     - make a note of the segment locations (in a similar fashion
       to the NDFS$NameNode), and of which ones are active at this
       moment.


4. now I guess there is some work to do on the search server. The newly
received slice needs to be indexed and de-duplicated with the already
existing older slices on the server. It would be nice to have some
method to do this across the whole cluster of search servers before the
slices are sent to search servers, but if not, the global de-duplication
must take place in step 2.

5. selected search server "mounts" the newly received and indexed slice, and makes it available for searching. Optionally, the new slice can be merged into a single segment with other already existing slices.
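
To make step 3 a bit more concrete, here is a minimal sketch in plain Java. Everything in it is an assumption of mine, not existing Nutch code: segment entries are reduced to their MD5 keys, slice size and replication factor are taken as parameters, and replica placement is a naive round-robin. One pleasant side effect of sorting by md5 is that exact duplicates end up adjacent, which would make the dedup of steps 2/4 a linear scan.

import java.util.*;

// Hypothetical sketch of step 3: cut md5-sorted entries into fixed-size
// slices and assign each slice to several search nodes (assumes the
// replication factor is <= the number of nodes).
public class SliceCutter {

  // Sort entries by MD5 and cut them into slices of at most sliceSize.
  static List<List<String>> cutSlices(List<String> md5Keys, int sliceSize) {
    List<String> sorted = new ArrayList<>(md5Keys);
    Collections.sort(sorted);  // the "smarter" slicing: sorted by md5
    List<List<String>> slices = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i += sliceSize) {
      slices.add(new ArrayList<>(
          sorted.subList(i, Math.min(i + sliceSize, sorted.size()))));
    }
    return slices;
  }

  // Assign each slice to `replicas` distinct nodes, round-robin, and record
  // the placement the way an NDFS$NameNode records block locations.
  static Map<Integer, List<String>> assignReplicas(int numSlices,
                                                   List<String> nodes,
                                                   int replicas) {
    Map<Integer, List<String>> placement = new HashMap<>();
    for (int s = 0; s < numSlices; s++) {
      List<String> targets = new ArrayList<>();
      for (int r = 0; r < replicas; r++) {
        targets.add(nodes.get((s + r) % nodes.size()));
      }
      placement.put(s, targets);
    }
    return placement;
  }

  public static void main(String[] args) {
    List<String> keys =
        Arrays.asList("c3a1...", "0af2...", "ff90...", "7b3c...", "911d...");
    List<List<String>> slices = cutSlices(keys, 2);
    Map<Integer, List<String>> placement =
        assignReplicas(slices.size(), Arrays.asList("node1", "node2", "node3"), 2);
    System.out.println("slices    = " + slices);
    System.out.println("placement = " + placement);
  }
}

Running main() on a handful of fake keys prints the slices and the kind of placement map the "name node" would have to remember.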

Much of the logic from NDFS can be reused for selecting the "active" slice, checking the heartbeat and so on.

Now, if one of the search servers goes down (as detected by heartbeat messages), the "name node" sends messages to the other search nodes that hold replicas of the segments from the failed box, telling them to "mount" those replicas. The missing segments are back online. The "name node" also notes that there are now too few replicas of these segments, and initiates a transfer to one of the other search boxes (again, the same logic already exists in NDFS).

Additional steps are also needed to "populate" newly added blank boxes (e.g. when you replace a failed box, or when you want to increase the total number of search nodes); this logic is also already present in NDFS. A rough sketch of both cases follows below.
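
Here is a rough sketch of that name-node-like bookkeeping, covering both the failover and the new-box cases. Every name, timeout, and data structure in it is my own assumption, not actual NDFS code.

import java.util.*;

// Hypothetical tracker: slice -> replica locations plus heartbeat
// timestamps; on a missed heartbeat it re-activates surviving replicas
// and restores the replication level; a new blank box gets populated
// with under-replicated slices.
public class SliceTracker {

  static final int REPLICATION = 2;                // desired replicas per slice
  static final long HEARTBEAT_TIMEOUT_MS = 30000;  // assumed timeout

  final Map<String, Set<String>> replicasOf = new HashMap<>(); // slice -> nodes
  final Map<String, Long> lastHeartbeat = new HashMap<>();     // node -> millis

  void heartbeat(String node) {
    lastHeartbeat.put(node, System.currentTimeMillis());
  }

  // Called periodically: drop dead nodes and restore the replication level.
  void checkHeartbeats() {
    long now = System.currentTimeMillis();
    Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> e = it.next();
      if (now - e.getValue() > HEARTBEAT_TIMEOUT_MS) {
        String dead = e.getKey();
        it.remove();
        for (Map.Entry<String, Set<String>> se : replicasOf.entrySet()) {
          if (se.getValue().remove(dead) && se.getValue().size() < REPLICATION) {
            // the surviving replicas stay "mounted", so the slice remains
            // searchable; schedule a copy to bring redundancy back up
            transfer(se.getKey(), pickTarget(se.getValue()));
          }
        }
      }
    }
  }

  // A new blank box: register it and move under-replicated slices onto it.
  void registerNode(String node) {
    heartbeat(node);
    for (Map.Entry<String, Set<String>> se : replicasOf.entrySet()) {
      if (se.getValue().size() < REPLICATION && !se.getValue().contains(node)) {
        transfer(se.getKey(), node);
      }
    }
  }

  private String pickTarget(Set<String> exclude) {
    for (String n : lastHeartbeat.keySet()) {
      if (!exclude.contains(n)) return n;
    }
    return null; // no spare box available right now
  }

  private void transfer(String slice, String target) {
    if (target == null) return;
    // placeholder: real code would ship the slice files to the target
    // and then tell it to "mount" the slice
    replicasOf.get(slice).add(target);
    System.out.println("re-replicating slice " + slice + " -> " + target);
  }
}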

Any comments or suggestions are highly appreciated...


