The "DistributedWebDB" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/DistributedWebDB?action=diff&rev1=2&rev2=3

   
  It's worthwhile to spend a little time on how the distributed WebDBWriter's 
inter-process communication works. No process ever opens a socket to 
communicate directly with another. Rather, all data is communicated via files. 
However, since the WebDBWriters exist over many machines and filesystems, these 
files need to be copied back and forth.
   
- We do that via the NutchFileSystem. Really, "filesystem" is overstating the 
case a little bit. Rather it's a mechanism for a shared file namespace, with 
some automatic file copying between machines that announce interest in that 
namespace.
+ We do that via the NutchDistributedFileSystem. Really, "filesystem" is 
overstating the case a little bit. Rather it's a mechanism for a shared file 
namespace, with some automatic file copying between machines that announce 
interest in that namespace.
   
- Every object under control of a NutchFileSystem machine group is represented 
with a "NutchFile" object. A NutchFile is named using three different 
parameters:
+ Every object under control of a NutchDistributedFileSystem machine group is 
represented with a "NutchFile" object. A NutchFile is named using three 
different parameters:
   
   * "dbName" indicates the overall database that the file belongs to. This 
exists to enable multiple Nutch instances simultaneously on the same machine 
set. All files within a given instance will have the same dbName.
   
-  * "shareGroupName" is used to control where the NutchFile will be copied. 
Clients of the NutchFileSystem ask for NutchFiles via sharegroup. Each machine 
in a NutchFileSystem machine group is configured so that it knows the entire 
(sharegroup<->machine) mapping.  For the purposes of the distributed WebDB, we 
create a sharegroup for each partition of the webdb. When a single WebDBWriter 
is emitting k separate edits files, it is writing to files in k different 
sharegroups. Having written everything out, the Writer demands to see any files 
meant for its particular segment's sharegroup. (We will also create a "master" 
sharegroup to contain the final db output.)
+  * "shareGroupName" is used to control where the NutchFile will be copied. 
Clients of the NutchDistributedFileSystem ask for NutchFiles via sharegroup. 
Each machine in a NutchDistributedFileSystem machine group is configured so 
that it knows the entire (sharegroup<->machine) mapping.  For the purposes of 
the distributed WebDB, we create a sharegroup for each partition of the webdb. 
When a single WebDBWriter is emitting k separate edits files, it is writing to 
files in k different sharegroups. Having written everything out, the Writer 
demands to see any files meant for its particular segment's sharegroup. (We 
will also create a "master" sharegroup to contain the final db output.)
   
   * "name" is just an arbitrary slash-separated filename. It describes a 
directory/filename hierarchy for the NutchFile in question.
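
The sharegroup routing described in the list above can be sketched roughly as follows. This is only an illustration, not the Nutch implementation: the class name ShareGroupSketch, the "partition-N" naming scheme, the hash-based choice of partition, and the host names are all assumptions introduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a (sharegroup <-> machine) mapping with k partitions.
class ShareGroupSketch {
    // Every machine in the group is assumed to hold a full copy of this mapping.
    private final Map<String, List<String>> machinesByGroup = new HashMap<>();
    private final int numPartitions;

    ShareGroupSketch(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Record which machines have announced interest in a sharegroup.
    void assign(String shareGroupName, List<String> machines) {
        machinesByGroup.put(shareGroupName,
                Collections.unmodifiableList(new ArrayList<>(machines)));
    }

    // Pick the sharegroup for an edit key; hashing is an assumed partitioning rule.
    String shareGroupFor(String key) {
        int partition = Math.floorMod(key.hashCode(), numPartitions);
        return "partition-" + partition;
    }

    // Machines that should receive files written to the given sharegroup.
    List<String> machinesFor(String shareGroupName) {
        List<String> machines = machinesByGroup.get(shareGroupName);
        return machines == null ? Collections.<String>emptyList() : machines;
    }

    public static void main(String[] args) {
        ShareGroupSketch groups = new ShareGroupSketch(4);
        for (int i = 0; i < 4; i++) {
            groups.assign("partition-" + i, Arrays.asList("host" + i));
        }
        groups.assign("master", Arrays.asList("host0"));

        String group = groups.shareGroupFor("http://example.com/");
        System.out.println(group + " -> " + groups.machinesFor(group));
    }
}
```

A writer emitting k edits files would call shareGroupFor once per edit and append the edit to the file for that sharegroup; the "master" entry stands in for the sharegroup that holds the final db output.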
   
- A NutchFile object can be resolved to a "real-world" disk File with the help 
of a local NutchFileSystem object. Each machine in a NutchFileSystem machine 
group has a NutchFileSystem object that handles configuration and other 
services. One such config value is the place on a local disk where the "root" 
of the NutchFileSystem is found. The disk File embodiment of a NutchFile is a 
combination of that root, the shareGroupName, and the name.
+ A NutchFile object can be resolved to a "real-world" disk File with the help 
of a local NutchDistributedFileSystem object. Each machine in a 
NutchDistributedFileSystem machine group has a NutchDistributedFileSystem 
object that handles configuration and other services. One such config value is 
the place on a local disk where the "root" of the NutchDistributedFileSystem is 
found. The disk File embodiment of a NutchFile is a combination of that root, 
the shareGroupName, and the name.
   
  (Of course, not all sharegroups' files will be present on each machine. That 
depends on the (sharegroup<->machine) mapping.)
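
As a minimal sketch of the resolution step described above, the following Java class combines a local root, the shareGroupName, and the slash-separated name into a disk path. The class NutchFileSketch, its field layout, and the example root /data/ndfs are hypothetical; the actual NutchFile class may look quite different.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in for a NutchFile: three naming parameters, no I/O.
final class NutchFileSketch {
    final String dbName;          // overall database the file belongs to
    final String shareGroupName;  // controls where the file is copied
    final String name;            // arbitrary slash-separated filename

    NutchFileSketch(String dbName, String shareGroupName, String name) {
        this.dbName = dbName;
        this.shareGroupName = shareGroupName;
        this.name = name;
    }

    // Disk location = local root + shareGroupName + name, as the text describes.
    Path resolve(Path localRoot) {
        Path result = localRoot.resolve(shareGroupName);
        for (String part : name.split("/")) {
            if (!part.isEmpty()) {
                result = result.resolve(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NutchFileSketch file = new NutchFileSketch("webdb", "partition-3", "edits/0001");
        System.out.println(file.resolve(Paths.get("/data/ndfs")));
    }
}
```

Because the root is a per-machine config value, the same NutchFile can resolve to different disk locations on different machines.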
   
- The NutchFileSystem also takes care of file moves, deletes, locking, and 
guaranteed atomicity.
+ The NutchDistributedFileSystem also takes care of file moves, deletes, 
locking, and guaranteed atomicity.
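
The page does not say how atomicity is achieved. One common local-disk technique, shown here purely as an assumption and not as the NutchDistributedFileSystem's method, is to write to a temporary file in the target directory and then rename it into place. The class AtomicPublishSketch and its publish method are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: make a file appear under its final name in one step.
final class AtomicPublishSketch {
    static void publish(Path target, byte[] contents) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Files.createDirectories(dir);
        // The temp file lives in the same directory so the move stays on one filesystem.
        Path tmp = Files.createTempFile(dir, "tmp-", ".part");
        try {
            Files.write(tmp, contents);
            // Readers see either no file or the complete file, never a partial one.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up if the write or move failed.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ndfs-sketch");
        Path target = dir.resolve("partition-3").resolve("edits-0001");
        publish(target, "example edit".getBytes(StandardCharsets.UTF_8));
        System.out.println("published " + target);
    }
}
```

Whether an existing target is replaced by an atomic move depends on the platform, so a real implementation would need to decide how to handle that case.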
   
- It should be clear that the NutchFileSystem can be implemented across any 
group of machines that have mutual remote-access rights. It can also be used 
across machines that share mutual Network File System mounts.
+ It should be clear that the NutchDistributedFileSystem can be implemented 
across any group of machines that have mutual remote-access rights. It can also 
be used across machines that share mutual Network File System mounts.
  
