Gal Nitzan wrote:
> 1. If NDFS is too slow and all data must be copied to the local FS, why
> use it in the first place?
NDFS is more or less part of the map/reduce system. It is needed because
you have to store a large amount of data in a way that all tasktrackers
can access it. Another reason is the reliability of the map/reduce
system. With the default settings each block in NDFS is replicated on
three different machines, so when machines fail the system is still able
to run jobs. The tasktrackers copy the small chunks of data to their
local disks for fast access while running a task, and afterwards the
results are copied back into NDFS.
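
To illustrate the "all tasktrackers can access it" part: the nodes find
the shared filesystem through their configuration. A minimal sketch of
the relevant entries in nutch-site.xml could look like the following
(hostnames and ports are made up, and the exact property names may differ
between versions, so please check your nutch-default.xml):

  <property>
    <name>fs.default.name</name>
    <value>namenode.example.com:9000</value>
    <description>All nodes point at the same NDFS namenode; set this
    to "local" to use the local filesystem instead.</description>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
    <description>Where the tasktrackers contact the jobtracker.</description>
  </property>

Because every tasktracker reads the same fs.default.name, they all see
the same NDFS and can fetch the blocks they need for their tasks.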
When you want to search the data you need fast access to the index and
also to the segments used by that index. This is why you want to copy
that data out of NDFS onto the local disks of the search nodes.
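
The copying itself can be done with the NDFS shell. A rough example
(the paths are invented, and you should run bin/nutch ndfs without
arguments to see which commands your version actually supports):

  bin/nutch ndfs -ls /user/crawl
  bin/nutch ndfs -get /user/crawl/indexes /local/search/indexes
  bin/nutch ndfs -get /user/crawl/segments /local/search/segments

After that the search server only touches the local copies, so query
latency does not depend on NDFS at all.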
> 2. If using NDFS and HD, don't you get 4 copies of the same data?
Yes, and when running map/reduce jobs you also get a lot of temporary
data on top of that. As said before, the redundancy is needed for
reliability, and it can also increase the performance of the map/reduce
system.
> 3. Assuming the data is 3 TB, how do you split the data to be read by
> the searcher when not using NDFS?
You can create multiple indexes and use multiple search servers. You
copy each of these indexes, together with its segments, to one of the
search servers. See for example
http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever
for more details.
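
As a rough sketch of how the pieces fit together (hostnames, ports and
paths below are invented, the wiki page describes the real setup): on
every search node you put one index together with its segments into a
local directory and start a search server there, e.g.

  bin/nutch server 9999 /local/search/crawl

On the machine running the web frontend you point searcher.dir in
nutch-site.xml at a directory containing a search-servers.txt file that
lists one host and port per line:

  searchnode1 9999
  searchnode2 9999
  searchnode3 9999

The frontend then queries all listed servers and merges the results, so
each node only has to hold a fraction of the 3 TB.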
best regards,
Dominik