Hi Thomas,
I suggest using the mapreduce branch in any case; it requires NDFS.
It works quite well, but there are still some smaller problems.
The only really big problem is that storing the index itself on NDFS is
too slow.
But there will be a solution soon. :)
So check out and try the mapreduce branch.
HTH
Stefan
On 18.11.2005, at 17:31, Thomas Delnoij wrote:
Thanks Stefan.
I am asking because, after experimenting with Nutch for a while, I am
now about to start setting up a publicly accessible vertical search
engine, indexing up to 100 million pages. I know that in the future I
will have to move to NDFS anyway, but I am thinking of postponing this
if there is a clear transition path from the Local FileSystem.
What would your general advice in such a case be: to start with NDFS
right away?
Also, this document
http://wiki.apache.org/nutch/NutchDistributedFileSystem says that NDFS
by default uses a replication rate of 2 copies. Based on that, if I
want to calculate how much space would be needed for fetching and
indexing that amount of pages, how would I go about that?
Suppose I set up 3 datanodes, each with 300 GB of storage space, and 1
namenode; that would mean that in practice I would have 900/2 = 450 GB
of storage.
So for 100,000,000 pages, averaging 10 KB each, I would need up to
2000 GB of storage on my datanodes?
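To make my arithmetic explicit, here is a rough sketch of the
calculation I have in mind; the ~10 KB average page size is my own
assumption, and this only counts the fetched content itself, nothing
else Nutch might store on top of it:

    # rough capacity arithmetic (assumptions: ~10 KB per page, replication of 2)
    pages = 100000000            # target crawl size: 100 million pages
    page_size_kb = 10.0          # assumed average fetched page size
    replication = 2              # NDFS default, per the wiki page

    raw_gb = pages * page_size_kb / 1e6        # ~1000 GB of raw fetched content
    on_disk_gb = raw_gb * replication          # ~2000 GB spread over the datanodes

    datanodes = 3
    disk_per_node_gb = 300
    usable_gb = datanodes * disk_per_node_gb / replication   # 900 / 2 = 450 GB usable

    print(raw_gb, on_disk_gb, usable_gb)       # 1000.0 2000.0 450.0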
Thanks for your help.
Thomas Delnoij
On 11/13/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
Hi,
On 13.11.2005, at 12:58, Thomas Delnoij wrote:
I have studied the available documentation and the mailing list
archive, but I could not find an answer to these questions:
1) is it possible to migrate / convert a Local Filesystem to NDFS?
Yes, there is a tool to copy local files to NDFS.
2) is it possible to use NDFS without using MapReduce? I would like to
stick to using release 0.7.1.
Yes, but the mapred branch has the latest bug fixes and improvements
for NDFS.
3) what happens to Pages that cannot be parsed (for instance
content-type: image/jpg); are they kept in WebDB or are they removed?
All pages are kept in the WebDB, since it is important to know which
pages are already known.
Thanks for your help.
Welcome.
HTH
Stefan
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net