Hi Thomas,
I suggest using the map-reduce branch in any case; it requires the NDFS.
It works quite well, but there are still some smaller problems.

The only really big problem is that storing the index itself on the NDFS is too slow.
But there will be a solution soon. :)


So check out and try the map-reduce branch.
HTH
Stefan


On 18.11.2005, at 17:31, Thomas Delnoij wrote:

Thanks Stefan.

I am asking because, after experimenting with Nutch for a while, I am now about to start setting up a publicly accessible vertical search engine, indexing up to 100 million pages. I know that in the future I will have to move to NDFS anyway, but I am considering postponing this if there is a clear
transition path from the local file system.

What would your general advice in such a case be: to start with NDFS right
away?

Also, this document
http://wiki.apache.org/nutch/NutchDistributedFileSystem says that NDFS
by default uses a replication rate of 2 copies. Based on
that, if I want to calculate how much space would be needed for fetching and
indexing that amount of pages, how would I go about that?

Suppose I set up 3 datanodes, each with 300 GB of storage space, and 1 namenode; that would mean that in practice I would have 900/2 = 450 GB of storage.

So for 100,000,000 pages, averaging 10 KB each, I would need up to 2000 GB of
storage on my datanodes?
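The back-of-the-envelope math above can be sketched as a quick script. All figures come from the mail itself (3 datanodes of 300 GB, the replication factor of 2 cited from the wiki page, 100 million pages at ~10 KB each); the variable names are purely illustrative:

```python
# Rough NDFS capacity estimate, using the numbers from the mail above.
DATANODES = 3
DISK_PER_NODE_GB = 300
REPLICATION = 2            # NDFS default, per the wiki page cited above

PAGES = 100_000_000
AVG_PAGE_KB = 10

# Raw disk across the cluster, and what remains usable after replication.
raw_gb = DATANODES * DISK_PER_NODE_GB        # 900 GB raw
usable_gb = raw_gb / REPLICATION             # 450 GB usable

# Space the fetched pages need: raw content times the replication factor.
content_gb = PAGES * AVG_PAGE_KB / 1_000_000  # 1000 GB of raw content
stored_gb = content_gb * REPLICATION          # 2000 GB across the datanodes

print(f"usable: {usable_gb} GB, required: {stored_gb} GB")
```

So yes, under these assumptions the 2000 GB figure follows directly: the three 300 GB nodes would fall well short, since only 450 GB is usable after replication.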

Thanks for your help.

Thomas Delnoij



On 11/13/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:

Hi,

On 13.11.2005, at 12:58, Thomas Delnoij wrote:

I have studied the available documentation and the mailing list archive, but
I could not find an answer to these questions:

1) is it possible to migrate / convert a Local Filesystem to NDFS?
Yes, there is a tool to copy local files to the NDFS.
2) is it possible to use NDFS without using MapReduce? I would like to stick
to using release 0.7.1.
Yes, but the mapred branch has the latest bug fixes and improvements
for the NDFS.
3) what happens to pages that cannot be parsed (for instance
content-type:
image/jpg); are they kept in the WebDB or are they removed?
All pages are kept in the WebDB, since it is important to know which pages are
already known.

Thanks for your help.
You're welcome.

HTH
Stefan


---------------------------------------------------------------
company:  http://www.media-style.com
forum:    http://www.text-mining.org
blog:     http://www.find23.net

