Hi Thomas,
I suggest using the mapreduce branch in any case; it requires NDFS.
It works quite well, but there are still some smaller problems.
The only really big problem is that storing the index itself on NDFS is
too slow.
But there will be a solution soon. :)
So check out and try the mapreduce branch.
HTH
Stefan
On 18.11.2005, at 17:31, Thomas Delnoij wrote:
Thanks Stefan.
I am asking because, after experimenting with Nutch for a while, I am
now about to start setting up a publicly accessible vertical search
engine, indexing up to 100 million pages. I know that in the future I
will have to move to NDFS anyway, but I am thinking of postponing this
if there is a clear transition path from the Local FileSystem.
What would your general advice in such a case be: to start with NDFS
right away?
Also, this document
http://wiki.apache.org/nutch/NutchDistributedFileSystem says that NDFS
by default uses a replication rate of 2 copies. Based on that, if I
want to calculate how much space would be needed for fetching and
indexing that amount of pages, how would I go about that?
Suppose I set up 3 datanodes, each with 300 GB of storage space, and 1
namenode; that would mean that in practice I would have 900/2 = 450 GB
of storage.
So for 100,000,000 pages, averaging 10 KB each, I would need up to
2000 GB of storage on my datanodes?
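To make my arithmetic explicit, here is a rough sketch of the
calculation I have in mind; the ~10 KB average page size is my own
assumption, and this only counts the fetched content itself, nothing
else Nutch might store on top of it:

    # rough capacity arithmetic (assumptions: ~10 KB per page, replication of 2)
    pages = 100000000            # target crawl size: 100 million pages
    page_size_kb = 10.0          # assumed average fetched page size
    replication = 2              # NDFS default, per the wiki page

    raw_gb = pages * page_size_kb / 1e6        # ~1000 GB of raw fetched content
    on_disk_gb = raw_gb * replication          # ~2000 GB spread over the datanodes

    datanodes = 3
    disk_per_node_gb = 300
    usable_gb = datanodes * disk_per_node_gb / replication   # 900 / 2 = 450 GB usable

    print(raw_gb, on_disk_gb, usable_gb)       # 1000.0 2000.0 450.0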
Thanks for your help.
Thomas Delnoij
On 11/13/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
Hi,
On 13.11.2005, at 12:58, Thomas Delnoij wrote:
I have studied the available documentation and the mailing list
archive, but I could not find an answer to these questions:
1) is it possible to migrate / convert a Local Filesystem to NDFS?
Yes, there is a tool to copy local files to NDFS.
2) is it possible to use NDFS without using MapReduce? I would like to
stick to using release 0.7.1.
Yes, but the mapred branch has the latest bug fixes and improvements
for NDFS.
3) what happens to Pages that cannot be parsed (for instance
content-type: image/jpg); are they kept in WebDB or are they removed?
All pages are kept in the WebDB, since it is important to know which
pages are already known.
Thanks for your help.
Welcome.
HTH
Stefan
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net