Thanks Stefan.

I am asking because after experimenting with Nutch for a while, I am now
about to start setting up a publicly accessible vertical search engine,
indexing up to 100 million pages. I know that in the future I have to move
to using NDFS anyway, but I am thinking to postpone this if there is a clear
transition path from using the Local FileSystem.
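For what it's worth, here is roughly how I picture that copy step from your
earlier answer working in code. The class and method names below
(NutchFileSystem, copyFromLocalFile) are only my guess at the 0.7.x
org.apache.nutch.fs API, and the paths are invented, so please correct me if
the copy tool works differently:

    import java.io.File;
    import org.apache.nutch.fs.NutchFileSystem;

    public class CopyCrawlToNdfs {
      public static void main(String[] args) throws Exception {
        // Assumption: obtain the NDFS client for the namenode configured
        // in nutch-site.xml (fs.default.name).
        NutchFileSystem ndfs = NutchFileSystem.get();
        // Push an existing local crawl directory into NDFS
        // (both paths are made-up examples).
        ndfs.copyFromLocalFile(new File("/data/crawl"), new File("/user/nutch/crawl"));
        ndfs.close();
      }
    }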

What would your general advice in such a case be: to start with NDFS right
away?

Also, the document at
http://wiki.apache.org/nutch/NutchDistributedFileSystem says that NDFS
by default uses a replication factor of 2. Based on that, if I want to
calculate how much space would be needed for fetching and indexing that
amount of pages, how would I go about it?

Suppose I set up 3 datanodes, each with 300 GB of storage space, plus 1
namenode. With 2 copies of every block, that would mean I effectively have
900/2 = 450 GB of usable storage.

So for 100,000,000 pages averaging 10 KB each, that is roughly 1,000 GB of
unique data, and with replication 2 I would need up to 2,000 GB of raw
storage on my datanodes?
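Spelled out, my back-of-the-envelope arithmetic is just the following
(nothing Nutch-specific, only the page count, average size and replication
factor I assumed above):

    public class NdfsStorageEstimate {
      public static void main(String[] args) {
        long pages = 100000000L;       // 100 million pages
        long avgPageBytes = 10 * 1000; // ~10 KB of fetched content per page
        int replication = 2;           // NDFS default replication per the wiki page

        long uniqueBytes = pages * avgPageBytes;      // ~1,000 GB of unique data
        long storedBytes = uniqueBytes * replication; // ~2,000 GB across the datanodes
        System.out.println("unique data   : " + uniqueBytes / 1000000000L + " GB");
        System.out.println("with replicas : " + storedBytes / 1000000000L + " GB");
      }
    }

If that arithmetic is right, the three 300 GB datanodes above would clearly
not be enough.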

Thanks for your help.

Thomas Delnoij



On 11/13/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Am 13.11.2005 um 12:58 schrieb Thomas Delnoij:
>
> > I have studied the available documentation and the mailing list
> > archive, but
> > I could not find an answer to these questions:
> >
> > 1) is it possible to migrate / convert a Local Filesystem to NDFS?
> yes, there is a tool to copy local files to NDFS
> > 2) is it possible to use NDFS without using MapReduce? I would like
> > to stick
> > to using release 0.7.1.
> yes, but the mapred branch has the latest bug fixes and improvements
> for NDFS.
> > 3) what happens to Pages that cannot be parsed (for instance
> > content-type:
> > image/jpg); are they kept in WebDB or are they removed?
> All pages are kept in the webdb, since it is important to know which
> pages are already known.
> >
> > Thanks for your help.
> Welcome.
>
> HTH
> Stefan
>
