Thanks Stefan. I am asking because, after experimenting with Nutch for a while, I am now about to start setting up a publicly accessible vertical search engine, indexing up to 100 million pages. I know that in the future I will have to move to NDFS anyway, but I am thinking of postponing this if there is a clear transition path from the local filesystem.
What would your general advice in such a case be: to start with NDFS right away?

Also, this document http://wiki.apache.org/nutch/NutchDistributedFileSystem says that NDFS by default uses a replication rate of 2 copies. Based on that, if I want to calculate how much space would be needed for fetching and indexing that number of pages, how would I go about that? Suppose I set up 3 datanodes, each with 300 GB of storage, and 1 namenode; that would mean that in practice I would have 900 / 2 = 450 GB of effective storage. So for 100,000,000 pages, averaging 10 KB each, I would need up to 2000 GB of storage on my datanodes? (A worked-out sketch of this calculation follows the quoted message below.)

Thanks for your help.

Thomas Delnoij

On 11/13/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> On 13.11.2005 at 12:58, Thomas Delnoij wrote:
>
> > I have studied the available documentation and the mailing list
> > archive, but I could not find an answer to these questions:
> >
> > 1) Is it possible to migrate / convert a local filesystem to NDFS?
> Yes, there is a tool to copy local files to the NDFS.
>
> > 2) Is it possible to use NDFS without using MapReduce? I would like
> > to stick to using release 0.7.1.
> Yes, but the mapred branch has the latest bug fixes and improvements
> for the NDFS.
>
> > 3) What happens to pages that cannot be parsed (for instance
> > content-type: image/jpg); are they kept in the WebDB or are they removed?
> All pages are kept in the WebDB, since it is important to know which
> pages are already known.
>
> > Thanks for your help.
> Welcome.
>
> HTH
> Stefan
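
For what it's worth, here is the back-of-the-envelope calculation above as a small Java sketch. The class name NdfsStorageEstimate is just illustrative; it assumes a 10 KB average raw page size and the default NDFS replication factor of 2, and it ignores any additional space used by segments and indexes.

    public class NdfsStorageEstimate {
        private static final long GB = 1000L * 1000 * 1000;

        public static void main(String[] args) {
            long pages = 100000000L;         // 100 million pages
            long avgPageBytes = 10 * 1000L;  // assumed average of 10 KB per page
            int replication = 2;             // NDFS default replication factor

            long rawBytes = pages * avgPageBytes;          // ~1000 GB of raw content
            long replicatedBytes = rawBytes * replication; // ~2000 GB actually written

            // Cluster in the question: 3 datanodes x 300 GB = 900 GB raw,
            // i.e. 900 / 2 = 450 GB of effective space at replication 2.
            long clusterBytes = 3 * 300 * GB;

            System.out.println("Raw content:        " + (rawBytes / GB) + " GB");
            System.out.println("Stored (x" + replication + "):        " + (replicatedBytes / GB) + " GB");
            System.out.println("Cluster capacity:   " + (clusterBytes / GB) + " GB");
            System.out.println("Effective capacity: " + (clusterBytes / replication / GB) + " GB");
        }
    }

Under those assumptions, the three 300 GB datanodes (450 GB effective) would fall well short of the roughly 2000 GB needed for the raw content alone.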
