Karol Rybak wrote:
>> Karol Rybak wrote:
>> > Hello, I have some questions about Nutch in general. I need to create a
>> > simple web crawler, however we want to index a lot of documents; it'll
>> > probably be about 100 million in the future. I have a couple of servers I can
>>
>> 100 million pages = 50-100 servers and 20-40T of space distributed.
>> Ideally the setup would be processing machines and search servers. You
>> would have, say, 50 or so processing machines that would handle the
>> crawling, indexing, mapreduce, and DFS. Then you would have 50 more
>> somewhat less powerful (possibly) servers that handle just serving the
>> search. You can get away with having the processing and search servers
>> on the same machines, but the search will slow down considerably while
>> running large jobs.
>
> Hello, thanks for your answer. 20-40T of space seems large; the question is,
> do you store the fetched files, or just the indexes? I don't want to maintain
> local storage, I need only indexing...
You need space to store the fetched documents (segments). Even when compressed, 100M documents take a lot of space. You are also going to have a crawldb, a linkdb, and indexes, which effectively doubles the amount of space you need. This will have to be on a DFS, because no single machine can handle the load and because RAID at this level is prohibitively expensive. On the DFS you are going to replicate your data blocks a minimum of 3 times for redundancy, so you just tripled your space.

You will still need space on the machines for processing subsequent jobs, unless you plan to delete all of the databases and start from scratch every time, which isn't advised. So for sorts and other mapreduce job processing you will want to leave approximately 30% of the space open on each box; depending on the jobs you are running, you may need more. If you are using the same boxes as search servers, you will then have to copy the indexes from the DFS to local disk, which doubles the space needed for the indexes.

The estimate that we use is 100-200G for every 1M pages indexed. You probably can get away with 50G per 1M pages, but we have large computational jobs running and we don't want to run out of space. A rough calculation would be ~4G of compressed content per 1M pages fetched initially, or 4K compressed per fetched page. So 4G * 2 for crawl, link, and indexing = 8G; * 3 for DFS replication = 24G; * 1.3 for processing space = ~31G; + 4G for local indexes = ~35G per 1M pages, which is why the 50G per 1M figure above already has some cushion in it.

You said above that you don't want local storage, but search has to be on local file systems. While you may technically be able to pull a search result from the DFS, you will almost certainly run out of memory, and the search will take an excessively long time (minutes, not subsecond) if it returns at all.

Search is a hardware-intensive business, in part because of the number of servers needed to serve large indexes. If anybody knows of a better way to set up a search architecture than 2-4M pages per index per search server, I would love to hear about it. The suggestions above on space and architecture are what we have experienced.

Dennis Kubes
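P.S. If it helps to rerun that arithmetic with your own numbers, here is a rough sketch in Python. It just restates the per-1M-pages estimates from this message; every constant is an assumption taken from the thread, not something Nutch will measure for you, so adjust them to match your own crawl and hardware.

# Back-of-the-envelope space estimate per 1M fetched pages, using the
# numbers from this thread. All constants are assumptions -- adjust them
# to your own crawl.

COMPRESSED_CONTENT_GB = 4.0  # ~4K compressed per fetched page * 1M pages
DB_INDEX_FACTOR = 2.0        # crawldb + linkdb + indexes roughly double it
DFS_REPLICATION = 3          # minimum block replication on the DFS
PROCESSING_HEADROOM = 1.3    # keep ~30% free for sorts and mapreduce jobs
LOCAL_INDEX_GB = 4.0         # indexes copied from the DFS to local disk for search

dfs_gb = COMPRESSED_CONTENT_GB * DB_INDEX_FACTOR * DFS_REPLICATION
total_gb = dfs_gb * PROCESSING_HEADROOM + LOCAL_INDEX_GB

print("DFS space per 1M pages:   ~%.0fG" % dfs_gb)    # ~24G
print("Total space per 1M pages: ~%.0fG" % total_gb)  # ~35G
# The 100-200G per 1M pages working estimate above adds headroom for
# large recurring jobs on top of this minimum.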
