Karol Rybak wrote:
>> Karol Rybak wrote:
>> > Hello, I have some questions about Nutch in general. I need to create a
>> > simple web crawler, however we want to index a lot of documents; it'll
>> > probably be about 100 million in the future. I have a couple of servers I can
>>
>> 100 million pages = 50-100 servers and 20-40T of space distributed.
>> Ideally the setup would be processing machines and search servers. You
>> would have, say, 50 or so processing machines that would handle the
>> crawling, indexing, mapreduce, and DFS. Then you would have 50 more
>> somewhat less powerful (possibly) servers that handle just serving the
>> search. You can get away with having the processing and search servers
>> on the same machines, but the search will slow down considerably while
>> running large jobs.
>
> Hello, thanks for your answer. 20-40T of space seems large; the question is,
> do you store the fetched files, or just the indexes? I don't want to maintain
> local storage, I need only indexing...
You need space to store the fetched documents (segments). Even when compressed, 100M documents take a lot of space. You are also going to have a crawldb, a linkdb, and indexes, which effectively doubles the amount of space you need. This will have to be on a DFS, because no single machine can handle the load and because RAID at this level is prohibitively expensive. On the DFS you are going to replicate your data blocks a minimum of 3 times for redundancy, so you just tripled your space.

You will still need space on the machines for processing subsequent jobs, unless you plan to delete all of the databases and start from scratch every time, which isn't advised. So for sorts and other mapreduce job processing you will want to leave approximately 30% of the space open on each box; depending on the jobs you are running, you may need more. If you are using the same boxes as search servers, you will then have to copy the indexes from the DFS to local disk, which doubles the space needed for the indexes.

The estimate that we use is 100-200G for every 1M pages indexed. You probably can get away with 50G per 1M pages, but we have large computational jobs running and we don't want to run out of space. A rough calculation would be ~4G of compressed content per 1M pages fetched initially, or 4K compressed per fetched page. So 4G * 2 for crawl, link, and indexing = 8G; * 3 for DFS replication = 24G; * 1.3 for processing space = ~31G; + 4G for local indexes = ~35G per 1M pages, which is why the 50G per 1M figure above already has some cushion in it.

You said above that you don't want local storage, but search has to be on local file systems. While you may technically be able to pull a search result from the DFS, you will almost certainly run out of memory, and the search will take an excessively long time (minutes, not subsecond) if it returns at all.

Search is a hardware-intensive business, in part because of the number of servers needed to serve large indexes. If anybody knows of a better way to set up a search architecture than 2-4M pages per index per search server, I would love to hear about it. The suggestions above on space and architecture are what we have experienced.

Dennis Kubes
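P.S. If it helps to rerun that arithmetic with your own numbers, here is a rough sketch in Python. It just restates the per-1M-pages estimates from this message; every constant is an assumption taken from the thread, not something Nutch will measure for you, so adjust them to match your own crawl and hardware.

# Back-of-the-envelope space estimate per 1M fetched pages, using the
# numbers from this thread. All constants are assumptions -- adjust them
# to your own crawl.

COMPRESSED_CONTENT_GB = 4.0  # ~4K compressed per fetched page * 1M pages
DB_INDEX_FACTOR = 2.0        # crawldb + linkdb + indexes roughly double it
DFS_REPLICATION = 3          # minimum block replication on the DFS
PROCESSING_HEADROOM = 1.3    # keep ~30% free for sorts and mapreduce jobs
LOCAL_INDEX_GB = 4.0         # indexes copied from the DFS to local disk for search

dfs_gb = COMPRESSED_CONTENT_GB * DB_INDEX_FACTOR * DFS_REPLICATION
total_gb = dfs_gb * PROCESSING_HEADROOM + LOCAL_INDEX_GB

print("DFS space per 1M pages:   ~%.0fG" % dfs_gb)    # ~24G
print("Total space per 1M pages: ~%.0fG" % total_gb)  # ~35G
# The 100-200G per 1M pages working estimate above adds headroom for
# large recurring jobs on top of this minimum.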
