Hi all,

I'm building a vertical search index of about 300-500 million web pages (mostly articles). So I'm trying to figure out what kind of hardware I need for both the backend crawl/index build servers and the frontend search servers. I would assume I'll need to spider about 5 million pages per day.
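As a rough sanity check on the crawl numbers, the arithmetic can be sketched as below. This is only a back-of-envelope sketch: it assumes a constant fetch rate around the clock and ignores retries, re-crawls of changed pages, and politeness delays. The function names are made up for illustration.

```python
# Back-of-envelope crawl arithmetic (assumes a constant fetch rate, no retries).
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_rate_per_second(pages_per_day):
    """Sustained fetch rate needed to hit a daily page target."""
    return pages_per_day / SECONDS_PER_DAY


def days_to_crawl(total_pages, pages_per_day):
    """Days for one full pass over the corpus at the given daily rate."""
    return total_pages / pages_per_day


if __name__ == "__main__":
    pages_per_day = 5_000_000  # figure from the post
    rate = crawl_rate_per_second(pages_per_day)
    print(f"sustained fetch rate: {rate:.1f} pages/s")
    for total in (300_000_000, 500_000_000):  # corpus range from the post
        days = days_to_crawl(total, pages_per_day)
        print(f"{total:,} pages: {days:.0f} days per full pass")
```

At 5 million pages per day this works out to roughly 58 pages per second, and one full pass over 300-500 million pages would take about 60 to 100 days.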
From my own experience with Lucene, I think that a single processor should be able to serve about 10 million pages at an acceptable rate (~5 queries per second). So my current assumption is that 5 dual quad-core servers with 32 GB of RAM each and a total of 5 TB of disk on a SAN should be about what I need to process queries. Does that seem about right? Or should I opt for local drives (if yes, how many? Striped?) or more or fewer cores per server?

I am unsure about what I need for the Hadoop cluster for building/updating the index, though. According to this (pre-Hadoop) article:

http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=144

I'd need a single-processor server with 1 GB of RAM and 1 TB of storage across 8 drives on a RAID controller to handle 100 million pages. But that is pre-Hadoop. What is the current best practice? 1 GB of RAM also sounds rather small to me, and I would think you'd need more than one processor per 100 million pages. Does it make sense to get a few servers with many cores, or a lot of servers with single- or dual-core processors? And what specs?

I'd love to hear what kind of hardware other people are running this size of index on.

Thanks,
Stefan
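For reference, the frontend estimate above can be checked with a little arithmetic. This is a minimal sketch under two assumptions that the post does not state outright: "a single processor" is read as one core, and a dual quad-core server is counted as 8 cores. The 10 million pages per core figure is the rule of thumb from the post, not a measured value, and the function names are hypothetical.

```python
import math

# Rule of thumb from the post: one processor serves ~10 million pages.
# Assumption: "processor" means one core here.
PAGES_PER_CORE = 10_000_000
# Assumption: a dual quad-core server is counted as 2 * 4 = 8 cores.
CORES_PER_SERVER = 8


def cores_needed(total_pages, pages_per_core=PAGES_PER_CORE):
    """Cores required to hold the whole index at the given pages-per-core."""
    return math.ceil(total_pages / pages_per_core)


def servers_needed(total_pages, cores_per_server=CORES_PER_SERVER,
                   pages_per_core=PAGES_PER_CORE):
    """Whole servers required, rounding up to the next full machine."""
    return math.ceil(cores_needed(total_pages, pages_per_core) / cores_per_server)


if __name__ == "__main__":
    for total in (300_000_000, 400_000_000, 500_000_000):
        print(f"{total:,} pages: {cores_needed(total)} cores, "
              f"{servers_needed(total)} servers")
```

Under these assumptions, 5 servers (40 cores) cover about 400 million pages. The upper end of the range, 500 million pages, would need 50 cores, which rounds up to 7 such servers.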
