Hi, thanks for your answer; comments inline.

On Mon, Jun 29, 2009 at 10:06 AM, eks dev <eks...@yahoo.co.uk> wrote:
> Depends on your architecture: will you partition your index? What is the
> max expected size of your index (you said 128G and growing...)? What do
> you mean by growing? With both options you have enough memory to load it
> into RAM...

Yes, we partition the index with a simple round-robin algorithm (sketch in
the PS below). The options were just to give the reader some visibility
into what kind of hardware you get depending on which path you choose. I
do not really have that amount of money to spend right now; more like
1/6th of that, really. We crawl blogs, the number of blogs we find is
still increasing, and we are not nearly indexing all languages, so the
index will grow at least linearly: say at least 10-20G a month or so?

> I would definitely try to have fewer machines and a lot of memory, so
> that your index fits into RAM comfortably...

OK, so you mean that one should aim for fitting the shard into RAM.

> IMO, 8 GB per machine is rather smallish, but it depends heavily on your
> access patterns... How many documents do you need to load from disk per
> query? If this does not create huge IO, you could try to load everything
> but stored fields into RAM.

We store no fields in the index besides the actual DB id. We load no more
than 50 docs at a time.

> What are your requirements on the indexing side (once a day, once a
> week, every 15 minutes)? How do you distribute the index to all these
> machines?

We index only outside office hours.

> Your question, IO or CPU bound: it depends. If you load the index into
> RAM it becomes memory-bus/CPU bound; if it is mainly on disk it will be
> IO bound.

OK, as I suspected; that answers my previous question(s). Final question:
based on your findings, what is the most challenging part to tune?
Sorting, querying, or something else?

//Marcus
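PS: Since you asked about the partitioning, here is a minimal sketch of
what I mean by round-robin routing. Class and variable names are
illustrative, not our actual code.

import java.util.concurrent.atomic.AtomicLong;

/**
 * Round-robin document routing: every new document goes to the next
 * shard in line, so all shards grow at the same rate.
 */
public class RoundRobinRouter {

    private final int numShards;
    private final AtomicLong counter = new AtomicLong();

    public RoundRobinRouter(int numShards) {
        this.numShards = numShards;
    }

    /** Index of the shard that should receive the next document. */
    public int nextShard() {
        // Mask out the sign bit so the result stays valid even if the
        // counter ever wraps around.
        return (int) ((counter.getAndIncrement() & Long.MAX_VALUE) % numShards);
    }
}

The indexer then just does writers[router.nextShard()].addDocument(doc),
with one IndexWriter per shard. Nothing fancy; the obvious downside is
that every query has to fan out to all shards.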
> ----- Original Message ----
> From: Marcus Herou <marcus.he...@tailsweep.com>
> To: java-user@lucene.apache.org
> Sent: Monday, 29 June, 2009 9:47:13
> Subject: Re: Scaling out/up or a mix

>> Thanks for the answer.
>>
>> Don't you think that part 1 of the email would give you a hint of the
>> nature of the index?
>>
>> Index size (and growing): 16G x 8 = 128G
>> Doc size (data): 20k
>> Num docs: 90M
>> Num users: a few hundred, but most critical is the admin staff, who use
>> the index all day long.
>> Query types, example: title:"Iphone" OR description:"Iphone" sorted by
>> publishedDate. Very simple, no fuzzy searches etc. However, since the
>> dataset is large it will consume memory on sorting, I guess.
>>
>> Could one not draw conclusions about best practice in terms of hardware
>> given the above "specs"?
>>
>> Basically I would like to know if I really need 8 cores, since machines
>> with dual-CPU support are the most expensive and I would like not to
>> throw money away, so getting it right is a matter of economy.
>>
>> I mean, it is very simple: let's say someone gives me a budget of
>> 50,000 USD and I want to get the most bang for the buck for my
>> workload. Should I go for
>>
>> X machines with one quad-core 3.0GHz CPU, 4 disks in RAID1+0 and 8G
>> RAM, costing 1200 USD apiece (giving me 40 machines: 160 disks, 160
>> cores, 320G RAM), or
>>
>> X machines with dual quad-core 2.0GHz CPUs, 4 disks in RAID1+0 and 36G
>> RAM, costing 3400 USD apiece (giving me 15 machines: 60 disks, 120
>> cores, 540G RAM)?
>>
>> Basically I would like to know what factors make the workload IO bound
>> vs CPU bound?
>>
>> //Marcus
>>
>> On Mon, Jun 29, 2009 at 8:53 AM, Eric Bowman wrote:
>>
>>> There is no single answer -- this is always application specific.
>>>
>>> Without knowing anything about what you are doing:
>>>
>>> 1. Disk I/O is probably the most critical. Go SSD or even RAM disk, if
>>> you can, if performance is absolutely critical.
>>> 2. Sometimes CPU can become an issue, but 8 cores is probably enough
>>> unless you are doing especially CPU-bound searches.
>>>
>>> Unless you are doing something with hard performance requirements, or
>>> really quite unusual, buying "good" kit is probably good enough, and
>>> you won't really know for sure until you measure. Lucene is a general
>>> enough tool that there isn't a terribly universal answer to this. We
>>> were a bit surprised to end up CPU-bound instead of disk I/O-bound,
>>> for instance, but we ended up taking an unusual path. YMMV.
>>>
>>> Marcus Herou wrote:
>>>> Hi. I think I need to be more specific.
>>>>
>>>> What I am trying to find out is whether I should aim for:
>>>>
>>>> CPU: 2x4 cores at 2.0-3.0GHz? Or are 4 cores perhaps enough?
>>>> Fast disk IO: 8 disks in RAID1+0? Or are 2 disks perhaps enough?
>>>> RAM: if the index does not fit into RAM, how much RAM should I buy?
>>>>
>>>> Any hints would be appreciated, since I am going to invest soon.
>>>>
>>>> //Marcus
>>>>
>>>> On Sat, Jun 27, 2009 at 12:00 AM, Marcus Herou wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> I currently have an index which is 16GB per machine (8 machines =
>>>>> 128GB; the data is stored externally, not in the index) and growing
>>>>> like crazy (we are indexing blogs, which is crazy by nature), and I
>>>>> have only allocated 2GB per machine to the Lucene app, since we are
>>>>> running some other stuff there in parallel.
>>>>>
>>>>> Each doc should be roughly the size of a blog post, no more than
>>>>> 20k.
>>>>>
>>>>> We currently have about 90M documents, and the count is increasing
>>>>> rapidly, so getting into the G+ document range is not going to be
>>>>> too far away.
>>>>>
>>>>> Now, due to search performance, I think I need to move these
>>>>> instances to dedicated index/search machines (or index on some
>>>>> machines and search on others). Anyway, I would like to get some
>>>>> feedback about two things:
>>>>>
>>>>> 1. What is the most important hardware aspect when it comes to
>>>>> adding documents to the index and optimizing it?
>>>>> 1.1 Is it disk IO write throughput? (Sequential or random IO?)
>>>>> 1.2 Is it RAM?
>>>>> 1.3 Is it CPU?
>>>>>
>>>>> My guess would be disk IO. Right? Wrong?
>>>>>
>>>>> 2. What is the most important hardware aspect when it comes to
>>>>> searching documents in my setup? (The result set is limited to the
>>>>> top 10 matches, with paging.)
>>>>> 2.1 Is it disk read throughput? (Sequential or random IO?)
>>>>> 2.2 Is it RAM?
>>>>> 2.3 Is it CPU?
>>>>>
>>>>> I have no clue, since the data might not fit into memory. What is
>>>>> then the most important factor: read performance while scanning the
>>>>> index, or CPU while comparing fields and collecting results?
>>>>>
>>>>> What I'm trying to find out is what I can do to get the most bang
>>>>> for the buck with a limited (aren't we all limited?) budget.
>>>>>
>>>>> Kindly
>>>>>
>>>>> //Marcus
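PS2, re the suggestion above to load everything but stored fields into
RAM, and re the sorting: below is a minimal sketch of the pattern I have
in mind, against the Lucene 2.4 API. The shard path and the "id" and
"publishedDate" field names are placeholders for our actual setup.

import java.io.File;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamShardSearch {

    public static void main(String[] args) throws Exception {
        // Copy the on-disk shard into the heap so searches never hit the
        // disk (the JVM obviously needs -Xmx larger than the shard).
        RAMDirectory ramDir = new RAMDirectory(
                FSDirectory.getDirectory(new File("/data/index/shard0")));
        IndexSearcher searcher = new IndexSearcher(ramDir);

        // title:"iphone" OR description:"iphone"
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("title", "iphone")),
                BooleanClause.Occur.SHOULD);
        query.add(new TermQuery(new Term("description", "iphone")),
                BooleanClause.Occur.SHOULD);

        // Newest first. The first sorted search on this field pays the
        // cost of building the FieldCache for it.
        Sort byDate = new Sort(
                new SortField("publishedDate", SortField.LONG, true));

        // Top 10 matches only, as in our paging setup.
        TopDocs top = searcher.search(query, null, 10, byDate);
        for (ScoreDoc hit : top.scoreDocs) {
            // The only stored field is the DB id; everything else lives
            // in the database.
            System.out.println(searcher.doc(hit.doc).get("id"));
        }
        searcher.close();
    }
}

One number worth keeping in mind: sorting on publishedDate makes Lucene
build a FieldCache of one long per document, i.e. roughly 8 bytes x 11M
docs = ~90MB per shard (about 720MB across all 90M docs) on top of the
index itself.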
--
Marcus Herou
CTO and co-founder, Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/