On Nov 26, 2011, at 1:48 AM, Nathan Kurz wrote:

> Dan --
>
> I took a glance. Sounds promising. Could you talk a bit about the use
> case you are anticipating?
> What are you indexing?
Best way to describe what I plan to do and currently do: 50% name/value key-pair search and 50% full text search. Both have rather large documents. Lots of sort fields. I truly do abuse the crap out of KinoSearch currently. Highlighting is about the only thing I don't use heavily.

> How fast is it
> changing?

I'm thinking the average number of changes will be about ~15 a second. During more bulky-style changes... I hope much faster.

> Do the shards fit in memory?

Yes and no... Some servers with low query requirements will be overloaded and spill to disk. High-profile indexes with low search SLAs, yes.

> What's a ballpark for the searches
> per second you'd like to handle?

1k/second (name/value style searches) with the 98th percentile search under 30ms. 1k/second (full text with nasty OR queries and large posting files) with the 98th percentile search under 300ms.

> My first thought is that you may be able to trade off some latency for
> increased throughput by sticking with partially serialized requests if you
> were able to pass a threshold score along to each node/shard so you could
> speed past low scoring results.

More detail! I'm thinking you're talking about the top_docs call getting to use a hinted low watermark in its priority queue? (There's a rough sketch of how I read that at the end of this mail.) If so, I was chatting with Marvin about this the other day. I was scared of creating a cluster with 100 nodes. One reason was the sheer number of docs I would need to push over the network with num_wanted => 10, offset => 200. The thing that killed the idea for me: how do I generate the low watermark we pass to the nodes without first getting data back from at least one node?

So the way I fixed the 100-shard problem (in my head) was to build a pyramid of MultiSearchers. That doesn't really work either, and I now think it makes things worse. I'm thinking that by the time I start worrying about 100+ nodes, sampling and early termination will be a must.

Another crazy idea while I have you this far off track: top_docs requests go into a "pool", all 100 nodes try to run the queries in the pool, and each node places the score of its lowest scoring doc into the response pool. Then the top_docs query submitter can decide how long to wait for responses / how many responses to wait for, and knows which nodes it will need to pull top_docs from.

So what I'm doing in lucy_cluster is not trying to solve the 100-node issue just yet, and keeping the number of nodes small, < 10, but at the same time keeping the number of shards around 30ish. Mainly so I can rebalance nodes by just moving shards, and nodes can search more than one shard locally with a multi-searcher (also sketched below).

> But this brings up Marvin's points about
> how to handle distributed TF/IDF...

This is *easy* to solve on a per-index basis with insider knowledge about the index and how it's segmented. Doing it perfectly for everyone, and fast, sounds hard. Spreading out the cost of caching/updating the TF/IDF is, I think, the key. I like the idea of sampling a node or two to get the cache started (and service the search), then finishing the cache out of band to get a better, more complete picture (last sketch below). Unless you're adding/updating an index whose term mix changes quickly, I don't think the TF/IDF cache needs to move quickly.

-Dan
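Below are a few rough sketches of the ideas discussed above; they are illustrations, not lucy_cluster code. First, the low-watermark idea: shards are searched one after another, and the score of the lowest doc still in the global top N is passed along so later shards can skip anything that can't make the cut. search_shard() and its hint_threshold parameter are invented here, and the shards are plain in-memory arrays rather than real Lucy indexes. For contrast, a brute-force merge with num_wanted => 10 and offset => 200 across 100 nodes needs each node to return its own top 210 candidates, roughly 21,000 doc/score pairs per query.

    # Toy partially-serialized scatter-gather with a score hint.
    use strict;
    use warnings;

    sub search_shard {
        my %args = @_;
        # A shard can skip anything scoring at or below the hint.
        my @hits = grep { $_->{score} > $args{hint_threshold} }
                   @{ $args{shard} };
        @hits = sort { $b->{score} <=> $a->{score} } @hits;
        splice @hits, $args{num_wanted} if @hits > $args{num_wanted};
        return \@hits;
    }

    sub gather_top_docs {
        my ( $shards, $num_wanted ) = @_;
        my @top;            # current global top-N candidates
        my $watermark = 0;  # lowest score that can still make the cut

        for my $shard (@$shards) {
            my $hits = search_shard(
                shard          => $shard,
                num_wanted     => $num_wanted,
                hint_threshold => $watermark,
            );
            @top = sort { $b->{score} <=> $a->{score} } ( @top, @$hits );
            splice @top, $num_wanted if @top > $num_wanted;
            $watermark = $top[-1]{score} if @top >= $num_wanted;
        }
        return \@top;
    }

    my @shards = (
        [ { doc => 1, score => 4.2 }, { doc => 2, score => 0.3 } ],
        [ { doc => 3, score => 3.1 }, { doc => 4, score => 2.8 } ],
    );
    printf "doc %d  score %.1f\n", $_->{doc}, $_->{score}
        for @{ gather_top_docs( \@shards, 2 ) };

Note that the watermark only becomes useful once at least one shard has answered, which is exactly the chicken-and-egg problem mentioned above.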
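Second, the per-node piece of the "few nodes, ~30 shards" layout, where each node searches its local shards as one unit. This assumes the stock Lucy classes (Lucy::Search::IndexSearcher, Lucy::Search::PolySearcher); the shard directory path and the 'title' field are made up.

    use strict;
    use warnings;
    use Lucy::Search::IndexSearcher;
    use Lucy::Search::PolySearcher;

    # One node opens all of the shards it currently owns...
    my @shard_searchers = map { Lucy::Search::IndexSearcher->new( index => $_ ) }
                          glob '/var/lucy/shards/*';

    # ...and searches them as a single local unit.  Rebalancing is then
    # just moving a shard directory to another node and reopening.
    my $local_searcher = Lucy::Search::PolySearcher->new(
        schema    => $shard_searchers[0]->get_schema,
        searchers => \@shard_searchers,
    );

    my $hits = $local_searcher->hits( query => 'foo', num_wanted => 10 );
    while ( my $hit = $hits->next ) {
        printf "%s  (%.3f)\n", $hit->{title}, $hit->get_score;
    }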
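Last, the "seed the TF/IDF cache from a sample, finish it out of band" idea. Shards here are just term => doc_freq hashes standing in for real lexicons, and all the function names are invented: a small sample is summed and scaled up so the current search can be scored right away, and the remaining shards are folded in later so subsequent queries use exact counts.

    use strict;
    use warnings;

    # Stand-ins for per-shard doc_freq lookups.
    my @shards = (
        { perl => 120, lucy => 40 },
        { perl => 200, lucy => 35 },
        { perl =>  90, lucy => 60 },
    );

    my %df_cache;    # term => summed doc_freq over the shards seen so far
    my $sampled = 0; # how many shards have been folded in

    # Seed from a small sample so the current search can be served now.
    sub seed_cache {
        my ($sample_size) = @_;
        for my $shard ( @shards[ 0 .. $sample_size - 1 ] ) {
            $df_cache{$_} += $shard->{$_} for keys %$shard;
        }
        $sampled = $sample_size;
    }

    # Estimate cluster-wide doc_freq by scaling the sample up.
    sub doc_freq {
        my ($term) = @_;
        return 0 unless $sampled;
        return ( $df_cache{$term} // 0 ) * @shards / $sampled;
    }

    # Out of band: fold in the shards the sample skipped, so later
    # queries score against exact counts instead of the scaled guess.
    sub finish_cache {
        for my $shard ( @shards[ $sampled .. $#shards ] ) {
            $df_cache{$_} += $shard->{$_} for keys %$shard;
        }
        $sampled = scalar @shards;
    }

    seed_cache(1);
    printf "perl df estimate after sampling 1 shard: %.0f\n", doc_freq('perl');
    finish_cache();
    printf "perl df after finishing out of band:     %.0f\n", doc_freq('perl');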
