On Nov 26, 2011, at 1:48 AM, Nathan Kurz wrote:

> Dan --
>
> I took a glance. Sounds promising. Could you talk a bit about the use
> case you are anticipating?
> What are you indexing?
Best way to describe what I plan to do and currently do: 50% name/value key-pair search and 50% full text search. Both have rather large documents. Lots of sort fields. I truly do abuse the crap out of KinoSearch currently. Highlighting is about the only thing I don't use heavily.

> How fast is it
> changing?

I'm thinking the average number of changes will be about ~15 a second. During more bulky-style changes... I hope much faster.

> Do the shards fit in memory?

Yes and no... Some servers with low query requirements will be overloaded and spill to disk. High-profile indexes with low search SLAs, yes.

> What's a ballpark for the searches
> per second you'd like to handle?

1k/second (name/value style searches) with the 98th percentile search under 30ms. 1k/second (full text with nasty OR queries and large posting files) with the 98th percentile search under 300ms.

> My first thought is that you may be able to trade off some latency for
> increased throughput by sticking with partially serialized requests if you
> were able to pass a threshold score along to each node/shard so you could
> speed past low scoring results.

More detail! I'm thinking you're talking about the top_docs call getting to use a hinted low watermark in its priority queue? (There's a rough sketch of how I read that at the end of this mail.) If so, I was chatting with Marvin about this the other day. I was scared of creating a cluster with 100 nodes. One reason was the sheer number of docs I would need to push over the network with num_wanted => 10, offset => 200. The thing that killed the idea for me: how do I generate the low watermark we pass to the nodes without first getting data back from at least one node?

So the way I fixed the 100-shard problem (in my head) was to build a pyramid of MultiSearchers. That doesn't really work either, and I now think it makes things worse. I'm thinking that by the time I start worrying about 100+ nodes, sampling and early termination will be a must.

Another crazy idea while I have you this far off track: top_docs requests go into a "pool", all 100 nodes try to run the queries in the pool, and each node places the score of its lowest scoring doc into the response pool. Then the top_docs query submitter can decide how long to wait for responses / how many responses to wait for, and knows which nodes it will need to pull top_docs from.

So what I'm doing in lucy_cluster is not trying to solve the 100-node issue just yet, and keeping the number of nodes small, < 10, but at the same time keeping the number of shards around 30ish. Mainly so I can rebalance nodes by just moving shards, and nodes can search more than one shard locally with a multi-searcher (also sketched below).

> But this brings up Marvin's points about
> how to handle distributed TF/IDF...

This is *easy* to solve on a per-index basis with insider knowledge about the index and how it's segmented. Doing it perfectly for everyone, and fast, sounds hard. Spreading out the cost of caching/updating the TF/IDF is, I think, the key. I like the idea of sampling a node or two to get the cache started (and service the search), then finishing the cache out of band to get a better, more complete picture (last sketch below). Unless you're adding/updating an index whose term mix changes quickly, I don't think the TF/IDF cache needs to move quickly.

-Dan
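Below are a few rough sketches of the ideas discussed above; they are illustrations, not lucy_cluster code. First, the low-watermark idea: shards are searched one after another, and the score of the lowest doc still in the global top N is passed along so later shards can skip anything that can't make the cut. search_shard() and its hint_threshold parameter are invented here, and the shards are plain in-memory arrays rather than real Lucy indexes. For contrast, a brute-force merge with num_wanted => 10 and offset => 200 across 100 nodes needs each node to return its own top 210 candidates, roughly 21,000 doc/score pairs per query.

    # Toy partially-serialized scatter-gather with a score hint.
    use strict;
    use warnings;

    sub search_shard {
        my %args = @_;
        # A shard can skip anything scoring at or below the hint.
        my @hits = grep { $_->{score} > $args{hint_threshold} }
                   @{ $args{shard} };
        @hits = sort { $b->{score} <=> $a->{score} } @hits;
        splice @hits, $args{num_wanted} if @hits > $args{num_wanted};
        return \@hits;
    }

    sub gather_top_docs {
        my ( $shards, $num_wanted ) = @_;
        my @top;            # current global top-N candidates
        my $watermark = 0;  # lowest score that can still make the cut

        for my $shard (@$shards) {
            my $hits = search_shard(
                shard          => $shard,
                num_wanted     => $num_wanted,
                hint_threshold => $watermark,
            );
            @top = sort { $b->{score} <=> $a->{score} } ( @top, @$hits );
            splice @top, $num_wanted if @top > $num_wanted;
            $watermark = $top[-1]{score} if @top >= $num_wanted;
        }
        return \@top;
    }

    my @shards = (
        [ { doc => 1, score => 4.2 }, { doc => 2, score => 0.3 } ],
        [ { doc => 3, score => 3.1 }, { doc => 4, score => 2.8 } ],
    );
    printf "doc %d  score %.1f\n", $_->{doc}, $_->{score}
        for @{ gather_top_docs( \@shards, 2 ) };

Note that the watermark only becomes useful once at least one shard has answered, which is exactly the chicken-and-egg problem mentioned above.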
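Second, the per-node piece of the "few nodes, ~30 shards" layout, where each node searches its local shards as one unit. This assumes the stock Lucy classes (Lucy::Search::IndexSearcher, Lucy::Search::PolySearcher); the shard directory path and the 'title' field are made up.

    use strict;
    use warnings;
    use Lucy::Search::IndexSearcher;
    use Lucy::Search::PolySearcher;

    # One node opens all of the shards it currently owns...
    my @shard_searchers = map { Lucy::Search::IndexSearcher->new( index => $_ ) }
                          glob '/var/lucy/shards/*';

    # ...and searches them as a single local unit.  Rebalancing is then
    # just moving a shard directory to another node and reopening.
    my $local_searcher = Lucy::Search::PolySearcher->new(
        schema    => $shard_searchers[0]->get_schema,
        searchers => \@shard_searchers,
    );

    my $hits = $local_searcher->hits( query => 'foo', num_wanted => 10 );
    while ( my $hit = $hits->next ) {
        printf "%s  (%.3f)\n", $hit->{title}, $hit->get_score;
    }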
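Last, the "seed the TF/IDF cache from a sample, finish it out of band" idea. Shards here are just term => doc_freq hashes standing in for real lexicons, and all the function names are invented: a small sample is summed and scaled up so the current search can be scored right away, and the remaining shards are folded in later so subsequent queries use exact counts.

    use strict;
    use warnings;

    # Stand-ins for per-shard doc_freq lookups.
    my @shards = (
        { perl => 120, lucy => 40 },
        { perl => 200, lucy => 35 },
        { perl =>  90, lucy => 60 },
    );

    my %df_cache;    # term => summed doc_freq over the shards seen so far
    my $sampled = 0; # how many shards have been folded in

    # Seed from a small sample so the current search can be served now.
    sub seed_cache {
        my ($sample_size) = @_;
        for my $shard ( @shards[ 0 .. $sample_size - 1 ] ) {
            $df_cache{$_} += $shard->{$_} for keys %$shard;
        }
        $sampled = $sample_size;
    }

    # Estimate cluster-wide doc_freq by scaling the sample up.
    sub doc_freq {
        my ($term) = @_;
        return 0 unless $sampled;
        return ( $df_cache{$term} // 0 ) * @shards / $sampled;
    }

    # Out of band: fold in the shards the sample skipped, so later
    # queries score against exact counts instead of the scaled guess.
    sub finish_cache {
        for my $shard ( @shards[ $sampled .. $#shards ] ) {
            $df_cache{$_} += $shard->{$_} for keys %$shard;
        }
        $sampled = scalar @shards;
    }

    seed_cache(1);
    printf "perl df estimate after sampling 1 shard: %.0f\n", doc_freq('perl');
    finish_cache();
    printf "perl df after finishing out of band:     %.0f\n", doc_freq('perl');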
