On Nov 8, 2012, at 11:30 AM, Robert Muir <[email protected]> wrote:

Thanks, everybody, for the responses, and even more thanks for the great project.


> Why are you retrieving thousands of stored fields?


I do not think it is all that rare that people actually do something with the retrieved information other than display summaries. Clustering in Solr does exactly that, and online record linkage follows exactly the same pattern.
 
The pattern "fetch thousands of candidates and run some heavy processing on them" is surely not typical "web search engine" usage, but philosophically the model:
a) search data
b) do something with it
c) deliver
is not that strange?
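To make the a/b/c model concrete, here is a minimal, Lucene-free sketch in Python; the record layout, the predicate, and the clustering key are all hypothetical, chosen only to illustrate the search -> process -> deliver shape:

```python
from collections import defaultdict

# Hypothetical structured records, standing in for stored fields.
records = [
    {"id": 1, "city": "Oslo",   "name": "A. Smith"},
    {"id": 2, "city": "Oslo",   "name": "A. Smyth"},
    {"id": 3, "city": "Bergen", "name": "B. Jones"},
]

def search(pred):
    # a) search data: return all matching candidates, not a top-10 page
    return [r for r in records if pred(r)]

def cluster_by(hits, key):
    # b) do something with it: here, group candidates by a field
    groups = defaultdict(list)
    for r in hits:
        groups[r[key]].append(r)
    return groups

hits = search(lambda r: r["city"] == "Oslo")
for city, group in cluster_by(hits, "city").items():
    # c) deliver the processed result, not the raw hits
    print(city, [r["name"] for r in group])
```

The point is only that step b) consumes every candidate's fields, which is why bulk retrieval cost matters.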

You say b) should not be done using stored fields; ok, I trust you, but going to a database/NoSQL store/anything else is even slower. What approach would you recommend?


"the probability of two documents of the same results page being in the same 
chunk is very low."

Adrian, Robert, this is 100% correct, no objection there.
In this particular case we rely heavily on locality of reference. We simply sort the data and reindex from time to time. You have to be lucky to be able to sort the documents, but we do not use Lucene for big chunks of text, rather for almost fully structured data, and we know how to sort this data to preserve locality of reference… Also a bit unusual, but not, I think, all that rare a scenario.
Sorting the data (where possible) was a great optimisation tip for many applications, even before compression.
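The sorting trick can be quantified with a toy simulation (pure Python; the chunk size of 64 docs per chunk and the index sizes are made-up numbers for illustration, not Lucene's actual defaults): when a query's hits are contiguous under the sort order they share chunks, and when they are scattered nearly every hit decompresses its own chunk:

```python
import random

CHUNK_DOCS = 64  # assumed docs per compression chunk (illustrative only)

def chunks_touched(doc_ids, chunk_docs=CHUNK_DOCS):
    """Count distinct chunks that must be decompressed to fetch doc_ids."""
    return len({d // chunk_docs for d in doc_ids})

random.seed(0)
# 1000 hits that are contiguous under a good sort key...
sorted_hits = list(range(10_000, 11_000))
# ...versus 1000 hits scattered over a 1M-doc index in arbitrary order.
scattered_hits = random.sample(range(1_000_000), 1_000)

print(chunks_touched(sorted_hits))     # 16 chunks for 1000 contiguous docs
print(chunks_touched(scattered_hits))  # close to one chunk per hit
```

So with a lucky sort key the decompression work per results page drops by well over an order of magnitude, which matches what we see.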


"really you should roll your own codec for this and specialise."

Yes, we have already started thinking about it, but we will first try to play with the chunk size to see if we can achieve the goal without our own codec…
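Since the question is whether chunk-size tuning can substitute for a custom codec, the worst case is easy to estimate: with fully scattered hits, each retrieved document pays for decompressing one whole chunk, so the decompression volume scales linearly with chunk size (the sizes below are illustrative candidates, not Lucene defaults; smaller chunks also mean a worse compression ratio, which is the trade-off):

```python
CHUNK_KB = [4, 16, 64]   # candidate chunk sizes in KB (illustrative)
HITS = 1000              # scattered hits: worst case, one chunk each

def worst_case_mb(chunk_kb, hits=HITS):
    # every hit decompresses one full chunk of chunk_kb kilobytes
    return chunk_kb * hits / 1024

for kb in CHUNK_KB:
    print(f"{kb} KB chunks -> {worst_case_mb(kb):.3f} MB decompressed")
```

That linear relationship is why shrinking the chunk may be enough for us before resorting to a specialised codec.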



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
