Re: Aggregating category hits

zzzzz shalev Mon, 29 May 2006 15:44:43 -0700

I know I'm a little late replying to this thread, but in my humble opinion the 
best way to aggregate values (not necessarily terms, but whole values in 
fields) is as follows:
   
  startup stage:
   
  for each field you would like to aggregate create a hashmap
   
  open an index reader and run through all the docs
   
  get the values to be aggregated from the fields of each doc
   
  create a hashcode for each value from each field collected; the hashcode 
should have some sort of prefix indicating which field it's from (for example: 1 
= author, 2 = ....) and hence which hash it is stored in (at retrieval time, 
this prefix can be used to easily retrieve the value from the correct hash)
   
  place the hashcode/value in the appropriate hash
   
  create an arraylist
   
  at index X in the arraylist place an int array of all the hashcodes 
associated with doc id X
   
  so for example: if I have doc id 0, which contains the values William 
Shakespeare and 1797, the arraylist at index 0 will have an int array 
containing 2 values (the 2 hashcodes of William Shakespeare and 1797)
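
  A rough sketch of the startup stage in plain Java, with the index reader 
loop left out: the caller is assumed to pass each doc's field values in doc id 
order. The class and method names (StartupSketch, addDoc, fieldOf) and the 
24-bit shift are made up for illustration. Note one substitution: instead of a 
true hashcode, each new value gets a sequential id within its field, which 
keeps the field prefix but avoids collisions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StartupSketch {
    // Assumption: the bits above this shift carry the field number (1 = author, 2 = ...).
    static final int FIELD_SHIFT = 24;

    // codeToValue.get(f) maps a prefixed code back to the original value of field f + 1.
    final List<Map<Integer, String>> codeToValue = new ArrayList<>();
    // Reverse lookup, used only while building, so equal values share one code.
    final List<Map<String, Integer>> valueToCode = new ArrayList<>();
    // Entry X holds the codes of all aggregated values for doc id X.
    final List<int[]> codesByDoc = new ArrayList<>();

    StartupSketch(int fieldCount) {
        for (int f = 0; f < fieldCount; f++) {
            codeToValue.add(new HashMap<>());
            valueToCode.add(new HashMap<>());
        }
    }

    // values[f] is this doc's value for field f + 1; docs must be added in doc id order.
    void addDoc(String[] values) {
        int[] codes = new int[values.length];
        for (int f = 0; f < values.length; f++) {
            Map<String, Integer> known = valueToCode.get(f);
            Integer code = known.get(values[f]);
            if (code == null) {
                // Field prefix in the high bits, sequential id (not a real hash) in the low bits.
                code = ((f + 1) << FIELD_SHIFT) | known.size();
                known.put(values[f], code);
                codeToValue.get(f).put(code, values[f]);
            }
            codes[f] = code;
        }
        codesByDoc.add(codes);
    }

    // Recovers the field number from a prefixed code.
    static int fieldOf(int code) {
        return code >>> FIELD_SHIFT;
    }
}
```

  With a 24-bit shift this sketch allows about 16 million distinct values per 
field and up to 127 fields before the prefix reaches the sign bit; a different 
split may suit other data better.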
   
  run time:
   
  at run time, receive the hits and iterate through the doc ids, aggregating 
the values with direct access into the arraylist (for doc id 10 go to index 10 
in the arraylist to retrieve the array of hashcodes) and lookups into the 
hashmaps
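
  The run time step could look roughly like this. It is only a sketch: it 
assumes the hits have already been reduced to an int array of doc ids, that 
each code carries its field number above bit 24, and that the per-field maps 
go from code back to value. RuntimeSketch and aggregate are made-up names, not 
part of Lucene.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RuntimeSketch {
    // Assumption: the bits above this shift carry the field number (1 = author, 2 = ...).
    static final int FIELD_SHIFT = 24;

    // Counts how often each value of the given field occurs among the hit doc ids.
    static Map<String, Integer> aggregate(int[] hitDocIds,
                                          List<int[]> codesByDoc,
                                          List<Map<Integer, String>> codeToValue,
                                          int field) {
        Map<Integer, Integer> countsByCode = new HashMap<>();
        for (int docId : hitDocIds) {
            // Direct access: the codes for doc id X sit at index X.
            for (int code : codesByDoc.get(docId)) {
                if ((code >>> FIELD_SHIFT) == field) {
                    countsByCode.merge(code, 1, Integer::sum);
                }
            }
        }
        // Resolve codes to values once per distinct code, not once per hit.
        Map<Integer, String> values = codeToValue.get(field - 1);
        Map<String, Integer> countsByValue = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : countsByCode.entrySet()) {
            countsByValue.put(values.get(e.getKey()), e.getValue());
        }
        return countsByValue;
    }
}
```

  The loop touches only the docs in the result set, so the work grows with 
the number of hits rather than with the size of the index.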
   
  I tested this today on a small index of approx. 400,000 docs (1GB of data), 
running queries that returned over 100,000 results
   
  my response time was about 550 milliseconds on these large (over 100,000) 
result sets
   
  another point: this method should scale to much larger indexes as well, 
since the run time cost is linear in the result set size and not the index 
size (which is a HUGE bonus)
   
  if anyone wants the code, let me know.


Marvin Humphrey <[EMAIL PROTECTED]> wrote:
  
Thanks, all.

The field cache and the bitsets both seem like good options until the 
collection grows too large, provided that the index does not need to 
be updated very frequently. Then for large collections, there's 
statistical sampling. Any of those options seems preferable to 
retrieving all docs all the time.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


