RE: grouping results by fields

zzzzz shalev Mon, 30 Jan 2006 13:59:01 -0800

thanks for the advice guys!
   
  currently , i am iterating through about 200-300 of the top docs and creating 
the groups (so, as of now, the groups are partial) , my response time HAS to be 
at most 500-600 milli (query + groupings) or my company will probably go with a 
commercial search engine such as FAST or something of the sort
   
  the data i must prove lucene can handle is up to 25 million 2-4Kb docs, i 
dont think it is feasable to creat groups on a result set consisting of a few 
thousand results in a 1/2 second, or am i wrong?
   
  i will try both solutions , and hope for the best
   
  in any case much much thanks!


Chris Hostetter <[EMAIL PROTECTED]> wrote:
  
An approach like mark is describing sould should be a lot more space
efficient then the BitSet intersection approach i described before, but
depending on how many groupings you want, i can immagine that it might be
slower some cases.

Unfortunately, it also only works if the grouping you wnat are tied
directly to field values (in my case, i need to support ranges, prefix
queries, and and boolean queries for each grouping item)

Along a similar line of thinking: if the fields you want to group by are
non-tokenized (ie: only one value per doc) then you can iterate of the set
bits from your orriginal search and looking the value for each matching
doc using the FieldCache. not sure if that would be more space/time
efficient then looping over the TermEnum/TermDocs ... i guess it depends
on the average size of your results, the average size of your index, and
the average number of terms in the fields you want to group by.



: Date: Mon, 30 Jan 2006 17:45:10 +0000 (GMT)
: From: mark harwood 
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: RE: grouping results by fields
:
: > A simple solution if you only have 20,000 docs is
: > just to iterate
: > through the hits and count them up against each
: > color etc,
:
: The one thing to avoid is reader.document() calls in
: such a tight loop. This is always a killer.
:
: The best way I've found is to create one bitset for
: all the matching docs then use TermEnum on the "group"
: field(s) to find all the docids - then check each
: docId against the "matches" bitset to accumulate
: scores for each unique "group" field value:
:
: TermEnum te = reader.terms(new
: Term(groupFieldName, ""));
: Term term = te.term();
: while (term!=null)
: {
: if (term.field().equals(groupFieldName))
: {
: TermDocs termDocs =
: reader.termDocs(term);
: GroupTotal groupTotal = null;
:
: boolean continueThisTerm = true;
: while ((continueThisTerm) &&
: (termDocs.next()))
: {
: int docID = termDocs.doc();
: if (queryMmatchedDocs.get(docId))
: {
: if (groupTotal == null)
: {
: //look up the group key
: and initialize
: String termText =
: term.text();
: Object key = termText;
: groupTotal = (GroupTotal)
: totals.get(key);
: if (groupTotal == null)
: {
: //no totals exist yet,
: create new one.
: groupTotal = new
: GroupTotal((key);
: totals.put(key,
: groupTotal);
: }
: }
:
: groupTotal.addQueryMatchDoc(docID);
: }
: }
: } else
: {
: break;
: }
: if(te.next())
: {
: term=te.term();
: }
: else
: {
: break;
: }
: }
:
: Cheers
: Mark
:
:
:
:
: ___________________________________________________________
: Win a BlackBerry device from O2 with Yahoo!. Enter now. 
http://www.yahoo.co.uk/blackberry
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  


                
---------------------------------
Bring words and photos together (easily) with
 PhotoMail  - it's free and works with Yahoo! Mail.

RE: grouping results by fields

Reply via email to