Hi Terry, Why not have another index in which a document has one field for the parent and another field containing all of its children. An OR query over the "children" field would return you exactly what you want - one document for each distinct parent.
Steve dontspamterry wrote: > Hi all, > > I know this whole distinct query has been discussed a bunch of times for > various scenarios because I've been scouring the forums trying to find a > clue as to how I could solve my problem. I'm indexing a large set of > parent-child term relations (~1 million). The number of unique terms is > about ~570,000. Each relation is a document. Each term in a relation > contains all of the term's attributes. Effectively, a term's attributes will > be duplicated "x" number of times for the "x" number of relations it > participates in. For example, say I have the following term tree: > > A > |--B > |--E > |--H > |--F > |--C > |--G > |--D > > I would then have documents for: > A->B, A->C, A->D, B->E, (and so forth...) > > For all relations involving A, A's attributes will be duplicated in 3 > separate documents. > For all relations involving B, B's attributes will be duplicated in 3 > separate documents. > (you get the picture...) > > This index structure works great for queries which traverse up and down the > tree. However, I have a requirement where I would also like to do a distinct > query which returns the data for each unique term satisfying the query. For > example, say I have a query which returns all relations where A or B is the > parent (that would be 5 documents in total), > but do a distinct on the parent such that I get 2 documents back, one for A > as the parent (any 1 of the 3 matching docs) and the other where B is the > parent (any 1 of the 2 matching docs). For this query, I don't care about > the child information since I'm only interested in retrieving the distinct > parent terms. This query is analogous to a 'select distinct <set of parent > term attributes>' . I played around with caching BitSets for the fields > which I'd like to do a distinct on, but given the amount of data, I run out > of memory. I also took the approach where I retrieve the bitset using a > queryfilter and then process each document id, hashing the field values on > which I'm doing a distinct to construct my distinct set. Problem with this > is that I have tree structures where a parent has over 100K children. > Retrieving each doc for this size is too time- and memory- consuming. Since > I don't really want to return that much data, I thought that I could use > paging. The problem I faced is that I do not know if a distinct value in the > current query was actually returned in some previous query for a previous > page. > > Sorry for the long description, but wanted to make sure I explained it as > clearly as I could. > > -Terry --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]