Re: Multi-field distinct query

Grant Ingersoll Tue, 15 May 2007 18:08:04 -0700

I suggest you ask on the user mailing list (java-[EMAIL PROTECTED]) as you are likely to get a lot more interest from others. java-dev is for discussing the internals of how Lucene works.



Thanks,
Grant

On May 15, 2007, at 7:13 PM, dontspamterry wrote:

Hi all,
I know this whole distinct query has been discussed a bunch of times for
various scenarios because I've been scouring the forums trying to find a
clue as to how I could solve my problem. I'm indexing a large set of
parent-child term relations (~1 million). The number of unique terms is
about 570,000. Each relation is a document. Each term in a relation
contains all of the term's attributes. Effectively, a term's attributes will
be duplicated "x" number of times for the "x" number of relations it
participates in. For example, say I have the following term tree:

A
|--B
    |--E
        |--H
    |--F
|--C
    |--G
|--D

I would then have documents for:
A->B, A->C, A->D, B->E, (and so forth...)

For all relations involving A, A's attributes will be duplicated in 3
separate documents.
For all relations involving B, B's attributes will be duplicated in 3
separate documents.
(you get the picture...)
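
The layout described above can be sketched in plain Java. This is only an illustration of the one-document-per-relation idea, not Lucene code: the class and method names (RelationDocs, Relation, flatten, copiesOf) are hypothetical, and the real documents would also carry each term's attributes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RelationDocs {
    /** Hypothetical stand-in for one indexed document: a single parent->child relation. */
    public static final class Relation {
        public final String parent;
        public final String child;
        public Relation(String parent, String child) {
            this.parent = parent;
            this.child = child;
        }
    }

    /** Flattens a parent -> children map into one Relation per tree edge. */
    public static List<Relation> flatten(Map<String, List<String>> tree) {
        List<Relation> docs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : tree.entrySet()) {
            for (String child : entry.getValue()) {
                docs.add(new Relation(entry.getKey(), child));
            }
        }
        return docs;
    }

    /** Counts the documents that would hold a copy of the given term's attributes. */
    public static long copiesOf(String term, List<Relation> docs) {
        long count = 0;
        for (Relation r : docs) {
            if (r.parent.equals(term) || r.child.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** The example tree from the message: A has B, C, D; B has E, F; C has G; E has H. */
    public static Map<String, List<String>> exampleTree() {
        Map<String, List<String>> tree = new LinkedHashMap<>();
        tree.put("A", Arrays.asList("B", "C", "D"));
        tree.put("B", Arrays.asList("E", "F"));
        tree.put("C", Arrays.asList("G"));
        tree.put("E", Arrays.asList("H"));
        return tree;
    }

    public static void main(String[] args) {
        List<Relation> docs = flatten(exampleTree());
        System.out.println(docs.size() + " relation documents");
        System.out.println("A's attributes are copied into " + copiesOf("A", docs) + " documents");
        System.out.println("B's attributes are copied into " + copiesOf("B", docs) + " documents");
    }
}
```

For the example tree this gives 7 relation documents, with A and B each appearing in 3 of them, which matches the counts above.
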
This index structure works great for queries which traverse up and down the
tree. However, I have a requirement where I would also like to do a distinct
query which returns the data for each unique term satisfying the query. For
example, say I have a query which returns all relations where A or B is the
parent (that would be 5 documents in total), but do a distinct on the parent
such that I get 2 documents back, one for A as the parent (any 1 of the 3
matching docs) and the other where B is the parent (any 1 of the 2 matching
docs). For this query, I don't care about the child information since I'm
only interested in retrieving the distinct parent terms. This query is
analogous to a 'select distinct <set of parent term attributes>'.

I played around with caching BitSets for the fields which I'd like to do a
distinct on, but given the amount of data, I run out of memory. I also took
the approach where I retrieve the bitset using a queryfilter and then process
each document id, hashing the field values on which I'm doing a distinct to
construct my distinct set. Problem with this is that I have tree structures
where a parent has over 100K children. Retrieving each doc for this size is
too time- and memory-consuming.

Since I don't really want to return that much data, I thought that I could
use paging. The problem I faced is that I do not know if a distinct value in
the current query was actually returned in some previous query for a previous
page.
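
A minimal sketch of the hashing approach combined with paging, in plain Java rather than Lucene calls. It assumes the matching documents can be walked once in a stable order and that the pager object can be kept alive between page requests; both are assumptions, and the names (DistinctParents, Doc, DistinctPager, nextPage) are hypothetical. The set of already-returned parent keys is what answers the "was this value on an earlier page?" question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class DistinctParents {
    /** Hypothetical stand-in for a matching document: just an id and the distinct key. */
    public static final class Doc {
        public final int id;
        public final String parent;
        public Doc(int id, String parent) {
            this.id = id;
            this.parent = parent;
        }
    }

    /** Hands out pages of documents, at most one document per distinct parent overall. */
    public static final class DistinctPager {
        private final Iterator<Doc> matches;
        // Parent keys already returned on this or any earlier page.
        private final Set<String> seen = new HashSet<>();

        public DistinctPager(Iterable<Doc> matches) {
            this.matches = matches.iterator();
        }

        /** Returns up to pageSize documents whose parent has not been returned before. */
        public List<Doc> nextPage(int pageSize) {
            List<Doc> page = new ArrayList<>();
            while (page.size() < pageSize && matches.hasNext()) {
                Doc doc = matches.next();
                // Set.add returns false when the key is already present, so duplicates are skipped.
                if (seen.add(doc.parent)) {
                    page.add(doc);
                }
            }
            return page;
        }
    }

    /** The five matches from the example: three with parent A, two with parent B. */
    public static List<Doc> exampleMatches() {
        return Arrays.asList(
                new Doc(0, "A"), new Doc(1, "A"), new Doc(2, "A"),
                new Doc(3, "B"), new Doc(4, "B"));
    }

    public static void main(String[] args) {
        DistinctPager pager = new DistinctPager(exampleMatches());
        List<Doc> page = pager.nextPage(1);
        while (!page.isEmpty()) {
            for (Doc doc : page) {
                System.out.println("doc " + doc.id + " for parent " + doc.parent);
            }
            page = pager.nextPage(1);
        }
    }
}
```

With a page size of 1, the first page holds doc 0 (parent A) and the second holds doc 3 (parent B). Memory grows with the number of distinct parents returned so far rather than with the number of matching documents, but every matching document is still visited once, so this does not remove the cost of scanning a parent with 100K children.
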
Sorry for the long description, but wanted to make sure I explained it as
clearly as I could.

-Terry
--
View this message in context: http://www.nabble.com/Multi-field-distinct-query-tf3761682.html#a10633050
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ




