On 29/11/13 11:18, Florent Georges wrote:
>    Hi,
>
>    I have a few million entities in a collection, and would like to
> find duplicates in it.  There are several possible root element
> names (not all documents have the same root element).
>
>    By duplicates, I mean any documents where the value of the element
> /*/foobar is the same.  Foo is declared as a string in the schema.
> So what I am looking for, really, is all the values of /*/foobar
> that appear in more than one document in that collection.
>
>    Because there are several million documents in the collection,
> it is not possible to use the naive query that would look something
> like the following:
>
>      for $f in fn:collection('coll')/*/foobar
>      where fn:exists(fn:collection('coll')[/*/foobar eq $f][2])
>      return
>        <dup>{ $f }</dup>
>
>    There is a range element index configured for the element foobar.
>
>    Any idea how I can optimize the query (so it actually completes)?
>
>    Regards,

Use an element lexicon lookup:

cts:element-values(xs:QName("foobar"))[cts:frequency(.)>1]

This will give you all the values of the foobar element for which there 
are duplicates. Then you'll need to look up the documents for those values.
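
A rough sketch of both steps together, in case it helps. It makes several
assumptions: foobar is in no namespace, the collection is named 'coll' as
in your query, the URI lexicon is enabled (cts:uris needs it), and each
document is a single fragment, so that the frequency counts documents.
The <uri> wrapper and the value/count attributes are only illustrative.

```xquery
xquery version "1.0-ml";

(: Assumptions: element range index on foobar (no namespace),
   collection 'coll', URI lexicon enabled, one fragment per document. :)
let $name  := xs:QName("foobar")
let $scope := cts:collection-query("coll")
(: Step 1: values from the range index that occur in more than one fragment. :)
for $v in cts:element-values($name, (), (), $scope)[cts:frequency(.) gt 1]
return
  <dup value="{ $v }" count="{ cts:frequency($v) }">{
    (: Step 2: URIs of the documents in the collection holding that value. :)
    for $uri in cts:uris((), (),
                  cts:and-query(($scope,
                                 cts:element-range-query($name, "=", $v))))
    return <uri>{ $uri }</uri>
  }</dup>
```

Note that the range index covers foobar elements at any depth, not only
/*/foobar, so if the element can also appear deeper in a document you
would have to filter the results afterwards.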

John

-- 
John Snelson, Lead Engineer                    http://twitter.com/jpcs
MarkLogic Corporation                         http://www.marklogic.com