Thank you, John!  This works as is to find duplicates across the entire 
database.  But I need to restrict it to a specific collection (duplicates are 
expected across the entire database, but not within that collection). 
So the query I came up with is:

    fn:collection('coll')/*[
      cts:element-values(xs:QName('foobar'))[cts:frequency(.) > 1]]

  Unfortunately, it throws the error:

    XDMP-EXPNTREECACHEFULL: xdmp:eval([...]) -- Expanded tree cache
    full on host [...]

  Even if I call xdmp:expanded-tree-cache-clear(), it still throws the same 
error.  Using count(cts:element-values(... is fine though.

  Any idea?
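
  One possible direction, as a sketch rather than a confirmed fix: the
query above applies its predicate to every root element returned by
fn:collection('coll'), so every document in the collection may get loaded
into the expanded tree cache, while the lexicon call itself is never
restricted to the collection.  cts:element-values accepts a cts:query as
its fourth argument, so the restriction could be pushed into the lexicon
lookup with cts:collection-query.  The names 'coll' and 'foobar' come from
the query above; that cts:frequency then counts only fragments in that
collection, and that one fragment corresponds to one document, are
assumptions.

    xquery version "1.0-ml";

    (: Sketch only: restrict the lexicon lookup to the collection instead
       of walking the documents.  Assumes the element range index on
       foobar and the default fragment-frequency counting. :)
    for $v in cts:element-values(
                xs:QName('foobar'), (), (),
                cts:collection-query('coll'))
    where cts:frequency($v) gt 1
    return
      <dup count="{ cts:frequency($v) }">{ $v }</dup>

  This resolves from the indexes only, so it should not need to expand any
document; the documents for each duplicated value would still have to be
looked up in a second step.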

-- 
Florent Georges
http://fgeorges.org/
http://h2oconsulting.be/






On Friday 29 November 2013 at 11:25, John Snelson <[email protected]> 
wrote:
 
On 29/11/13 11:18, Florent Georges wrote:
>>    Hi,
>>
>>    I have a few million entities in a collection, and would like to
>> find duplicates in it.  There are several possible root element
>> names (not all documents have the same root element).
>>
>>    By duplicates, I mean any documents where the value of the element
>> /*/foobar is the same.  foobar is declared as a string in the schema.
>> So what I am looking for, really, is all the values of /*/foobar
>> that appear in more than one document in that collection.
>>
>>    Because there are several million documents in the collection,
>> it is not possible to use the naive query that would look something
>> like the following:
>>
>>      for $f in fn:collection('coll')/*/foobar
>>      where fn:exists(fn:collection('coll')[/*/foobar eq $f][2])
>>      return
>>        <dup>{ $f }</dup>
>>
>>    There is a range element index configured for the element foobar.
>>
>>    Any idea how I can optimize the query (so it actually completes)?
>>
>>    Regards,
>
>Use an element lexicon lookup:
>
>cts:element-values(xs:QName("foobar"))[cts:frequency(.)>1]
>
>This will give you all the values of the foobar element for which there 
>are duplicates. Then you'll need to look up the documents for those values.
>
>John
>
>-- 
>John Snelson, Lead Engineer                    http://twitter.com/jpcs
>MarkLogic Corporation                        http://www.marklogic.com
>
>_______________________________________________
>General mailing list
>[email protected]
>http://developer.marklogic.com/mailman/listinfo/general
>
>
>