Excellent!  Thank you both, John and Rob!
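
For the archive, here is a rough sketch of how the two suggestions might be combined: Rob's collection-constrained lexicon call finds the duplicated values, and a range query then looks up the documents for each value, as John mentioned. It assumes the element range index on foobar is a string index using the default collation, and that returning document URIs via xdmp:node-uri is acceptable. The variable names and the <dup> wrapper are only illustrative. Note that the range query matches foobar anywhere in a document, not only at /*/foobar.

```xquery
xquery version "1.0-ml";

(: Restrict both the lexicon lookup and the search to the collection. :)
let $coll := cts:collection-query("coll")
(: Values of foobar that occur in more than one fragment of the collection. :)
for $value in cts:element-values(xs:QName("foobar"), (), (), $coll)
              [cts:frequency(.) gt 1]
return
  <dup value="{ $value }">{
    (: Assumption: the default collation matches the range index. :)
    for $doc in cts:search(fn:doc(),
                  cts:and-query((
                    $coll,
                    cts:element-range-query(xs:QName("foobar"), "=", $value))))
    return <uri>{ xdmp:node-uri($doc) }</uri>
  }</dup>
```

If there are many duplicated values, it may be worth paging through them rather than returning everything in one request.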

-- 

Florent Georges
http://fgeorges.org/
http://h2oconsulting.be/





On Friday, 29 November 2013 at 15:21, "Whitby, Rob" <[email protected]> wrote:
 
>Hi Florent,
>
>
>cts:element-values can take a query:
>    cts:element-values(xs:QName("foobar"), (), (),
>      cts:collection-query("coll"))[cts:frequency(.) > 1]
>
>
>Rob
>
>
>
>
>From: Florent Georges <[email protected]>
>Reply-To: Florent Georges <[email protected]>, MarkLogic Developer Discussion 
><[email protected]>
>Date: Friday, 29 November 2013 15:04
>To: MarkLogic Developer Discussion <[email protected]>
>Subject: Re: [MarkLogic Dev General] Query needs to be optimized, does not 
>complete
>
>
>
>  Thank you, John!  This works as is to look for duplicates across the entire 
>database.  But I need to restrict it to a specific collection (duplicates are 
>expected across the entire database, but not within that collection).  So the 
>query I came up with is:
>
>    fn:collection('coll')/*[
>      cts:element-values(xs:QName('foobar'))[cts:frequency(.) > 1]]
>
>  Unfortunately, it throws the error:
>
>    XDMP-EXPNTREECACHEFULL: xdmp:eval([...]) -- Expanded tree cache
>    full on host [...]
>
>  Even if I call xdmp:expanded-tree-cache-clear(), it still throws the same 
>error.  Using count(cts:element-values(... is fine though.
>
>  Any idea?
>
>-- 
>Florent Georges
>http://fgeorges.org/
>http://h2oconsulting.be/
>
>
>
>
>
>
>On Friday, 29 November 2013 at 11:25, John Snelson <[email protected]> wrote:
>
>On 29/11/13 11:18, Florent Georges wrote:
>>>    Hi,
>>>
>>>    I have a few million entities in a collection, and would like to
>>> find duplicates in it.  There are several possible root element
>>> names (not all documents have the same root element).
>>>
>>>    By duplicates, I mean any documents where the value of the element
>>> /*/foobar is the same.  The element foobar is declared as a string in
>>> the schema.  So what I am looking for, really, is all the values of
>>> /*/foobar that appear in more than one document in that collection.
>>>
>>>    Because there are several million documents in the collection,
>>> it is not possible to use the naive query that would look something
>>> like the following:
>>>
>>>      for $f in fn:collection('coll')/*/foobar
>>>      where fn:exists(fn:collection('coll')[/*/foobar eq $f][2])
>>>      return
>>>        <dup>{ $f }</dup>
>>>
>>>    There is a range element index configured for the element foobar.
>>>
>>>    Any idea how I can optimize the query (so it actually completes)?
>>>
>>>    Regards,
>>
>>Use an element lexicon lookup:
>>
>>cts:element-values(xs:QName("foobar"))[cts:frequency(.)>1]
>>
>>This will give you all the values of the foobar element for which there 
>>are duplicates. Then you'll need to look up the documents for those values.
>>
>>John
>>
>>-- 
>>John Snelson, Lead Engineer                    http://twitter.com/jpcs
>>MarkLogic Corporation                        http://www.marklogic.com 
>>
>>_______________________________________________
>>General mailing list
>>[email protected]
>>http://developer.marklogic.com/mailman/listinfo/general
>>
>>
>>
>
>