Excellent! Thank you both, John and Rob!
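
For the archive, here is a rough sketch of how the two suggestions below
might be combined to also list the documents behind each duplicated value.
It is untested and makes a few assumptions: the element range index on
foobar uses the default collation, the default fragment-based frequency is
what is wanted (a value repeated inside a single document counts once), and
the <dup>/<uri> wrapper elements are only illustrative names.

    xquery version "1.0-ml";

    (: Lexicon lookup restricted to the collection; keep only the
       values that occur in more than one fragment. :)
    for $v in cts:element-values(xs:QName("foobar"), (), (),
                cts:collection-query("coll"))[cts:frequency(.) gt 1]
    return
      <dup value="{ $v }">{
        (: Resolve the documents holding this value through the
           range index instead of walking the whole collection. :)
        for $d in cts:search(fn:collection("coll"),
                    cts:element-range-query(xs:QName("foobar"), "=", $v))
        return <uri>{ xdmp:node-uri($d) }</uri>
      }</dup>

With many duplicated values the result could still be large, so it may be
worth looking at a slice of the value list first.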
--
Florent Georges
http://fgeorges.org/
http://h2oconsulting.be/
On Friday, 29 November 2013 at 15:21, "Whitby, Rob" <[email protected]>
wrote:
>Hi Florent,
>
>
>cts:element-values can take a query:
>cts:element-values(xs:QName("foobar"), (), (),
>cts:collection-query("coll"))[cts:frequency(.)>1]
>
>
>Rob
>
>
>
>
>From: Florent Georges <[email protected]>
>Reply-To: Florent Georges <[email protected]>, MarkLogic Developer Discussion
><[email protected]>
>Date: Friday, 29 November 2013 15:04
>To: MarkLogic Developer Discussion <[email protected]>
>Subject: Re: [MarkLogic Dev General] Query needs to be optimized, does not
>complete
>
>
>
> Thank you, John! This works as is to find duplicates across the entire
>database. But I need to restrict it to a specific collection (duplicates are
>expected across the entire database, but not within that collection). So the
>query I came up with is:
>
> fn:collection('coll')/*[
> cts:element-values(xs:QName('foobar'))[cts:frequency(.) > 1]]
>
> Unfortunately, it throws the error:
>
> XDMP-EXPNTREECACHEFULL: xdmp:eval([...]) -- Expanded tree cache
> full on host [...]
>
> Even if I call xdmp:expanded-tree-cache-clear(), it still throws the same
>error. Using count(cts:element-values(... is fine though.
>
> Any idea?
>
>--
>Florent Georges
>http://fgeorges.org/
>http://h2oconsulting.be/
>
>
>
>
>
>
>On Friday, 29 November 2013 at 11:25, John Snelson <[email protected]>
>wrote:
>
>On 29/11/13 11:18, Florent Georges wrote:
>>> Hi,
>>>
>>> I have a few million entities in a collection, and would like to
>>> find duplicates in it. There are several possible root element
>>> names (not all documents have the same root element).
>>>
>>> By duplicates, I mean any documents where the value of the element
>>> /*/foobar is the same. foobar is declared as a string in the schema.
>>> So what I am looking for, really, is the set of all the values of
>>> /*/foobar that appear in more than one document in that collection.
>>>
>>> Because there are several million documents in the collection,
>>> it is not possible to use the naive query that would look something
>>> like the following:
>>>
>>> for $f in fn:collection('coll')/*/foobar
>>> where fn:exists(fn:collection('coll')[/*/foobar eq $f][2])
>>> return
>>> <dup>{ $f }</dup>
>>>
>>> There is a range element index configured for the element foobar.
>>>
>>> Any idea how I can optimize the query (so it actually completes)?
>>>
>>> Regards,
>>
>>Use an element lexicon lookup:
>>
>>cts:element-values(xs:QName("foobar"))[cts:frequency(.)>1]
>>
>>This will give you all the values of the foobar element for which there
>>are duplicates. Then you'll need to look up the documents for those values.
>>
>>John
>>
>>--
>>John Snelson, Lead Engineer http://twitter.com/jpcs
>>MarkLogic Corporation http://www.marklogic.com
>>
>>_______________________________________________
>>General mailing list
>>[email protected]
>>http://developer.marklogic.com/mailman/listinfo/general
>>
>>
>>
>