Hi Florent,
cts:element-values can take a query:
cts:element-values(xs:QName("foobar"), (), (),
cts:collection-query("coll"))[cts:frequency(.)>1]
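
To see how often each duplicated value occurs, the same call can be wrapped in a FLWOR expression. This is only a sketch: it assumes the default fragment-frequency behaviour of the lexicon, so the count is the number of fragments (here, documents) in the collection that carry the value. The <dup> element and its count attribute are illustrative names, not part of any API.

```xquery
xquery version "1.0-ml";

(: Sketch: report each foobar value that occurs in more than one
   document of the collection "coll", together with its frequency.
   Requires the element range index on foobar mentioned in the thread. :)
for $value in cts:element-values(xs:QName("foobar"), (), (),
                cts:collection-query("coll"))
let $count := cts:frequency($value)
where $count gt 1
return <dup count="{ $count }">{ $value }</dup>
```

Because the values and frequencies come from the range index, no document should need to be loaded into the expanded tree cache for this step.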
Rob
From: Florent Georges <[email protected]>
Reply-To: Florent Georges <[email protected]>,
MarkLogic Developer Discussion <[email protected]>
Date: Friday, 29 November 2013 15:04
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Query needs to be optimized, does not
complete
Thank you, John! This works as is to look for duplicates across the entire
database. But I need to restrict it to a specific collection (duplicates are
expected across the entire database, but not within that collection).
So the query I came up with is:
fn:collection('coll')/*[
cts:element-values(xs:QName('foobar'))[cts:frequency(.) > 1]]
Unfortunately, it throws the error:
XDMP-EXPNTREECACHEFULL: xdmp:eval([...]) -- Expanded tree cache
full on host [...]
Even if I call xdmp:expanded-tree-cache-clear(), it still throws the same
error. Using count(cts:element-values(... is fine though.
Any idea?
--
Florent Georges
http://fgeorges.org/
http://h2oconsulting.be/
On Friday, 29 November 2013 at 11:25, John Snelson
<[email protected]> wrote:
On 29/11/13 11:18, Florent Georges wrote:
> Hi,
>
> I have a few million entities in a collection, and would like to
> find duplicates in it. There are several possible root element
> names (not all documents have the same root element).
>
> By duplicates, I mean any documents where the value of the element
> /*/foobar is the same. The foobar element is declared as a string in
> the schema. So what I am looking for, really, is all the values of
> /*/foobar that appear in more than one document in that collection.
>
> Because there are several million documents in the collection,
> it is not possible to use the naive query that would look something
> like the following:
>
> for $f in fn:collection('coll')/*/foobar
> where fn:exists(fn:collection('coll')[/*/foobar eq $f][2])
> return
> <dup>{ $f }</dup>
>
> There is a range element index configured for the element foobar.
>
> Any idea how I can optimize the query (so it actually completes)?
>
> Regards,
Use an element lexicon lookup:
cts:element-values(xs:QName("foobar"))[cts:frequency(.)>1]
This will give you all the values of the foobar element for which there
are duplicates. Then you'll need to look up the documents for those values.
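
That lookup step could be sketched roughly as follows. It is an illustration under assumptions, not a tested solution: it assumes the range index on foobar uses the default collation (so no collation option is passed to the range query), and the <dup> and <uri> element names are made up for the example. Like the expression above, it looks across the whole database.

```xquery
xquery version "1.0-ml";

(: Sketch: for each duplicated foobar value, list the URIs of the
   documents that contain it. The range query is resolved from the
   same element range index that backs the lexicon. :)
let $name := xs:QName("foobar")
for $value in cts:element-values($name)[cts:frequency(.) gt 1]
return
  <dup value="{ $value }">{
    for $doc in cts:search(fn:doc(),
                  cts:element-range-query($name, "=", $value))
    return <uri>{ xdmp:node-uri($doc) }</uri>
  }</dup>
```

Only the documents that actually share a value are fetched, which should keep the result set far smaller than the full set of documents.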
John
--
John Snelson, Lead Engineer http://twitter.com/jpcs
MarkLogic Corporation
http://www.marklogic.com/
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general