The important thing to realize when dealing with MarkLogic (or, I suspect, ANY database) ... now that you're dealing with a potentially unbounded list of documents and values instead of one document, it's critical to do everything possible to bring as little data as possible into the XQuery context. This takes a considerable change of world view when you first do it (at least it did for me and everyone I've talked to) ... Techniques that worked in a standalone XQuery processor on one document often do not translate well to a database of a million, a billion, or a trillion documents.
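To make that concrete, here is a minimal sketch of the index-first pattern, using the names from the question quoted below: a collection "coll" and an element range index on foobar. The document lookup in the second step is an illustration, not something posted in the thread, and it assumes the range index uses the default collation (otherwise a "collation=..." option would be needed on the range query).

```xquery
xquery version "1.0-ml";

(: Step 1: resolve the duplicated values from the range index alone.
   No documents are loaded here; cts:frequency reports how many
   fragments in "coll" contain each value. :)
let $dups :=
  cts:element-values(
    xs:QName("foobar"), (), (),
    cts:collection-query("coll")
  )[cts:frequency(.) gt 1]

(: Step 2: only now touch documents, and only those that hold one of
   the duplicated values. :)
for $v in $dups
return
  <dup value="{ $v }" count="{ cts:frequency($v) }">{
    for $doc in cts:search(
                  fn:collection("coll"),
                  cts:element-range-query(xs:QName("foobar"), "=", $v))
    return <uri>{ xdmp:node-uri($doc) }</uri>
  }</dup>
```

If the list of duplicated values is itself very large, the second step will still be expensive; in that case it may be worth returning only the values and counts from step 1 and resolving documents in batches.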
Your query selects every document in the collection into memory and then filters it. That's a lot of in-memory data. Rob's solution moves that filtering down to the database and index level, where only the minimum amount of data is brought into the XQuery context. In a cluster this would be further optimized by being done in parallel in a true Map/Reduce style (automagically for you).

This is the life and death of optimizing queries. Bring as little as you can into the XQuery context by making use of indexing, search, and filtering functions; from there you're dealing with a hopefully small set of data in memory, where you can have at it XQuery-style.

-----------------------------------------------------------------------------
David Le
Lead Engineer
MarkLogic Corporation
[email protected]
Phone: +1 812-482-5224
Cell: +1 812-630-7622
www.marklogic.com

From: [email protected] [mailto:[email protected]] On Behalf Of Whitby, Rob
Sent: Friday, November 29, 2013 10:21 AM
To: Florent Georges; MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Query needs to be optimized, does not complete

Hi Florent,

cts:element-values can take a query:

    cts:element-values(xs:QName("foobar"), (), (), cts:collection-query("coll"))[cts:frequency(.)>1]

Rob

From: Florent Georges <[email protected]>
Reply-To: Florent Georges <[email protected]>, MarkLogic Developer Discussion <[email protected]>
Date: Friday, 29 November 2013 15:04
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Query needs to be optimized, does not complete

Thank you, John! This works as is to look for duplicates across the entire database. But I need to restrict it to a specific collection (duplicates are expected across the entire database, but not within that collection).
So the query I came up with is:

    fn:collection('coll')/*[
      cts:element-values(xs:QName('foobar'))[cts:frequency(.) > 1]]

Unfortunately, it throws the error:

    XDMP-EXPNTREECACHEFULL: xdmp:eval([...]) -- Expanded tree cache full on host [...]

Even if I call xdmp:expanded-tree-cache-clear(), it still throws the same error. Using count(cts:element-values(... is fine though.

Any idea?

--
Florent Georges
http://fgeorges.org/
http://h2oconsulting.be/

On Friday, 29 November 2013 at 11:25, John Snelson <[email protected]> wrote:

On 29/11/13 11:18, Florent Georges wrote:
> Hi,
>
> I have a few million entities in a collection, and would like to
> find duplicates in it. There are several possible root element
> names (not all documents have the same root element).
>
> By duplicates, I mean any documents where the value of the element
> /*/foobar is the same. foobar is declared as a string in the schema.
> So what I am looking for, really, is the set of all values of
> /*/foobar that appear in more than one document in that collection.
>
> Because there are several million documents in the collection,
> it is not possible to use the naive query that would look something
> like the following:
>
> for $f in fn:collection('coll')/*/foobar
> where fn:exists(fn:collection('coll')[/*/foobar eq $f][2])
> return
> <dup>{ $f }</dup>
>
> There is a range element index configured for the element foobar.
>
> Any idea how I can optimize the query (so it actually completes)?
>
> Regards,

Use an element lexicon lookup:

    cts:element-values(xs:QName("foobar"))[cts:frequency(.)>1]

This will give you all the values of the foobar element for which there are duplicates. Then you'll need to look up the documents for those values.

John

--
John Snelson, Lead Engineer
http://twitter.com/jpcs
MarkLogic Corporation
http://www.marklogic.com
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
