The important thing to realize when dealing with MarkLogic (or, I suspect, ANY 
database) is that you're now dealing with a potentially unbounded list of 
documents and values instead of one document, so
it's critical to do everything possible to bring as little data as possible 
into the XQuery context.
This takes a considerable change of world view when you first do it (at least it 
did for me and everyone I've talked to).
Techniques that worked in a standalone XQuery processor on one document often 
do not translate well when dealing with a database of a million, a billion, or 
a trillion documents.

Your query selects every document in the collection into memory and then 
filters them.  That's a lot of in-memory data, and it is what fills the 
expanded tree cache.

Rob's solution moves that filtering down to the database and index level, where 
only the minimum amount of data is brought into the XQuery context.
In a cluster this would be further optimized by running in parallel across the 
hosts, in a true Map/Reduce style (automagically for you).

This is the life and death of optimizing queries.  Bring as little as you can 
into the XQuery context by making use of the indexing, search, and filtering 
functions.  From there,
you're dealing with a hopefully small set of data in memory, where you can have 
at it XQuery-style.
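
As a rough sketch of what that looks like for the problem in this thread 
(not tested against your data; it assumes the string range index on foobar 
that Florent mentioned, and the dup/uri wrapper elements are only there for 
illustration), you can let the lexicon find the duplicated values and only 
then go after the documents that carry them:

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes an element range index on foobar and that the
   documents to check live in the collection "coll". :)
let $scope := cts:collection-query("coll")

(: Step 1: resolved from the range index, no documents are loaded.
   The query argument restricts both the values and their frequencies
   to fragments in the collection. :)
for $v in cts:element-values(xs:QName("foobar"), (), (), $scope)
            [cts:frequency(.) gt 1]
return
  <dup value="{ $v }">{
    (: Step 2: fetch only the documents that share a duplicated value. :)
    for $doc in cts:search(fn:doc(),
                  cts:and-query((
                    $scope,
                    cts:element-range-query(xs:QName("foobar"), "=", $v))))
    return <uri>{ xdmp:node-uri($doc) }</uri>
  }</dup>
```

Step 2 still loads the duplicate documents, but only those.  If the URI 
lexicon is enabled on the database, passing the same query to cts:uris 
should let you list the URIs without loading any documents at all.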



-----------------------------------------------------------------------------
David Le
Lead Engineer
MarkLogic Corporation
[email protected]
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
http://www.marklogic.com/


From: [email protected] 
[mailto:[email protected]] On Behalf Of Whitby, Rob
Sent: Friday, November 29, 2013 10:21 AM
To: Florent Georges; MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Query needs to be optimized, does not 
complete

Hi Florent,

cts:element-values can take a query:

    cts:element-values(xs:QName("foobar"), (), (),
      cts:collection-query("coll"))[cts:frequency(.) > 1]

Rob


From: Florent Georges <[email protected]<mailto:[email protected]>>
Reply-To: Florent Georges <[email protected]<mailto:[email protected]>>, 
MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Date: Friday, 29 November 2013 15:04
To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Subject: Re: [MarkLogic Dev General] Query needs to be optimized, does not 
complete

  Thank you, John!  This works as is to look for duplicates across the entire 
database.  But I need to restrict it to a specific collection (duplicates are 
expected across the entire database, but not within that collection). 
 So the query I came up with is:

    fn:collection('coll')/*[
      cts:element-values(xs:QName('foobar'))[cts:frequency(.) > 1]]

  Unfortunately, it throws the error:

    XDMP-EXPNTREECACHEFULL: xdmp:eval([...]) -- Expanded tree cache
    full on host [...]

  Even if I call xdmp:expanded-tree-cache-clear(), it still throws the same 
error.  Using count(cts:element-values(... is fine though.

  Any idea?

--
Florent Georges
http://fgeorges.org/
http://h2oconsulting.be/


On Friday, 29 November 2013 at 11:25, John Snelson 
<[email protected]<mailto:[email protected]>> wrote:
On 29/11/13 11:18, Florent Georges wrote:
>    Hi,
>
>    I have a few million entities in a collection, and would like to
> find duplicates in it.  There are several possible root element
> names (not all documents have the same root element).
>
>    By duplicates, I mean any documents where the value of the element
> /*/foobar is the same.  foobar is declared as a string in the schema.
> So what I am looking for, really, is all the values of
> /*/foobar that appear in more than one document in that collection.
>
>    Because there are several million documents in the collection,
> it is not possible to use the naive query that would look something
> like the following:
>
>      for $f in fn:collection('coll')/*/foobar
>      where fn:exists(fn:collection('coll')[/*/foobar eq $f][2])
>      return
>        <dup>{ $f }</dup>
>
>    There is a range element index configured for the element foobar.
>
>    Any idea how I can optimize the query (so it actually completes)?
>
>    Regards,

Use an element lexicon lookup:

cts:element-values(xs:QName("foobar"))[cts:frequency(.)>1]

This will give you all the values of the foobar element for which there
are duplicates. Then you'll need to look up the documents for those values.

John

--
John Snelson, Lead Engineer                    http://twitter.com/jpcs
MarkLogic Corporation                        http://www.marklogic.com/

_______________________________________________
General mailing list
[email protected]<mailto:[email protected]>
http://developer.marklogic.com/mailman/listinfo/general
