Re: [MarkLogic Dev General] Issue with Mark Logic Query (Michael Blakeley)

Michael Blakeley Sat, 03 Dec 2011 11:08:11 -0800

Right, big O notation doesn't tell the whole story for any system. In this case 
here's how I see it:


1. Search index lookup is O(log n) with database size, and is the same for 
count or estimate since both use the same query.
2. Returning matching fragment ids is O(n) with number of results, and is the 
same for count or estimate.
3. While the results of estimate will be unfiltered, count will filter. This 
requires retrieving fragments and testing their expanded trees, which will be 
O(n) with the number of results.

This is a bit of an oversimplification. In a cluster, for example, some of 
these steps are parallel across the queried forests and some are serialized in 
the eval host. But for most configurations and for the case where we want to 
know how many results matched, (3) will dominate the elapsed time when using 
count. When using estimate, (3) never happens.
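
As a rough illustration of the difference, here is a minimal sketch that runs the same cts:query through both paths and reports the elapsed time of each. It assumes an element range index of type date on meta:DateLoaded, as discussed in the quoted thread below. The namespace URI is a placeholder, since the thread never states the real one, and timings taken with xdmp:elapsed-time() are only approximate.

```xquery
xquery version "1.0-ml";

(: Hypothetical namespace URI; the thread never states the real one. :)
declare namespace meta = "http://example.com/meta";

(: Assumes an element range index of type date on meta:DateLoaded. :)
let $query := cts:and-query((
  cts:element-range-query(xs:QName("meta:DateLoaded"), ">=", xs:date("2011-01-01")),
  cts:element-range-query(xs:QName("meta:DateLoaded"), "<", xs:date("2012-01-01"))
))

(: Steps (1) and (2): resolved from the indexes, unfiltered. :)
let $t0 := xdmp:elapsed-time()
let $estimate := xdmp:estimate(cts:search(fn:doc(), $query))
let $t1 := xdmp:elapsed-time()

(: Steps (1) to (3): every candidate fragment is fetched and filtered. :)
let $count := fn:count(cts:search(fn:doc(), $query))
let $t2 := xdmp:elapsed-time()

return (
  fn:concat("estimate: ", $estimate, " in ", $t1 - $t0),
  fn:concat("count:    ", $count, " in ", $t2 - $t1)
)
```

If only a bounded answer is needed, passing a maximum as the second argument, for example xdmp:estimate(cts:search(fn:doc(), $query), 1000000), should cap the work. The two numbers can differ whenever the query does not resolve exactly from the indexes, because estimate is unfiltered.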

-- Mike

On 3 Dec 2011, at 03:20 , Geert Josten wrote:

> I think there is more to it. Count forces actual data to be retrieved from
> the database nodes, while xdmp:estimate uses memory-based indexes. So it
> can save a lot of latency as well.
> 
> Kind regards,
> Geert
> 
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Paul M
> Sent: Friday, 2 December 2011 17:28
> To: [email protected]
> Subject: Re: [MarkLogic Dev General] Issue with Mark Logic Query
> (Michael Blakeley)
> 
> So if count is O(n), is xdmp:estimate O(log n) or some such? Just curious.
> 
> 
> 
> ----- Original Message -----
> From: "[email protected]"
> <[email protected]>
> To: [email protected]
> Cc:
> Sent: Thursday, December 1, 2011 3:00 PM
> Subject: General Digest, Vol 90, Issue 3
> 
> 
> Today's Topics:
> 
>    1. Re: Issue with Mark Logic Query (Michael Blakeley)
>    2. large (?) number of range indexes (Mike Sokolov)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Thu, 1 Dec 2011 09:05:50 -0800
> From: Michael Blakeley <[email protected]>
> Subject: Re: [MarkLogic Dev General] Issue with Mark Logic Query
> To: General MarkLogic Developer Discussion
>     <[email protected]>
> Cc: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=us-ascii
> 
> To query the value of an element, use an element-value-query term like
> this:
> 
>   cts:element-value-query(xs:QName('meta:DateLoaded'), '2011*')
> 
> But since that uses a wildcard glob, it won't resolve from indexes unless
> you also have appropriate wildcards enabled. If you have an element range
> index on meta:DateLoaded with type=date, it would probably be better to
> specify a range instead of a wildcard:
> 
>   cts:element-range-query(xs:QName('meta:DateLoaded'), '>=',
>     xs:date('2011-01-01')),
>   cts:element-range-query(xs:QName('meta:DateLoaded'), '<',
>     xs:date('2012-01-01'))
> 
> Finally, it may be faster to evaluate the entire cts:query using
> xdmp:estimate(cts:search(doc(), $query)) rather than
> count(cts:uris('', (), $query)). Using count() will be O(n) with the
> number of results. Note that both count and estimate support an optional
> limit argument, which might be useful for your '1 to 1000000' limit.
> 
> -- Mike
> 
> On 1 Dec 2011, at 01:46 , amit gope wrote:
> 
>> Hi All,
>> 
>> I have a database with an element range index on the date element. I am
>> now executing a query that uses an element value query on one of the
>> elements, but the results fetched do not adhere to the query. Please
>> suggest the changes that I need to make.
>> 
>> let $uri :=(cts:uris('', ('document','limit=1000000'),
>>     (cts:and-query((cts:directory-query('/content/', 'infinity'),
>>         cts:element-query((xs:QName('meta:DateLoaded')),'2011*'),
>>         cts:element-query((xs:QName('meta:PubName')),'Springer'),
>>         cts:element-query(xs:QName('Affiliation'), cts:and-query((), ())),
>>         cts:element-query(xs:QName('meta:Institution'),cts:and-query((),())),
>>         cts:not-query(cts:element-query(xs:QName("meta:GeoOrg"),
>>             cts:and-query((), ())))
>>         ), ())), (), ()))[1 to 1000000]
>> return (count($uri),$uri)
>> 
>> 
>> The above query is fetching URIs of articles where meta:DateLoaded is
>> 2010. Please suggest.
>> 
>> --
>> Regards
>> Amit
>> 
>> 
>> _______________________________________________
>> General mailing list
>> [email protected]
>> http://developer.marklogic.com/mailman/listinfo/general
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Thu, 01 Dec 2011 14:23:07 -0500
> From: Mike Sokolov <[email protected]>
> Subject: [MarkLogic Dev General] large (?) number of range indexes
> To: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> I've found that cts:element-values() is *much* faster when you don't use
> a query to filter.  For example,
> 
> cts:element-values (xs:QName("foo"), "a")
> 
> is 25x faster than
> 
> cts:element-values (xs:QName("foo"), "a",
> cts:element-value-query(xs:QName("bar"), "baz"))
> 
> when every document indexed by foo in fact has bar=baz, i.e. when the
> query is essentially a no-op.
> 
> Consequently, we're taking what used to be a bunch of large range
> indexes and breaking them up into a lot of smaller range indexes, each
> of which we can query independently (faster).
> 
> What I'm wondering is if anybody would care to speculate on whether
> having a large number of small(er) indexes will pose some other
> performance problem.  Presumably at least some of the keys will be
> shared across these indexes, but the values (the fragment/document
> references) should not, so overall storage should be only slightly larger?
> 
> --
> Michael Sokolov
> Engineering Director
> www.ifactory.com
> 
> 
> 
> 

