Re: [MarkLogic Dev General] collection() vs. xdmp:directory() vs. element range index performance

Evan Lenz Tue, 17 Jan 2012 14:32:02 -0800

You can do faceting with collections too (option #1). Each collection URI
would have the name and value embedded in it, e.g.: pub/web, pub/print,
etc.
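
As a minimal sketch of that naming scheme, a document could be placed in
such a collection as follows. The document URI /docs/example.xml is
hypothetical and only for illustration:

```xquery
xquery version "1.0-ml";

(: Hypothetical document URI. The collection URI "pub/web" encodes the
   facet name ("pub") and the facet value ("web"). :)
xdmp:document-add-collections("/docs/example.xml", "pub/web")
```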


Then you can use a collection constraint. Assuming "pub/" as the URI
prefix, you'd pass this in the options node for search:search():

<constraint name="pub">
  <collection prefix="pub/"/>
</constraint>

The Search API then uses cts:collection-match("pub/*") under the covers
for fast retrieval of the facet values (this requires the collection
lexicon to be enabled on the database).
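
Here is a rough sketch of how the pieces might fit together. It assumes
the collection lexicon is enabled and that documents already belong to
collections such as pub/web and pub/print. The query text "pub:web" and
the variable name are illustrative only:

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Options with a single collection constraint named "pub".
   Collections pub/web, pub/print, ... should surface as the
   facet values web, print, ... :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="pub">
      <collection prefix="pub/"/>
    </constraint>
  </options>

(: "pub:web" restricts the results to documents in the pub/web
   collection. :)
return search:search("pub:web", $options)
```

With default options, the search:response should also include a
search:facet element named "pub" listing each value and its count.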

This is how we do faceted search on the Developer Community website, as
described here: 
http://developer.marklogic.com/blog/collection-constraints-are-cool



Evan Lenz
Software Developer, Community
MarkLogic Corporation
http://developer.marklogic.com



On 1/16/12 9:03 AM, "Michael Blakeley" <[email protected]> wrote:

>If you think you might want to introduce a search facet on
>publication-id, then I would go with approach (3). Otherwise, (1) and (2)
>are cheaper and pretty much equivalent in performance.
>
>There is also option (4): use an element-value query, without a range
>index, and rely on the automatic element-value indexing. For your primary
>use case, the performance of (4) won't be significantly different from
>the other options, and it will use the least disk space and memory.
>
>Approach (4) only becomes unsuitable if you need to check thousands or
>millions of publication-id values in a single query. At that point each
>list-cache miss can drive an I/O read, which gets expensive for thousands
>or millions. With only five values, though, your list-cache misses on
>publication-id should be few and cheap.
>
>-- Mike
>
>On 16 Jan 2012, at 07:57 , David Sewell wrote:
>
>> We're developing a MarkLogic-based project where the data consists of
>> around 100K XML documents. Each document belongs to one of 5 different
>> publications, which need to be differentiated for certain searches. I'm
>> aware of at least three methods of handling this differentiation:
>> 
>> 1) assign each document to a collection and use cts:collection-query()
>> or equivalent;
>> 
>> 2) load documents into subdirectories, one to each publication, and use
>> cts:directory-query() or equivalent;
>> 
>> 3) store publication identifier in the XML data as an element, then
>> create an element range index to enable searches on it.
>> 
>> Is there any way to guesstimate which of these approaches will have the
>> best performance when combined with various word and element queries,
>> or will it require empirical testing?
>> 
>> David
>> 
>> -- 
>> David Sewell, Editorial and Technical Manager
>> ROTUNDA, The University of Virginia Press
>> PO Box 400314, Charlottesville, VA 22904-4314 USA
>> Email: [email protected]   Tel: +1 434 924 9973
>> Web: http://rotunda.upress.virginia.edu/
>> _______________________________________________
>> General mailing list
>> [email protected]
>> http://developer.marklogic.com/mailman/listinfo/general
>> 
