Thanks, Michael and John, for the thorough and useful responses.

David S.
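
P.S. For the archive, here is roughly how I understand the four approaches
discussed below would look as queries. This is an untested sketch: the
collection URI, the directory path, the element name "publication-id", and
the search word are all placeholders, and option (3) assumes a string
element range index has already been configured on that element.

```xquery
xquery version "1.0-ml";

(: Placeholder values, not taken from our real data. :)
declare variable $pub  := "pub-a";
declare variable $text := cts:word-query("liberty");

(: (1) documents assigned to a collection named after the publication :)
let $by-collection :=
  cts:and-query((cts:collection-query($pub), $text))

(: (2) documents loaded under one directory per publication :)
let $by-directory :=
  cts:and-query((
    cts:directory-query(fn:concat("/", $pub, "/"), "infinity"),
    $text))

(: (3) element range index; assumes a string range index exists
       on publication-id :)
let $by-range :=
  cts:and-query((
    cts:element-range-query(xs:QName("publication-id"), "=", $pub),
    $text))

(: (4) element-value query, no range index required :)
let $by-value :=
  cts:and-query((
    cts:element-value-query(xs:QName("publication-id"), $pub),
    $text))

(: Swap in any of the four queries above to compare them. :)
return cts:search(fn:doc(), $by-value)[1 to 10]
```

If we do end up wanting a facet, my understanding is that option (3) is what
makes a call such as cts:element-values(xs:QName("publication-id")) possible,
since that function reads from the range index.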

On Mon, 16 Jan 2012, Michael Blakeley wrote:

> If you think you might want to introduce a search facet on publication-id, 
> then I would go with approach (3). Otherwise, (1) and (2) are cheaper and 
> pretty much equivalent in performance.
>
> There is also option (4): use an element-value query, without a range index, 
> and rely on the automatic element-value indexing. For your primary use case, 
> the performance of (4) won't be significantly different from the other 
> options, and it will use the least disk space and memory.
>
> Approach (4) only becomes unsuitable if you need to check thousands or
> millions of publication-id values in a single query. At that point each
> list-cache miss can drive an I/O read, and those reads add up quickly at
> that scale. With only five values, though, your list-cache misses on
> publication-id should be few and cheap.
>
> -- Mike
>
> On 16 Jan 2012, at 07:57, David Sewell wrote:
>
>> We're developing a MarkLogic-based project where the data consists of around
>> 100K XML documents. Each document belongs to one of 5 different publications,
>> which need to be differentiated for certain searches. I'm aware of at least
>> three methods of handling this differentiation:
>>
>> 1) assign each document to a collection and use cts:collection-query() or
>> equivalent;
>>
>> 2) load documents into subdirectories, one to each publication, and use
>> cts:directory-query() or equivalent;
>>
>> 3) store publication identifier in the XML data as an element, then create an
>> element range index to enable searches on it.
>>
>> Is there any way to guesstimate which of these approaches will have the best
>> performance when combined with various word and element queries, or will it
>> require empirical testing?
>>
>> David
>>
>> --
>> David Sewell, Editorial and Technical Manager
>> ROTUNDA, The University of Virginia Press
>> PO Box 400314, Charlottesville, VA 22904-4314 USA
>> Email: [email protected]   Tel: +1 434 924 9973
>> Web: http://rotunda.upress.virginia.edu/
>> _______________________________________________
>> General mailing list
>> [email protected]
>> http://developer.marklogic.com/mailman/listinfo/general
>>
>

-- 
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: [email protected]   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/