The information you're looking for isn't part of the on-disk structures, so 
decoding the fragment is the only built-in way to measure it. Also watch out 
for discrepancies between on-disk size and string-length: on-disk size 
includes index size and deleted fragments, while the compressed tree storage 
will be smaller than the string-length.

But if string-length is accurate enough, then sampling is probably accurate 
enough too. So I would just sample N documents from the collection, average 
their string-lengths, and multiply by the xdmp:estimate count for the 
collection. That ought to give you a pretty good idea with reasonable speed. 
If the default document order skews the sample, this might be a good use for 
the 'score-random' option to cts:search.
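
For example, here is a minimal sketch of that sampling approach. The 
collection name "my-collection" and the sample size of 1000 are placeholders, 
and it measures fn:string-length of each document node, which is the same 
thing you are measuring now:

  let $collection := "my-collection"
  let $sample-size := 1000
  let $sample :=
    fn:subsequence(
      cts:search(
        fn:doc(),
        cts:collection-query($collection),
        ("unfiltered", "score-random")),  (: random order, not document order :)
      1, $sample-size)
  let $avg := fn:avg(for $doc in $sample return fn:string-length($doc))
  return $avg * xdmp:estimate(fn:collection($collection))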

If sampling doesn't work out, you could add an element or a property that 
tracks string-length. That could happen as part of existing ingestion 
processing, or using CPF. However you generate the data, create an appropriate 
range-index and then use it with cts:sum or cts:sum-aggregate.
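
As a sketch of that idea (the doc-size element name, the unsignedLong type, 
and the collection name are all assumptions on my part, not anything you have 
already set up): record the size at ingest time, for example in the 
properties fragment, then configure an element range index on that element 
and sum it.

  (: at ingest time, stamp the size into a property;
     $uri is the URI of the document just inserted :)
  xdmp:document-set-property(
    $uri,
    <doc-size>{fn:string-length(fn:doc($uri))}</doc-size>)

  (: later, with an unsignedLong element range index on doc-size :)
  cts:sum-aggregate(
    cts:element-reference(xs:QName("doc-size"), "type=unsignedLong"),
    (),
    cts:collection-query("my-collection"))

If you stamp the size into properties, check that scoping the aggregate with 
cts:collection-query behaves as you expect for properties fragments; storing 
the size as an element inside the document itself avoids the question, and 
the same cts:sum-aggregate call applies either way.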

-- Mike

On 28 Mar 2013, at 08:50 , "Gary Larsen" <[email protected]> wrote:

> Hi,
>  
> Is there a quick method to estimate the size in bytes of all documents in a 
> collection?  I'd like to determine where the size of the database is 
> increasing.  I realize that document size may not mirror the database size, 
> but it's good enough for what I need.
>  
> Reading each document’s string-length is painfully slow on large collections. 
>  Thanks
>  
> Gary

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
