At this point, it's tough to be confident in estimating an upper bound on document size because we only have restricted samples of data to work from. However, the following are from the current sample I'm using:

  sample size:  500
  max doc size: ~1 M
  avg doc size: 208K

Regarding updates, we can adjust the rate, but it will ultimately depend on data currency requirements per client. Of course, it's the average case that we're interested in when reasoning about how overall activity affects performance.

I appreciate the feedback. I'm hearing that options (1) and (3) are both reasonable, and that the choice should be motivated by document sizes and update patterns. For now, I plan to go with (1), the simplest. As we get more data, I'll monitor performance and consider switching to option (3) as a backup strategy.

Thanks,
Karl

On Fri, Dec 18, 2009 at 2:09 PM, Kelly Stirman <[email protected]> wrote:
> It sounds like the only downside to approach 1 is the assumption that updates
> will be slow. Generally speaking, MarkLogic processes updates very quickly,
> on the order of 1 MB/sec/CPU. So, could you tell us more about how large
> these documents might become, and the volume of updates to be processed per
> day?
>
> If your infrastructure can accommodate the volume of updates, I think
> approach 1 is the best option.
>
> Kelly
>
> Message: 2
> Date: Thu, 17 Dec 2009 18:39:35 -0600
> From: Karl Erisman <[email protected]>
> Subject: [MarkLogic Dev General] Fragmentation planning
> To: General Mark Logic Developer Discussion
>     <[email protected]>
> Message-ID:
>     <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1
>
> My recent discovery that cts:and-query() does not span fragments (and
> helpful input from list contributors) raises a new issue, one that we
> all face: fragmentation.
>
> A recent thread ("Creating Collections") dealt with this issue and
> contained some interesting ideas, especially many reasons to avoid
> using fragments. However, there must be times when fragmentation is
> necessary, but the thread seems to have ended without a clear
> resolution on when and how to employ fragments, and when to avoid
> them.
>
> Kelly Stirman cautions against the use of fragments:
>
>   "...I think you'll find you have more options, the server is easy
>   to use, it will be more difficult to make a false step, and you'll have
>   more in common with other developers if you don't use fragmentation and
>   instead load your nodes as individual documents. You may not have run
>   into any limitations thus far, but in my experience you will eventually."
>
>   source:
>   http://www.mail-archive.com/[email protected]/msg03478.html
>
> That's useful information (!), but loading new data in ML must not be
> as simple as avoiding fragmentation. The decision involves many
> factors, including these well-known ones:
>   (A) average document size
>   (B) optimizing query performance (in my case, I'm interested in a
> related factor that I'd call "query strategy")
>   (C) optimizing update performance
>
> As promised, I'll describe my situation at a high level. Consider it
> a mini case study. I can follow up with more detail if necessary.
>
> We store "patient charts" in documents, currently one document per
> patient. Included in each chart is data that should be updated
> frequently (e.g. demographic info, clinical visit history, and lab
> results). See factor (C). Also, we need to search for documents by
> specifying criteria from multiple sections. See factor (B).
>
> Currently, our fragmentation policy has each of the sections (e.g.
> demographics, visit history, and lab results) as a fragment root. I
> think this was predominantly motivated by factor (C) -- for example,
> adding a lab result (a frequent event) would merely add a fragment.
> However, I ran into trouble when using cts:and-query() to specify
> search criteria from multiple fragments (cts:and-query doesn't span
> fragments).
>
> Alternative solutions:
>   (1) If we simply stop using fragmentation, the queries work as
> desired. But isn't that a bad idea since sections in the documents
> will need frequent updating?
>   (2) If I change nothing about the fragmentation, I can run each
> sub-query independently instead of using cts:and-query(), then take
> the intersection (which may span fragments). But I'm reading that I
> should try to avoid fragmentation, so...
>   (3) Another option would be to break up the document, creating
> separate docs for each section, so if a document currently has (for
> example) demographics and lab results, it would be split into two
> documents. Directories could be used to group sets of documents by
> section type (/demographics/10291004, /lab-results/10291004). The
> cts: searches that specify demographic and lab criteria would be
> performed separately, then recombined for a final result set as
> described in (2).
>
> That seems like enough detail for now. Does (3) sound reasonable?
> Any alternative suggestions?
>
> Thanks,
> Karl
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general
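
For reference, the recombination step described in (2) and (3) of the quoted message can be sketched outside MarkLogic. This is a minimal Python illustration only, not XQuery and not MarkLogic API code. It assumes each sub-query has already returned a list of document URIs laid out as in (3) (/demographics/<id>, /lab-results/<id>) and intersects them on the trailing patient ID. The function names and every sample URI other than the 10291004 pair are made up for illustration.

```python
def patient_id(uri):
    """Return the last path segment of a URI such as /lab-results/10291004."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def intersect_by_patient(*result_sets):
    """Return the sorted patient IDs that appear in every list of URIs."""
    if not result_sets:
        return []
    # One set of patient IDs per sub-query result.
    id_sets = [{patient_id(uri) for uri in uris} for uris in result_sets]
    return sorted(set.intersection(*id_sets))


# Hypothetical sub-query results, one list per section directory.
demographics_hits = ["/demographics/10291004", "/demographics/10291007"]
lab_result_hits = ["/lab-results/10291004", "/lab-results/10291009"]

# Only patients matched by both sub-queries survive the intersection.
matching_patients = intersect_by_patient(demographics_hits, lab_result_hits)
print(matching_patients)
```

The trade-off in a sketch like this is that the intersection happens in application code rather than inside a single cts:and-query(), so each sub-query's full result has to be collected before the sets can be combined.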
