In general, the goal is to organize your documents so that they make sense as standalone units for queries and updates. Based on those statistics, it looks to me like you're in a pretty good place for that, already.
If your documents were 20x larger, and you had a strict separation between 'doc' and 'cite' content, then fragmentation might make sense. From a strict data throughput point of view, the ideal would be if every fragment were exactly the size of a disk I/O (which might be 32-256 kB). You are in that ballpark: your average document is smaller than an I/O, and your largest documents aren't all that large. So I would not fragment those documents. For the facets, I would stick with the multiple range indexes strategy.

-- Mike

On 18 Oct 2011, at 18:47 , Will Thompson wrote:

> Thanks Mike, this is very helpful.
>
> We store our XML per chapter, but fragment on /doc (typically 1st level
> headings under chapter) in ML. I ran some numbers -- These are the values for
> the number of characters in a /doc:
>
> <stats xmlns:xml="http://www.w3.org/XML/1998/namespace">
>   <median>1720</median>
>   <min>91</min>
>   <max>158093</max>
>   <avg>4208.54800490263</avg>
> </stats>
>
> And these are the values for number of cites per /doc:
>
> <stats xmlns:xml="http://www.w3.org/XML/1998/namespace">
>   <median>4</median>
>   <min>0</min>
>   <max>531</max>
>   <avg>12.9925098733488</avg>
> </stats>
>
> A typical citation is about 150 characters of data or less; 250 if you
> include metadata. Since we have a relatively small database, I assumed it
> would be most performant to include a fragment root for /cite, even if it
> bloats up our index.
>
> -Will
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Michael Blakeley
> Sent: Tuesday, October 18, 2011 8:06 PM
> To: General MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] How to get different facet counts for
> different searchable-expression in Search API
>
> Yes, the whole purpose of fragment rules is to create more fragments. It
> isn't necessarily a win for accurate estimate: accuracy could also get worse,
> depending on your use cases.
>
> But if the fragment counts didn't change, then the fragment root probably had
> an error in its specification. Wrong namespace, maybe? I wouldn't worry about
> it: from what I'm reading in this thread, I don't think you'd benefit from
> fragmenting these documents. If you do want to review it more thoroughly,
> though, I'd start by asking for content statistics. What are the min, max,
> and median sizes of your documents, and how much of that size is in cite
> elements? What are the min, max, and median counts of cite elements per
> document?
>
> -- Mike
>
> On 18 Oct 2011, at 17:53 , Will Thompson wrote:
>
>> Ah, that makes sense. I am already dynamically building the options node
>> based on some querystring params, so it should be pretty straightforward to
>> build this in based on the results of a string search. I doubt it would
>> generate false positives.
>>
>> Also, when I was testing, I did create a fragment root on 'cite', but the
>> fragment counts are the same regardless. Should they be different
>> with/without? I thought having the fragment root would just improve the
>> accuracy of xdmp:estimate().
>>
>> -W
>>
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Michael
>> Blakeley
>> Sent: Tuesday, October 18, 2011 7:31 PM
>> To: General MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] How to get different facet counts for
>> different searchable-expression in Search API
>>
>> Will, if I can jump in.... I think your idea of using different QNames is
>> the right way to look at it.
>>
>> Facets are built from range indexes, and range indexes contain lists of
>> values and fragment ids for a given QName. So if the query matches the
>> fragment, the facet will show all the values in that fragment. In your case
>> the fragment is the entire document, so you will see all the values in the
>> matching documents, whether they occur under /doc or under /doc//cite.
>> Now, you *could* create a fragment root on 'cite', but I think that would be
>> counter-productive. It's better to use different QNames and have different
>> range indexes.
>>
>> So I think what you'd want to do is simply arrange for a different set of
>> search options for doc vs cite, including both searchable expression and
>> constraints. Testing for that could be as simple as a call to
>> cts:contains($user-search, 'select:cite') before you call search:search().
>> Or if that might generate false positives, you could search:parse the user
>> query and then look at the cts:query XML to see whether or not the parser
>> found a select:cite term. If it did, then you can switch to the correct
>> options before calling search:resolve.
>>
>> -- Mike
>>
>> On 18 Oct 2011, at 17:14 , Will Thompson wrote:
>>
>>> Micah,
>>>
>>> I think I may have explained poorly. This is essentially what I'm doing --
>>> Docs are, generally, like this:
>>>
>>> <doc>
>>>   <search-meta/>
>>>   <p>...<cite><search-meta/></cite>...</p>
>>>   <section>
>>>     <p>...<cite><search-meta/></cite>...</p>
>>>     ...
>>>   </section>
>>> </doc>
>>>
>>> Searches operate over //doc by default, but if you add the operator/state
>>> "select:cite" it changes the searchable expression to //cite. The results
>>> are correct, but the problem is that the facet counts appear to be for
>>> *both* doc and cite metadata, and thus do not change when toggling
>>> searchable-expressions via operator/state.
>>>
>>> This won't make any sense to our users, who will expect the facet counts to
>>> match what they think they're searching for.
>>>
>>> -W
>>>
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Micah Dubinko
>>> Sent: Tuesday, October 18, 2011 6:56 PM
>>> To: General MarkLogic Developer Discussion
>>> Subject: Re: [MarkLogic Dev General] How to get different facet counts for
>>> different searchable-expression in Search API
>>>
>>> Hi Will,
>>>
>>> Everything you want to search exists in document fragments (not properties)
>>> right?
>>>
>>> What would happen if you switched in a different searchable-expression via
>>> operator and state? The combined query is taken into account by faceting,
>>> but the searchable-expression is not.
>>>
>>> -m
>>>
>>>
>>> On Oct 18, 2011, at 4:42 PM, Will Thompson wrote:
>>>
>>>> Our app has typically searched only document-type elements, but I recently
>>>> added metadata to citation elements (contained within and scattered about
>>>> document elements) so that they can be optionally searched using an
>>>> operator. i.e.: "term1 term2 select:citations" The operator changes the
>>>> searchable-expression and transform-results to search only citation
>>>> elements and return citation-specific snippets.
>>>>
>>>> However, I need the facet counts to reflect the search being performed -
>>>> i.e.: only show estimates for document element direct-child metadata
>>>> during normal search, and only for citations when that is toggled using
>>>> the operator.
>>>>
>>>> My first thought was to use different names or namespace for the citation
>>>> metadata and have the operator toggle a separate set of constraints
>>>> associated with those names. But constraints are not supported children of
>>>> search:state under search:operator.
>>>>
>>>> Any ideas on how to accomplish this with Search API?
>>>>
>>>> Thanks!
>>>>
>>>> -Will
>>>>
>>>> _______________________________________________
>>>> General mailing list
>>>> [email protected]
>>>> http://developer.marklogic.com/mailman/listinfo/general
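A minimal sketch of the options-switching approach Mike describes in this thread: choose a different options node when the user query contains the select:cite operator, then run the search. It is an illustration under stated assumptions, not code from the thread. The function local:pick-options and the external variables $user-search, $doc-options and $cite-options are hypothetical names. Each options node is assumed to carry its own searchable-expression and range constraints, and to still declare the select operator so that the parser consumes the token. fn:contains is used here as a plain string test in place of the cts:contains call mentioned in the thread, so it can produce the false positives Mike warns about.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Hypothetical inputs: the raw user query and two prebuilt options nodes,
   one tuned for //doc searches and one for //cite searches. :)
declare variable $user-search as xs:string external;
declare variable $doc-options as element(search:options) external;
declare variable $cite-options as element(search:options) external;

(: Hypothetical helper: return the cite options when the query text
   mentions the select:cite operator, otherwise the doc options.
   This is a plain substring test, so a quoted phrase containing
   "select:cite" would also match. :)
declare function local:pick-options(
  $qtext as xs:string,
  $doc as element(search:options),
  $cite as element(search:options)
) as element(search:options)
{
  if (fn:contains($qtext, "select:cite")) then $cite else $doc
};

(: Facets come from the constraints in whichever options node is chosen,
   so the counts follow the doc-only or cite-only range indexes. :)
search:search(
  $user-search,
  local:pick-options($user-search, $doc-options, $cite-options)
)
```

If false positives matter, the alternative from the thread is to call search:parse first, inspect the resulting cts:query XML for the select:cite term, and then pass the parsed query to search:resolve with the chosen options.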
