Mike - This is great info. Thanks,
Will

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Michael Blakeley
Sent: Tuesday, October 18, 2011 9:08 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] How to get different facet counts for different searchable-expression in Search API

In general, the goal is to organize your documents so that they make sense as
standalone units for queries and updates. Based on those statistics, it looks
to me like you're in a pretty good place for that, already. If your documents
were 20x larger, and you had a strict separation between 'doc' and 'cite'
content, then fragmentation might make sense.

From a strict data throughput point of view, the ideal would be if every
fragment were exactly the size of a disk I/O (which might be 32-256 kB). You
are in that ballpark: your average document is smaller than an I/O, and your
largest documents aren't all that large. So I would not fragment those
documents. For the facets, I would stick with the multiple range indexes
strategy.

-- Mike

On 18 Oct 2011, at 18:47 , Will Thompson wrote:

> Thanks Mike, this is very helpful.
>
> We store our XML per chapter, but fragment on /doc (typically 1st level
> headings under chapter) in ML. I ran some numbers -- These are the values for
> the number of characters in a /doc:
>
> <stats xmlns:xml="http://www.w3.org/XML/1998/namespace">
> <median>1720</median>
> <min>91</min>
> <max>158093</max>
> <avg>4208.54800490263</avg>
> </stats>
>
> And these are the values for number of cites per /doc:
>
> <stats xmlns:xml="http://www.w3.org/XML/1998/namespace">
> <median>4</median>
> <min>0</min>
> <max>531</max>
> <avg>12.9925098733488</avg>
> </stats>
>
> A typical citation is about 150 characters of data or less; 250 if you
> include metadata. Since we have a relatively small database, I assumed it
> would be most performant to include a fragment root for /cite, even if it
> bloats up our index.
>
> -Will
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Michael Blakeley
> Sent: Tuesday, October 18, 2011 8:06 PM
> To: General MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] How to get different facet counts for
> different searchable-expression in Search API
>
> Yes, the whole purpose of fragment rules is to create more fragments. It
> isn't necessarily a win for accurate estimate: accuracy could also get worse,
> depending on your use cases.
>
> But if the fragment counts didn't change, then the fragment root probably had
> an error in its specification. Wrong namespace, maybe? I wouldn't worry about
> it: from what I'm reading in this thread, I don't think you'd benefit from
> fragmenting these documents. If you do want to review it more thoroughly,
> though, I'd start by asking for content statistics. What are the min, max,
> and median sizes of your documents, and how much of that size is in cite
> elements? What are the min, max, and median counts of cite elements per
> document?
>
> -- Mike
>
> On 18 Oct 2011, at 17:53 , Will Thompson wrote:
>
>> Ah, that makes sense. I am already dynamically building the options node
>> based on some querystring params, so it should be pretty straightforward to
>> build this in based on the results of a string search. I doubt it would
>> generate false positives.
>>
>> Also, when I was testing, I did create a fragment root on 'cite', but the
>> fragment counts are the same regardless. Should they be different
>> with/without? I thought having the fragment root would just improve the
>> accuracy of xdmp:estimate().
>>
>> -W
>>
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Michael
>> Blakeley
>> Sent: Tuesday, October 18, 2011 7:31 PM
>> To: General MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] How to get different facet counts for
>> different searchable-expression in Search API
>>
>> Will, if I can jump in.... I think your idea of using different QNames is
>> the right way to look at it.
>>
>> Facets are built from range indexes, and range indexes contain lists of
>> values and fragment ids for a given QName. So if the query matches the
>> fragment, the facet will show all the values in that fragment. In your case
>> the fragment is the entire document, so you will see all the values in the
>> matching documents, whether they occur under /doc or under /doc//cite. Now,
>> you *could* create a fragment root on 'cite', but I think that would be
>> counter-productive. It's better to use different QNames and have different
>> range indexes.
>>
>> So I think what you'd want to do is simply arrange for a different set of
>> search options for doc vs cite, including both searchable expression and
>> constraints. Testing for that could be as simple as a call to
>> cts:contains($user-search, 'select:cite') before you call search:search().
>> Or if that might generate false positives, you could search:parse the user
>> query and then look at the cts:query XML to see whether or not the parser
>> found a select:cite term. If it did, then you can switch to the correct
>> options before calling search:resolve.
>>
>> -- Mike
>>
>> On 18 Oct 2011, at 17:14 , Will Thompson wrote:
>>
>>> Micah,
>>>
>>> I think I may have explained poorly. This is essentially what I'm doing --
>>> Docs are, generally, like this:
>>>
>>> <doc>
>>> <search-meta/>
>>> <p>...<cite><search-meta/></cite>...</p>
>>> <section>
>>> <p>...<cite><search-meta/></cite>...</p>
>>> ...
>>> </section>
>>> </doc>
>>>
>>> Searches operate over //doc by default, but if you add the operator/state
>>> "select:cite" it changes the searchable expression to //cite. The results
>>> are correct, but the problem is that the facet counts appear to be for
>>> *both* doc and cite metadata, and thus do not change when toggling
>>> searchable-expressions via operator/state.
>>>
>>> This won't make any sense to our users, who will expect the facet counts to
>>> match what they think they're searching for.
>>>
>>> -W
>>>
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Micah Dubinko
>>> Sent: Tuesday, October 18, 2011 6:56 PM
>>> To: General MarkLogic Developer Discussion
>>> Subject: Re: [MarkLogic Dev General] How to get different facet counts for
>>> different searchable-expression in Search API
>>>
>>> Hi Will,
>>>
>>> Everything you want to search exists in document fragments (not properties)
>>> right?
>>>
>>> What would happen if you switched in a different searchable-expression via
>>> operator and state? The combined query is taken into account by faceting,
>>> but the searchable-expression is not.
>>>
>>> -m
>>>
>>>
>>> On Oct 18, 2011, at 4:42 PM, Will Thompson wrote:
>>>
>>>> Our app has typically searched only document-type elements, but I recently
>>>> added metadata to citation elements (contained within and scattered about
>>>> document elements) so that they can be optionally searched using an
>>>> operator. i.e.: "term1 term2 select:citations" The operator changes the
>>>> searchable-expression and transform-results to search only citation
>>>> elements and return citation-specific snippets.
>>>>
>>>> However, I need the facet counts to reflect the search being performed -
>>>> i.e.: only show estimates for document element direct-child metadata
>>>> during normal search, and only for citations when that is toggled using
>>>> the operator.
>>>>
>>>> My first thought was to use different names or namespace for the citation
>>>> metadata and have the operator toggle a separate set of constraints
>>>> associated with those names. But constraints are not supported children of
>>>> search:state under search:operator.
>>>>
>>>> Any ideas on how to accomplish this with Search API?
>>>>
>>>> Thanks!
>>>>
>>>> -Will
>>>>
>>>> _______________________________________________
>>>> General mailing list
>>>> [email protected]
>>>> http://developer.marklogic.com/mailman/listinfo/general
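
Mike's sizing argument at the top of the thread (fragments ideally near the size of a disk I/O, which might be 32-256 kB) can be checked against Will's character counts with a little arithmetic. The following is only a rough sketch, not anything posted in the thread: it assumes one byte per character (mostly ASCII XML) and 1 kB = 1024 bytes, and the names `classify` and `report` are made up for illustration.

```python
# Character counts per /doc, copied from Will's statistics in this thread.
DOC_CHARS = {"min": 91, "median": 1720, "avg": 4208.54800490263, "max": 158093}

# I/O size range from Mike's message (32-256 kB).
# Assumptions (not from the thread): 1 kB = 1024 bytes, 1 character = 1 byte.
IO_MIN_BYTES = 32 * 1024
IO_MAX_BYTES = 256 * 1024
BYTES_PER_CHAR = 1


def classify(chars, scale=1):
    """Compare a document size (characters times scale) with the I/O range."""
    size = chars * scale * BYTES_PER_CHAR
    if size < IO_MIN_BYTES:
        return "smaller than one I/O"
    if size <= IO_MAX_BYTES:
        return "within the I/O range"
    return "larger than one I/O"


def report(scale=1):
    """Return {statistic: classification} for every statistic at a given scale."""
    return {name: classify(chars, scale) for name, chars in DOC_CHARS.items()}


if __name__ == "__main__":
    # Scale 1 is the current content; 20x is the hypothetical Mike raises.
    for scale in (1, 20):
        print(f"scale x{scale}: {report(scale)}")
```

Under these assumptions, the current average document is well below a single I/O and only the largest document reaches the I/O range. At 20x, the median and average would fall inside the range and the largest document would exceed it, which matches the regime where Mike says fragmentation might make sense (provided 'doc' and 'cite' content are also strictly separated).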
