In general, the goal is to organize your documents so that they make sense as 
standalone units for queries and updates. Based on those statistics, you 
already look to be in a good place for that.

If your documents were 20x larger, and you had a strict separation between 
'doc' and 'cite' content, then fragmentation might make sense. From a strict 
data throughput point of view, the ideal would be if every fragment were 
exactly the size of a disk I/O (which might be 32-256 kB). You are in that 
ballpark: your average document is smaller than an I/O, and your largest 
documents aren't all that large. So I would not fragment those documents.

For the facets, I would stick with the multiple range index strategy: separate 
QNames for the doc and cite metadata, each with its own range index.
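
As a rough sketch of that strategy, and not something tested against your 
data: the query below keeps one set of options for doc and another for cite, 
each with its own searchable expression and its own range constraint, and 
picks one before calling search:search(). The element names doc-topic and 
cite-topic, the constraint name topic, and the sample query string are all 
hypothetical, and the sketch assumes a string range index (default collation) 
already exists for each element. It uses fn:contains as a plain substring test 
for the select:cite marker and strips the marker before searching, which is 
simpler than parsing the query with search:parse.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Hypothetical user query; a real app would read this from the request. :)
declare variable $user-search as xs:string := "term1 term2 select:cite";

(: Options for the default search over doc elements.
   "doc-topic" is a made-up element name with its own range index. :)
declare variable $doc-options :=
  <search:options>
    <search:searchable-expression>//doc</search:searchable-expression>
    <search:constraint name="topic">
      <search:range type="xs:string" facet="true">
        <search:element ns="" name="doc-topic"/>
      </search:range>
    </search:constraint>
  </search:options>;

(: Options for citation search. "cite-topic" is a different, made-up QName,
   so its facet values come only from the citation metadata. :)
declare variable $cite-options :=
  <search:options>
    <search:searchable-expression>//cite</search:searchable-expression>
    <search:constraint name="topic">
      <search:range type="xs:string" facet="true">
        <search:element ns="" name="cite-topic"/>
      </search:range>
    </search:constraint>
  </search:options>;

(: Plain substring test for the marker. :)
let $cite-mode := fn:contains($user-search, "select:cite")
let $options := if ($cite-mode) then $cite-options else $doc-options
(: Remove the marker so it is not treated as a search term. :)
let $qtext := fn:normalize-space(fn:replace($user-search, "select:cite", ""))
return search:search($qtext, $options)
```

Because each set of options names a different QName, the facet for a cite 
search is built only from the cite-specific range index, and the facet for a 
doc search only from the doc-specific one.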

-- Mike

On 18 Oct 2011, at 18:47 , Will Thompson wrote:

> Thanks Mike, this is very helpful.
> 
> We store our XML per chapter, but fragment on /doc (typically 1st level 
> headings under chapter) in ML. I ran some numbers -- These are the values for 
> the number of characters in a /doc:
> 
> <stats xmlns:xml="http://www.w3.org/XML/1998/namespace">
> <median>1720</median>
> <min>91</min>
> <max>158093</max>
> <avg>4208.54800490263</avg>
> </stats>
> 
> And these are the values for number of cites per /doc:
> 
> <stats xmlns:xml="http://www.w3.org/XML/1998/namespace">
> <median>4</median>
> <min>0</min>
> <max>531</max>
> <avg>12.9925098733488</avg>
> </stats>
> 
> A typical citation is about 150 characters of data or less; 250 if you 
> include metadata. Since we have a relatively small database, I assumed it 
> would be most performant to include a fragment root for /cite, even if it 
> bloats up our index.
> 
> -Will
> 
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Michael Blakeley
> Sent: Tuesday, October 18, 2011 8:06 PM
> To: General MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] How to get different facet counts for 
> different searchable-expression in Search API
> 
> Yes, the whole purpose of fragment rules is to create more fragments. That 
> isn't necessarily a win for accurate estimates: accuracy could also get 
> worse, depending on your use cases.
> 
> But if the fragment counts didn't change, then the fragment root probably had 
> an error in its specification. Wrong namespace, maybe? I wouldn't worry about 
> it: from what I'm reading in this thread, I don't think you'd benefit from 
> fragmenting these documents. If you do want to review it more thoroughly, 
> though, I'd start by asking for content statistics. What are the min, max, 
> and median sizes of your documents, and how much of that size is in cite 
> elements? What are the min, max, and median counts of cite elements per 
> document?
> 
> -- Mike
> 
> On 18 Oct 2011, at 17:53 , Will Thompson wrote:
> 
>> Ah, that makes sense. I am already dynamically building the options node 
>> based on some querystring params, so it should be pretty straightforward to 
>> build this in based on the results of a string search. I doubt it would 
>> generate false positives.
>> 
>> Also, when I was testing, I did create a fragment root on 'cite', but the 
>> fragment counts are the same regardless. Should they be different 
>> with/without? I thought having the fragment root would just improve the 
>> accuracy of xdmp:estimate().
>> 
>> -W
>> 
>> 
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Michael 
>> Blakeley
>> Sent: Tuesday, October 18, 2011 7:31 PM
>> To: General MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] How to get different facet counts for 
>> different searchable-expression in Search API
>> 
>> Will, if I can jump in.... I think your idea of using different QNames is 
>> the right way to look at it.
>> 
>> Facets are built from range indexes, and range indexes contain lists of 
>> values and fragment ids for a given QName. So if the query matches the 
>> fragment, the facet will show all the values in that fragment. In your case 
>> the fragment is the entire document, so you will see all the values in the 
>> matching documents, whether they occur under /doc or under /doc//cite. Now, 
>> you *could* create a fragment root on 'cite', but I think that would be 
>> counter-productive. It's better to use different QNames and have different 
>> range indexes.
>> 
>> So I think what you'd want to do is simply arrange for a different set of 
>> search options for doc vs cite, including both searchable expression and 
>> constraints. Testing for that could be as simple as a call to 
>> cts:contains($user-search, 'select:cite') before you call search:search(). 
>> Or if that might generate false positives, you could search:parse the user 
>> query and then look at the cts:query XML to see whether or not the parser 
>> found a select:cite term. If it did, then you can switch to the correct 
>> options before calling search:resolve.
>> 
>> -- Mike
>> 
>> On 18 Oct 2011, at 17:14 , Will Thompson wrote:
>> 
>>> Micah,
>>> 
>>> I think I may have explained poorly. This is essentially what I'm doing -- 
>>> Docs are, generally, like this:
>>> 
>>> <doc>
>>> <search-meta/>
>>> <p>...<cite><search-meta/></cite>...</p>
>>> <section>
>>> <p>...<cite><search-meta/></cite>...</p>
>>> ...
>>> </section>
>>> </doc>
>>> 
>>> Searches operate over //doc by default, but if you add the operator/state 
>>> "select:cite" it changes the searchable expression to //cite. The results 
>>> are correct, but the problem is that the facet counts appear to be for 
>>> *both* doc and cite metadata, and thus do not change when toggling 
>>> searchable-expressions via operator/state.
>>> 
>>> This won't make any sense to our users, who will expect the facet counts to 
>>> match what they think they're searching for.
>>> 
>>> -W
>>> 
>>> 
>>> -----Original Message-----
>>> From: [email protected] 
>>> [mailto:[email protected]] On Behalf Of Micah Dubinko
>>> Sent: Tuesday, October 18, 2011 6:56 PM
>>> To: General MarkLogic Developer Discussion
>>> Subject: Re: [MarkLogic Dev General] How to get different facet counts for 
>>> different searchable-expression in Search API
>>> 
>>> Hi Will,
>>> 
>>> Everything you want to search exists in document fragments (not properties), 
>>> right?
>>> 
>>> What would happen if you switched in a different searchable-expression via 
>>> operator and state? The combined query is taken into account by faceting, 
>>> but the searchable-expression is not.
>>> 
>>> -m
>>> 
>>> 
>>> On Oct 18, 2011, at 4:42 PM, Will Thompson wrote:
>>> 
>>>> Our app has typically searched only document-type elements, but I recently 
>>>> added metadata to citation elements (contained within and scattered about 
>>>> document elements) so that they can be optionally searched using an 
>>>> operator, e.g. "term1 term2 select:citations". The operator changes the 
>>>> searchable-expression and transform-results to search only citation 
>>>> elements and return citation-specific snippets.
>>>> 
>>>> However, I need the facet counts to reflect the search being performed - 
>>>> i.e.: only show estimates for document element direct-child metadata 
>>>> during normal search, and only for citations when that is toggled using 
>>>> the operator. 
>>>> 
>>>> My first thought was to use different names or namespace for the citation 
>>>> metadata and have the operator toggle a separate set of constraints 
>>>> associated with those names. But constraints are not supported children of 
>>>> search:state under search:operator.
>>>> 
>>>> Any ideas on how to accomplish this with Search API? 
>>>> 
>>>> Thanks!
>>>> 
>>>> -Will
>>>> 
>>>> _______________________________________________
>>>> General mailing list
>>>> [email protected]
>>>> http://developer.marklogic.com/mailman/listinfo/general
>>> 
>> 
> 

