Re: [MarkLogic Dev General] How to get different facet counts for different searchable-expression in Search API

Michael Blakeley Thu, 10 Nov 2011 09:27:31 -0800

I wouldn't jump into setting fragment roots. Fragment rules are a mistake in at 
least 80% of applications.


If you fragment your documents, you won't be able to search them as documents 
very easily. Any query that crosses fragment boundaries has to be implemented 
as some sort of join, and the server doesn't do much of anything for you in 
those cases. So if you have 'head' and 'body', like xhtml, and you fragment on 
'head'... now the search API can't help you with search constraints that 
check both head and body.

Aside from that, adding fragments increases both memory utilization and disk 
utilization. But usually the effect on queries and indexing is paramount.

When should you use fragments? Mostly for large documents that cannot be broken 
apart for intrinsic reasons. Large means that the typical tree size is larger 
than the system's on-die cache size. "Cannot be broken apart" means "cannot", 
not "that would resemble work".

For example, an RDBMS will often export a table as a giant document with 
row-oriented child elements. Don't fragment that. Break it up. This is a 
document-oriented environment, so map each row to a document.
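
A minimal sketch of that row-to-document mapping, assuming the export was 
loaded as a single document at /export/table.xml with the shape 
<table><row>...</row></table>. The URIs and element names here are 
illustrative only, and for a truly huge export it is probably better to 
split the file before loading rather than after.

```xquery
xquery version "1.0-ml";

(: Assumed input: one exported document at /export/table.xml shaped like
   <table><row>...</row><row>...</row></table>. All names are illustrative. :)
for $row at $i in fn:doc("/export/table.xml")/table/row
(: Give each row its own URI so that it becomes its own document. :)
let $uri := fn:concat("/table/row-", $i, ".xml")
return xdmp:document-insert($uri, $row)
```

Each row is then searched and counted as a document in its own right, with 
no fragment rules involved.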

Books might seem like a good candidate for fragmentation, but not always. Often 
it's better to represent a book as a directory, with metadata in a manifest and 
each chapter in a document. Most users will want to search at the chapter level 
or lower anyway.

Getting back to facets and searchable expressions, my answer is the same as 
below. In most cases you'll have a finite set of searchable expressions that 
interest you. So use QNames that express that. For example, you might have 
'tag' in head and 'tag' in body. Change that by using different local names 
('head-tag' vs 'body-tag') or namespaces ('h:tag' vs 'b:tag').
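
A rough sketch of what that could look like with the Search API, assuming 
string element range indexes already exist on 'head-tag' and 'body-tag' (no 
namespace, root collation). The local:options helper, the $scope switch and 
the query text are made up for illustration, so adjust them to your own 
application.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Hypothetical helper: builds options whose searchable expression and
   facet constraint both point at the same part of the document. :)
declare function local:options(
  $expression as xs:string,
  $tag-name as xs:string
) as element(search:options)
{
  <options xmlns="http://marklogic.com/appservices/search">
    <searchable-expression>{ $expression }</searchable-expression>
    <constraint name="tag">
      <!-- The type and collation must match the range index definition. -->
      <range type="xs:string" collation="http://marklogic.com/collation/">
        <element ns="" name="{ $tag-name }"/>
      </range>
    </constraint>
  </options>
};

(: Illustrative inputs; in a real application these come from the request. :)
let $qtext := "term1 term2"
let $scope := "body"
let $options :=
  if ($scope eq "head")
  then local:options("//head", "head-tag")
  else local:options("//body", "body-tag")
return search:search($qtext, $options)
```

Because each QName has its own range index, the 'tag' facet only counts 
values from the part of the document that the chosen options target.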

-- Mike

On 10 Nov 2011, at 08:25 , Murray, Gregory wrote:

> Geert,
> 
> I don't know how to set an element as a fragment root, which I assume means 
> that the element/fragment level becomes the basis for indexing, rather than 
> the document level. That sounds like exactly what I need. Which part of the 
> documentation discusses that? I'm not finding it.
> 
> When you say "big impact" do you mean a drag on performance?
> 
> Thanks,
> Greg
> 
> 
> On Nov 10, 2011, at 9:11 AM, Geert Josten wrote:
> 
>> Hi Greg,
>> 
>> To my knowledge it is like you say: facet counts are based on fragments,
>> not on search results. But the lengthy explanation by Mike (over several
>> mails) confused me a bit. I still need to reread it thoroughly.
>> 
>> One solution for sure is to cancel the difference between what is matched
>> using the searchable-expression and what is stored as a separate fragment.
>> You can do that by declaring the element that you search for as a fragment
>> root. Depending on the occurrence of that element within each document,
>> this could have a big impact, so this might not be the wisest decision.
>> Just mentioning it as a possible option.
>> 
>> Kind regards,
>> Geert
>> 
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Murray, Gregory
>> Sent: Thursday, 10 November 2011 14:45
>> To: General MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] How to get different facet counts
>> for different searchable-expression in Search API
>> 
>> I should have mentioned that I'm using 4.2-1
>> 
>> Any suggestions greatly appreciated.
>> 
>> Thanks,
>> Greg
>> 
>> On Nov 9, 2011, at 5:21 PM, Murray, Gregory wrote:
>> 
>>> I'm having a similar problem with facet counts when using
>>> <searchable-expression>. After reading this thread, I'm afraid I still
>>> don't understand how to circumvent the problem. When using
>>> <searchable-expression>, it appears that the search results are
>>> constrained to that expression whereas the facet counts are not. Is there
>>> a facet-related option to similarly constrain a facet to an XPath
>>> expression? I've seen references to the "fragment-frequency" option, but
>>> it appears to have no effect in this context.
>>> 
>>> Many thanks,
>>> Greg
>>> 
>>> Gregory Murray
>>> Digital Library Application Developer
>>> Princeton Theological Seminary
>>> 
>>> 
>>> On Oct 18, 2011, at 8:30 PM, Michael Blakeley wrote:
>>> 
>>>> Will, if I can jump in.... I think your idea of using different QNames
>>>> is the right way to look at it.
>>>> 
>>>> Facets are built from range indexes, and range indexes contain lists of
>>>> values and fragment ids for a given QName. So if the query matches the
>>>> fragment, the facet will show all the values in that fragment. In your
>>>> case the fragment is the entire document, so you will see all the values
>>>> in the matching documents, whether they occur under /doc or under
>>>> /doc//cite. Now, you *could* create a fragment root on 'cite', but I think
>>>> that would be counter-productive. It's better to use different QNames and
>>>> have different range indexes.
>>>> 
>>>> So I think what you'd want to do is simply arrange for a different set
>>>> of search options for doc vs cite, including both searchable expression
>>>> and constraints. Testing for that could be as simple as a call to
>>>> cts:contains($user-search, 'select:cite') before you call search:search().
>>>> Or if that might generate false positives, you could search:parse the user
>>>> query and then look at the cts:query XML to see whether or not the parser
>>>> found a select:cite term. If it did, then you can switch to the correct
>>>> options before calling search:resolve.
>>>> 
>>>> -- Mike
>>>> 
>>>> On 18 Oct 2011, at 17:14 , Will Thompson wrote:
>>>> 
>>>>> Micah,
>>>>> 
>>>>> I think I may have explained poorly. This is essentially what I'm
>>>>> doing -- Docs are, generally, like this:
>>>>> 
>>>>> <doc>
>>>>>   <search-meta/>
>>>>>   <p>...<cite><search-meta/></cite>...</p>
>>>>>   <section>
>>>>>     <p>...<cite><search-meta/></cite>...</p>
>>>>>     ...
>>>>>   </section>
>>>>> </doc>
>>>>> 
>>>>> Searches operate over //doc by default, but if you add the
>>>>> operator/state "select:cite" it changes the searchable expression to
>>>>> //cite. The results are correct, but the problem is that the facet counts
>>>>> appear to be for *both* doc and cite metadata, and thus do not change when
>>>>> toggling searchable-expressions via operator/state.
>>>>> 
>>>>> This won't make any sense to our users, who will expect the facet
>>>>> counts to match what they think they're searching for.
>>>>> 
>>>>> -W
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: [email protected]
>>>>> [mailto:[email protected]] On Behalf Of Micah Dubinko
>>>>> Sent: Tuesday, October 18, 2011 6:56 PM
>>>>> To: General MarkLogic Developer Discussion
>>>>> Subject: Re: [MarkLogic Dev General] How to get different facet counts
>>>>> for different searchable-expression in Search API
>>>>> 
>>>>> Hi Will,
>>>>> 
>>>>> Everything you want to search exists in document fragments (not
>>>>> properties) right?
>>>>> 
>>>>> What would happen if you switched in a different searchable-expression
>>>>> via operator and state? The combined query is taken into account by
>>>>> faceting, but the searchable-expression is not.
>>>>> 
>>>>> -m
>>>>> 
>>>>> 
>>>>> On Oct 18, 2011, at 4:42 PM, Will Thompson wrote:
>>>>> 
>>>>>> Our app has typically searched only document-type elements, but I
>>>>>> recently added metadata to citation elements (contained within and
>>>>>> scattered about document elements) so that they can be optionally searched
>>>>>> using an operator. i.e.: "term1 term2 select:citations" The operator
>>>>>> changes the searchable-expression and transform-results to search only
>>>>>> citation elements and return citation-specific snippets.
>>>>>> 
>>>>>> However, I need the facet counts to reflect the search being
>>>>>> performed - i.e.: only show estimates for document element direct-child
>>>>>> metadata during normal search, and only for citations when that is toggled
>>>>>> using the operator.
>>>>>> 
>>>>>> My first thought was to use different names or namespace for the
>>>>>> citation metadata and have the operator toggle a separate set of
>>>>>> constraints associated with those names. But constraints are not supported
>>>>>> children of search:state under search:operator.
>>>>>> 
>>>>>> Any ideas on how to accomplish this with Search API?
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> -Will
>>>>>> 
>>>>>> _______________________________________________
>>>>>> General mailing list
>>>>>> [email protected]
>>>>>> http://developer.marklogic.com/mailman/listinfo/general