Re: [MarkLogic Dev General] Constrained searches across multiple element names

Michael Blakeley Wed, 28 Aug 2013 08:22:21 -0700

The MarkLogic implementation of XPath scales very well, as long as you avoid 
unsearchable steps. The problem is that '//*' is unsearchable. As is often the 
case, you can get partway to your goal with the right query, but for optimal 
performance you might want to change the XML too.

Without changing the XML, use a query that enumerates all the possible element 
QNames. This may seem tedious, but attribute values are indexed as 
"element-QName + attribute-QName + value". So if a query term wants to look up 
an attribute value in the index, it also needs the element QName.

    //(dog|cat|rat|...)[@id = $values]

That should scale pretty well. Or it might be nicer to express that as a 
composable cts:query term.

    cts:element-attribute-value-query(
      for $i in ('dog', 'cat', 'rat', ...) return QName($my-ns, $i),
      QName('', 'id'), $values)

You can encapsulate that into a function for convenience, and use the output 
with cts:search or search:resolve. You can also encapsulate the QName list in a 
variable, which might save a few millis if you use it more than once per query. 
Either way, that will scale much better than '//*' will.
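
A minimal sketch of that encapsulation, assuming the namespace URI "blah" from 
the sample documents in the original question and only the three element names 
used in this thread. The helper name local:id-query and the variable names are 
hypothetical, and the name list would need to be extended to match the real data.

```xquery
xquery version "1.0-ml";

(: Assumption: "blah" is the namespace URI from the sample documents. :)
declare variable $MY-NS as xs:string := "blah";

(: Assumption: only the three names from this thread; extend as needed.
   Building the QName list once lets every query term reuse it. :)
declare variable $ID-ELEMENT-QNAMES as xs:QName+ :=
  for $name in ("dog", "cat", "rat")
  return fn:QName($MY-NS, $name);

(: Hypothetical helper: one composable query term that matches the
   unqualified @id attribute on any of the listed elements. :)
declare function local:id-query($values as xs:string*) as cts:query
{
  cts:element-attribute-value-query(
    $ID-ELEMENT-QNAMES,
    fn:QName("", "id"),
    $values)
};

(: Usage: return the documents whose listed elements carry id="aaa999". :)
cts:search(fn:doc(), local:id-query("aaa999"))
```

Because the function returns a cts:query, it can be combined with other terms 
through cts:and-query or cts:or-query before being passed to cts:search.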

To improve the performance even more you would want to put all those id 
attributes into a range index. That avoids disk reads to cache term lists, 
because the data is already in memory.

    cts:element-attribute-range-query(
      for $i in ('dog', 'cat', 'rat', ...) return QName($my-ns, $i),
      QName('', 'id'), '=', $values)
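
As a rough sketch, those element-attribute range indexes could be created with 
the Admin API instead of by hand. The database name "Documents", the namespace 
URI "blah", the three element names, and the choice of the codepoint collation 
are all assumptions for illustration, so substitute the real values.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumptions: database name, namespace URI, and element names are
   placeholders taken from this thread, not from a real configuration. :)
let $db := xdmp:database("Documents")
let $indexes :=
  for $name in ("dog", "cat", "rat")
  return admin:database-range-element-attribute-index(
    "string",                                   (: scalar type :)
    "blah", $name,                              (: parent element namespace, local name :)
    "", "id",                                   (: attribute namespace, local name :)
    "http://marklogic.com/collation/codepoint", (: collation :)
    fn:false())                                 (: range value positions :)
let $config :=
  admin:database-add-range-element-attribute-index(
    admin:get-configuration(), $db, $indexes)
return admin:save-configuration($config)
```

If the indexes use a collation other than the default for the query context, 
the range query needs a matching option such as 
"collation=http://marklogic.com/collation/codepoint" as its final argument.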

But that makes N range index lookups, one per element QName. And we have to 
create all N of those element-attribute range indexes. It would be even more 
efficient if we could turn that into a single lookup. With ML6 we could create a 
path range index on //(dog|cat|rat|...)/@id, using the right namespace. Since 
this is an id, I would use the string type with the unicode codepoint collation.

    cts:path-range-query("//(dog|cat|rat|...)/@id", "=", $values)

We still have to maintain the list of QNames in that XPath expression, but we 
can write a library module that enumerates all the path range indexes we use. 
So this is pretty good, and probably close to optimal.

Would a field range index help? Probably not, because fields index element text, 
not attributes. But I mentioned that changing the XML might help. If you can 
replace the @id nodes with id elements, then fields would work. But they would 
probably be overkill, because the problem becomes simpler with an element.

    //myns:id[. = $values]/..

Note that '..' is unsearchable, but that's fine because we've done all the 
index lookups we need before we get to that step. Or use it as a cts:query term:

    cts:element-value-query(QName($my-ns, 'id'), $values)

Or create an element range index and query that:

    cts:element-range-query(QName($my-ns, 'id'), '=', $values)
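
Putting the element-based variant together might look like the sketch below. 
It assumes the documents have been restructured so that each id is a child 
element in the "blah" namespace from the original question; the prefix myns 
and the sample values are illustrative only.

```xquery
xquery version "1.0-ml";

(: Assumption: ids are now <id> child elements in the "blah" namespace. :)
declare namespace myns = "blah";

let $values := ("abc123", "aaa999")
let $query := cts:element-value-query(fn:QName("blah", "id"), $values)
(: The index lookup selects matching documents first; the XPath step then
   walks from each matching id element up to its parent. :)
for $doc in cts:search(fn:doc(), $query)
return $doc//myns:id[. = $values]/..
```

No list of element QNames is needed here, since the query term depends only on 
the id element itself.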

No doubt there are some use-cases where an id element wouldn't work: where the 
parent needs to have a simple value, for example. But it is an option worth 
considering.

-- Mike

On 28 Aug 2013, at 07:08 , Iain Tatch <[email protected]> wrote:

> Hi Asit
> 
> Thank you for your reply.
> 
> While that search will certainly work, I neglected to mention that our 
> document store currently holds more than 50 million documents (and is 
> increasing daily), and we need to perform the search in a timely fashion 
> (preferably within 1 second). Searching on XPath, especially using the // 
> notation to search all elements of all documents, doesn't scale for that 
> sort of environment.
> 
> So we're really looking for a solution that uses the cts:query logic so 
> that we can leverage MarkLogic's indexes -- we've used this extremely 
> successfully up to now for other queries, but haven't yet had to deal 
> with a single user constraint being required to search the same 
> attribute across multiple elements.
> 
> 
> Thanks.
> Iain
> 
> Senior software engineer
> BBC FM Publishing Services Editorial Metadata
> 
> 
> 
> On 28/08/13 14:56, [email protected] wrote:
>> Hi Iain,
>> 
>> You can keep all documents in a collection and then use a query like the 
>> one below:
>> 
>> let $id := "abc123"
>> return
>> cts:search(fn:collection("MyCollection")//*[@id[.=$id]],())
>> 
>> This query will give you the particular document which contains id="abc123".
>> 
>> Similarly, you can combine conditions:
>> 
>> let $id := "abc123"
>> let $alt-id := "zyx987"
>> return
>> cts:search(fn:collection("MyCollection")//*[@id[.=$id]][@alt-id[.=$alt-id]],())
>> 
>> This query will give you the particular document which contains id="abc123" 
>> and alt-id="zyx987".
>> 
>> Hope it will help you.
>> 
>> Regards,
>> Asit Nautiyal
>> 
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Iain Tatch
>> Sent: Wednesday, August 28, 2013 6:58 PM
>> To: General MarkLogic Developer Discussion
>> Subject: [MarkLogic Dev General] Constrained searches across multiple 
>> element names
>> 
>> Hello all
>> 
>> We have the following (simplified) example documents:
>> 
>> <cat id="abc123" alt-id="zyx987" xmlns="blah"> .. </cat>
>> 
>> <dog id="aaa999" alt-id="bbb888" xmlns="blah"> .. </dog>
>> 
>> <elephant id="xxxxxx" alt-id="yyyyyy" xmlns="blah"> .. </elephant>
>> 
>> In other words, our documents are all in the same namespace, they all have 
>> attributes @id and @alt-id, but the root node might be many different types 
>> (I've specified 'cat', 'dog', 'elephant' here but in reality there are not 
>> just 3, there are potentially dozens).
>> 
>> We'd like to give our users the ability to search this data with queries 
>> such as
>> 
>> id:aaa999
>> 
>> or
>> 
>> alt-id:zyx987
>> 
>> but I'm having trouble composing a valid set of Search constraint options to 
>> achieve this.  As far as I can tell, for both value constraints and word 
>> constraints I'd need to specify an element name as well as the attribute 
>> name, and obviously in this case there could be many different element names.
>> 
>> 
>> Any suggestions gratefully received!
>> 
>> 
>> TIA
>> Iain Tatch
>> Senior software engineer
>> BBC FM Publishing Services Editorial Metadata
>> 
>> _______________________________________________
>> General mailing list
>> [email protected]
>> http://developer.marklogic.com/mailman/listinfo/general