Re: [MarkLogic Dev General] Improving XPath Performance on SearchResults

Steve Thu, 13 Nov 2008 07:03:36 -0800

OK, I'll try to explain a bit better what I'm trying to do.

I've got a set of documents, all in the following format:


<doc>
  <classifications>
    <classification value="234" />
    <classification value="543" />
  </classifications>
  <keywords>
    <keyword value="term1" count="23" />
    <keyword value="term2" count="43" />
  </keywords>
</doc>

Each document belongs to a set of collections. All belong to the
collection "collections/docs". This has been used to separate this
data from any other data of a different type that I might want to put
into the database. The document also belongs to collections that are
representative of their classifications. The idea of this was to
enable me to quickly filter the documents based on classification, by
choosing to execute a query against a classification. For example, in
the case of the document above, it would belong to the following
collections:

"collections/docs"
"collections/class/234"
"collections/class/543"

Now, what I want to do, is to create a summary document for a
classification. This summary document needs to take the following
format.

<summary>
  <classification value="234" />
  <keywords>
    <keyword value="term1" count="123" />
    <keyword value="term2" count="231" />
  </keywords>
</summary>

The count attribute is the sum of the count values from the documents
that exist in the classification for that term.
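
For reference, a plain, unoptimised sketch of the result I'm after might
look like the following. It assumes the collection URI scheme described
above and hard-codes classification 234 purely as an example; it is not
tuned and is only meant to pin down what the summary should contain.

```xquery
(: Sketch only: build the summary for one classification.
   Assumption: every document for classification 234 is in the
   collection "collections/class/234", as described above. :)
let $class := "234"
let $keywords :=
  fn:collection(fn:concat("collections/class/", $class))
    /doc/keywords/keyword
return
  <summary>
    <classification value="{$class}"/>
    <keywords>{
      (: one output element per distinct keyword value, with the
         count attributes of all matching keyword elements summed :)
      for $v in fn:distinct-values($keywords/@value)
      return
        <keyword value="{$v}"
                 count="{fn:sum($keywords[@value eq $v]/@count)}"/>
    }</keywords>
  </summary>
```

This walks every keyword element in the collection once per distinct
value, so it shows the intended output rather than a fast way to get it.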

I've got a stack of original documents and a lot of classifications, and
need to generate a summary for each one of them. I need to create
these summaries as quickly as possible because of the number that I need
to create. Can anyone help me out with this, or am I doing something
wrong in my approach?

Cheers



On Thu, Nov 13, 2008 at 2:49 PM, Whitby, Rob, CMG
<[EMAIL PROTECTED]> wrote:
> The query I wrote loops through all the values in the index, and
> cts:frequency returns how many fragments the value occurs in. I'm not
> clear exactly what you're trying to do, but the start of your query can
> use the index:
>
>> let $allKeywords := fn:collection()/doc/keywords/keyword/@value
>> let $distinct := fn:distinct-values($allKeywords)
>>
>> for $k in $distinct
>
> can be replaced with:
>
> for $k in cts:element-attribute-values(xs:QName("keyword"),
> xs:QName("value"))
>
>
> Hope this helps!
>
>
>
> -----Original Message-----
> From: Steve [mailto:[EMAIL PROTECTED]
> Sent: 13 November 2008 10:34
> To: Whitby, Rob, CMG
> Cc: General Mark Logic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Improving XPathPerformance
> onSearchResults
>
> I've looked at the query and tried to execute it, but I get an error
> telling me that the argument passed to cts:frequency(..) is not of type
> item(). And looking at the query, won't that return the count of
> <keyword> elements rather than the sum of the count attributes of each
> distinct keyword element?
>
> On Thu, Nov 13, 2008 at 10:22 AM, Whitby, Rob, CMG
> <[EMAIL PROTECTED]> wrote:
>> This is where an index really is useful..
>>
>> for $keyword in cts:element-attribute-values(xs:QName("keyword"),
>> xs:QName("value"))
>> let $count := cts:frequency($keyword)
>> order by $count descending
>> return <keyword value="{$keyword}" count="{$count}"/>
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: [EMAIL PROTECTED]
>> [mailto:[EMAIL PROTECTED] On Behalf Of Steve
>> Sent: 13 November 2008 10:13
>> To: Michael Blakeley
>> Cc: General Mark Logic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Improving XPathPerformance
>> onSearchResults
>>
>> Sorry if I seem to keep moving the goalposts. I've been playing with
>> some of the parameters, and experimenting with my XQuery and the best
>> performance I can get is using the following query. Unfortunately, it
>> still takes a considerable amount of time to execute. Profiling shows
>> that the XPath $nodes/@count is taking the time.
>>
>> let $allKeywords := fn:collection()/doc/keywords/keyword/@value
>> let $distinct := fn:distinct-values($allKeywords)
>>
>> for $k in $distinct
>>  return
>>    let $search := cts:element-attribute-word-query(fn:QName("",
>> "keyword"), fn:QName("", "value"), $k)
>>    let $results := cts:search(fn:collection()/doc/keywords/keyword,
>> $search)
>>    let $nodes := $results[@value eq $k]
>>    let $counts := $nodes/@count
>>       return
>>          <keyword value="{$k}" total="{fn:sum($counts)}" />
>>
>>
>> On Thu, Nov 13, 2008 at 9:21 AM, Steve <[EMAIL PROTECTED]>
>> wrote:
>>> I've been applying what I've learnt so far from this thread, but I'm
>>> having a bit of trouble getting good performance when I put it all
>>> together. The query I'm trying to execute in order to get the sum of
>>> a count of keywords is below:
>>>
>>> let $allKeywords := fn:collection()/doc/keywords/keyword/@value
>>> let $distinct := fn:distinct-values($allKeywords)
>>>
>>> for $k in $distinct
>>>  return
>>>    let $search := cts:element-attribute-word-query(fn:QName("",
>>> "keyword"), fn:QName("", "value"), $k)
>>>    let $results := cts:search(fn:collection()/doc/keywords/keyword,
>>> $search)/@count
>>>       return
>>> fn:sum($results)
>>>
>>> Running profiling on the query shows me that it's the XPath stuff I
>>> do on the search results that's holding everything up. Can anyone
>>> advise how I can improve this?
>>>
>>> Thanks
>>>
>>> On Wed, Nov 12, 2008 at 7:24 PM, Michael Blakeley
>>> <[EMAIL PROTECTED]> wrote:
>>>> To be fair, absorbing the architecture and indexing behavior of a
>>>> modern RDBMS isn't trivial either. XML content adds another
>>>> dimension, but I hope you find the performance guide at
>>>> http://developer.marklogic.com/pubs/4.0/
>>>> helpful. There are also useful bits of server architecture
>>>> discussion in the dev and admin guides.
>>>>
>>>> In the general case I wouldn't expect adding a range index to
>>>> greatly improve value query performance. The list cache is pretty
>>>> efficient at keeping frequently-used terms in memory.
>>>>
>>>> Usually the range indexes are created for applications that need
>>>> particular features: fast sorting on a node value, fast range
>>>> queries, fast access to distinct values, etc.
>>>>
>>>> -- Mike
>>>>
>>>> Whitby, Rob, CMG wrote:
>>>>>
>>>>> Wow, I didn't realise that. It will improve performance though,
>>>>> right? On a large database I assume the index of all XML elements
>>>>> and attributes can't be held in memory.
>>>>>
>>>>> Understanding how the functions relate to the indexes is probably
>>>>> one of the areas I've found hardest with MarkLogic.
>>>>>
>>>>> Thanks
>>>>> Rob
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: [EMAIL PROTECTED]
>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of
>>>>> Michael Blakeley
>>>>> Sent: 12 November 2008 16:55
>>>>> To: General Mark Logic Developer Discussion
>>>>> Cc: [EMAIL PROTECTED]
>>>>> Subject: Re: [MarkLogic Dev General] Improving XPathPerformance
>>>>> onSearchResults
>>>>>
>>>>> Actually that query does *not* require any special indexes. The
>>>>> server always indexes all XML element values and element-attribute
>>>>> values.
>>>>>
>>>>> You would only need an attribute range index for a fast "order by"
>>>>> on keyword/@value, or for a cts:element-attribute-range-query term,
>>>>> or for cts:element-attribute-values() and its associated functions.
>>>>>
>>>>> -- Mike
>>>>>
>>>>> Whitby, Rob, CMG wrote:
>>>>>>
>>>>>> If you put an attribute range index on keyword/@value you can do
>>>>>> something like this:
>>>>>>
>>>>>> cts:search(
>>>>>>  /doc/classifications/classification,
>>>>>>  cts:element-attribute-value-query(xs:QName("keyword"),
>>>>>> xs:QName("value"), "something")
>>>>>> )
>>>>>>
>>>>>> (untested!)
>>>>>>
>>>>>> Rob
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [EMAIL PROTECTED]
>>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of
>>>>>> Steve
>>>>>> Sent: 12 November 2008 14:41
>>>>>> To: James Clippinger
>>>>>> Cc: General Mark Logic Developer Discussion
>>>>>> Subject: Re: [MarkLogic Dev General] Improving XPath Performance
>>>>>> onSearchResults
>>>>>>
>>>>>> I should probably add that I'm trying to extract all
>>>>>> classification values for the documents that have a specific
>>>>>> keyword value.
>>>>>>
>>>>>> On Wed, Nov 12, 2008 at 2:40 PM, Steve <[EMAIL PROTECTED]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks for your response.
>>>>>>>
>>>>>>> I've tried your suggestion and it doesn't really help. Looking at
>>>>>>> the profiling document, I can see that it's clearly the XPath on
>>>>>>> the document results that is slowing the whole thing down. Are
>>>>>>> there any other ways that I can improve this? I've included a
>>>>>>> sample document (small), so you can see what I'm trying to achieve.
>>>>>>>
>>>>>>> <doc>
>>>>>>>  <classifications>
>>>>>>>   <classification value="123" />
>>>>>>>   <classification value="324" />
>>>>>>>  </classifications>
>>>>>>>  <keywords>
>>>>>>>   <keyword value="word1" />
>>>>>>>   <keyword value="word2" />
>>>>>>>
>>>>>>>  </keywords>
>>>>>>> </doc>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 12, 2008 at 2:24 PM, James Clippinger
>>>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>> Steve, your query is doing some heavyweight filtering for the
>>>>>>>> XPath because it is doing two steps:
>>>>>>>>
>>>>>>>> 1) Execute the cts:search(): generate a list of all documents
>>>>>>>> matching the query in relevance order.
>>>>>>>>
>>>>>>>> 2) Execute the XPath: reorder the documents into document order
>>>>>>>> and find only those with /doc/classifications/classification
>>>>>>>> elements, returning those classification elements.
>>>>>>>>
>>>>>>>> Since you are using XPath and thus returning results in document
>>>>>>>> order, you probably want to use cts:contains() in an XPath
>>>>>>>> predicate rather than cts:search().  cts:contains() in a rooted
>>>>>>>> XPath expression will use the search indexes when appropriate, so
>>>>>>>> it's as fast as the equivalent cts:search() expression.  Try this:
>>>>>>>>
>>>>>>>> let $search := cts:element-attribute-word-query(fn:QName("",
>>>>>>>> "keyword"), fn:QName("", "value"), "something")
>>>>>>>> return fn:collection()/doc[cts:contains(.,
>>>>>>>> $search)]/classifications/classification
>>>>>>>>
>>>>>>>> James
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: [EMAIL PROTECTED]
>>>>>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of
>>>>>>>>> Steve
>>>>>>>>> Sent: Wednesday, November 12, 2008 8:54 AM
>>>>>>>>> To: [email protected]
>>>>>>>>> Subject: [MarkLogic Dev General] Improving XPath Performance on
>>>>>>>>> SearchResults
>>>>>>>>>
>>>>>>>>> I've written a query which I use to search my data set and I am
>>>>>>>>> able to get the results back very quickly. However, the results
>>>>>>>>> that I get back show the complete document that the search
>>>>>>>>> matched, whereas I want a particular node returned.
>>>>>>>>> At the moment I'm doing this:
>>>>>>>>>
>>>>>>>>> let $search := cts:element-attribute-word-query(fn:QName("",
>>>>>>>>> "keyword"), fn:QName("", "value"), "something")
>>>>>>>>> let $results := cts:search(fn:collection(),
>>>>>>>>> $search)/doc/classifications/classification
>>>>>>>>>   return $results
>>>>>>>>>
>>>>>>>>> I've tried profiling this query and I've found that there is a
>>>>>>>>> big lag filtering the $results of the search using XPath.
>>>>>>>>> Is there any way, either through using a different query or
>>>>>>>>> search notation, or by indexes etc., that I can speed this up?
>>>>>>>>>
>>>>>>>>> Thanks in advance...
>>>>>>>>> _______________________________________________
>>>>>>>>> General mailing list
>>>>>>>>> [email protected]
>>>>>>>>> http://xqzone.com/mailman/listinfo/general
>>>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
