Re: [MarkLogic Dev General] Improving XPathPerformance onSearchResults

Steve Thu, 13 Nov 2008 08:35:01 -0800

There are about 1,200 docs in a classification, with roughly 500
keywords each.
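
For reference, here is a minimal sketch of the per-classification summary
described further down the thread. It is only an illustration under stated
assumptions: the classification id "234" is a placeholder, the collection
name follows the "collections/class/..." convention from the quoted
message, and it uses only standard XQuery functions, so no range index is
assumed. It has not been profiled and is not presented as the fast answer.

```xquery
(: Hypothetical classification id, used only for illustration. :)
let $class := "234"
(: Read every keyword element in the classification's collection once. :)
let $keywords :=
  fn:collection(fn:concat("collections/class/", $class))/doc/keywords/keyword
return
  <summary>
    <classification value="{$class}"/>
    <keywords>{
      for $k in fn:distinct-values($keywords/@value)
      (: Sum the count attributes of all keyword elements with this value. :)
      let $total := fn:sum($keywords[@value eq $k]/xs:integer(@count))
      order by $total descending
      return <keyword value="{$k}" count="{$total}"/>
    }</keywords>
  </summary>
```

Note that this rescans $keywords once per distinct value, so with 1,200
docs of about 500 keywords each it may still be slow. Getting the distinct
values from a range index, as suggested in the quoted replies, may help
with that part.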


On Thu, Nov 13, 2008 at 3:54 PM, Frank Sanders
<[EMAIL PROTECTED]> wrote:
> Steve,
>
>        About how many documents are you actually talking about?
>
> -fs
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Steve
> Sent: Thursday, November 13, 2008 10:04 AM
> To: Whitby, Rob, CMG
> Cc: General Mark Logic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Improving XPathPerformance
> onSearchResults
>
> Ok, I'll try and explain a little bit better what I'm trying to do.
>
> I've got a set of documents, all in the following format:
>
> <doc>
>  <classifications>
>    <classification value="234" />
>    <classification value="543" />
>  </classifications>
>  <keywords>
>    <keyword value="term1" count="23" />
>    <keyword value="term2" count="43" />
>  </keywords>
> </doc>
>
> Each document belongs to a set of collections. All belong to the
> collection "collections/docs". This has been used to separate this
> data from any other data of a different type that I might want to put
> into the database. The document also belongs to collections that are
> representative of their classifications. The idea of this was to
> enable me to quickly filter the documents based on classification, by
> choosing to execute a query against a classification. For example, in
> the case of the document above, it would belong to the following
> collections:
>
> "collections/docs"
> "collections/class/234"
> "collections/class/543"
>
> Now, what I want to do is create a summary document for a
> classification. This summary document needs to take the following
> format:
>
> <summary>
>  <classification value="234" />
>  <keywords>
>    <keyword value="term1" count="123" />
>    <keyword value="term2" count="231" />
>  </keywords>
> </summary>
>
> The count attribute is the sum of the count values from the documents
> that exist in the classification for that term.
>
> I've got a stack of original documents and a lot of classifications and
> need to generate a summary for each one of them. I need to create
> these summaries as quickly as possible because of the number that I need
> to create. Can anyone help me out with this, or am I doing something
> wrong in my approach?
>
> Cheers
>
>
>
> On Thu, Nov 13, 2008 at 2:49 PM, Whitby, Rob, CMG
> <[EMAIL PROTECTED]> wrote:
>> The query I wrote loops through all the values in the index, and
>> cts:frequency returns how many fragments the value occurs in. I'm not
>> clear exactly what you're trying to do, but the start of your query can
>> use the index:
>>
>>> let $allKeywords := fn:collection()/doc/keywords/keyword/@value
>>> let $distinct := fn:distinct-values($allKeywords)
>>>
>>> for $k in $distinct
>>
>> can be replaced with:
>>
>> for $k in cts:element-attribute-values(xs:QName("keyword"),
>> xs:QName("value"))
>>
>>
>> Hope this helps!
>>
>>
>>
>> -----Original Message-----
>> From: Steve [mailto:[EMAIL PROTECTED]
>> Sent: 13 November 2008 10:34
>> To: Whitby, Rob, CMG
>> Cc: General Mark Logic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Improving XPathPerformance
>> onSearchResults
>>
>> I've looked at the query and tried to execute it, but I get an error
>> telling me that the argument passed to cts:frequency(..) is not of type
>> item(). And looking at the query, won't that return the count of
>> <keyword> elements rather than the sum of the count attributes of each
>> distinct keyword element?
>>
>> On Thu, Nov 13, 2008 at 10:22 AM, Whitby, Rob, CMG
>> <[EMAIL PROTECTED]> wrote:
>>> This is where an index really is useful..
>>>
>>> for $keyword in cts:element-attribute-values(xs:QName("keyword"),
>>> xs:QName("value"))
>>> let $count := cts:frequency($keyword)
>>> order by $count descending
>>> return <keyword value="{$keyword}" count="{$count}"/>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED]
>>> [mailto:[EMAIL PROTECTED] On Behalf Of Steve
>>> Sent: 13 November 2008 10:13
>>> To: Michael Blakeley
>>> Cc: General Mark Logic Developer Discussion
>>> Subject: Re: [MarkLogic Dev General] Improving XPathPerformance
>>> onSearchResults
>>>
>>> Sorry if I seem to keep moving the goalposts. I've been playing with
>>> some of the parameters and experimenting with my XQuery, and the best
>>> performance I can get is with the following query. Unfortunately, it
>>> still takes a considerable amount of time to execute. Profiling shows
>>> that the XPath $nodes/@count is taking the time.
>>>
>>> let $allKeywords := fn:collection()/doc/keywords/keyword/@value
>>> let $distinct := fn:distinct-values($allKeywords)
>>>
>>> for $k in $distinct
>>>  return
>>>    let $search := cts:element-attribute-word-query(
>>>       fn:QName("", "keyword"), fn:QName("", "value"), $k)
>>>    let $results := cts:search(fn:collection()/doc/keywords/keyword,
>>>       $search)
>>>    let $nodes := [EMAIL PROTECTED] eq $k]
>>>    let $counts := $nodes/@count
>>>       return
>>>          <keyword value="{$k}" total="{fn:sum($counts)}" />
>>>
>>>
>>> On Thu, Nov 13, 2008 at 9:21 AM, Steve <[EMAIL PROTECTED]>
>>> wrote:
>>>> I've been applying what I've learnt so far from this thread, but I'm
>>>> having a bit of trouble getting good performance when I put it all
>>>> together. The query I'm trying to execute in order to get the sum of
>>>> a count of keywords is below:
>>>>
>>>> let $allKeywords := fn:collection()/doc/keywords/keyword/@value
>>>> let $distinct := fn:distinct-values($allKeywords)
>>>>
>>>> for $k in $distinct
>>>>  return
>>>>    let $search := cts:element-attribute-word-query(
>>>>       fn:QName("", "keyword"), fn:QName("", "value"), $k)
>>>>    let $results := cts:search(fn:collection()/doc/keywords/keyword,
>>>>       $search)/@count
>>>>    return fn:sum($results)
>>>>
>>>> Running profiling on the query shows me that it's the XPath stuff I
>>>> do on the search results that's holding everything up. Can anyone
>>>> advise how I can improve this?
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Nov 12, 2008 at 7:24 PM, Michael Blakeley
>>>> <[EMAIL PROTECTED]> wrote:
>>>>> To be fair, absorbing the architecture and indexing behavior of a
>>>>> modern RDBMS isn't trivial either. XML content adds another
>>>>> dimension, but I hope you find the performance guide at
>>>>> http://developer.marklogic.com/pubs/4.0/
>>>>> helpful. There are also useful bits of server architecture
>>>>> discussion in the dev and admin guides.
>>>>>
>>>>> In the general case I wouldn't expect adding a range index to
>>>>> greatly improve value query performance. The list cache is pretty
>>>>> efficient at keeping frequently-used terms in memory.
>>>>>
>>>>> Usually the range indexes are created for applications that need
>>>>> particular features: fast sorting on a node value, fast range
>>>>> queries, fast access to distinct values, etc.
>>>>>
>>>>> -- Mike
>>>>>
>>>>> Whitby, Rob, CMG wrote:
>>>>>>
>>>>>> Wow, I didn't realise that. It will improve performance though,
>>>>>> right? On a large database I assume the index of all XML elements
>>>>>> and attributes can't be held in memory.
>>>>>>
>>>>>> Understanding how the functions relate to the indexes is probably
>>>>>> one of the areas I've found hardest with MarkLogic.
>>>>>>
>>>>>> Thanks
>>>>>> Rob
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [EMAIL PROTECTED]
>>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of
>>>>>> Michael Blakeley
>>>>>> Sent: 12 November 2008 16:55
>>>>>> To: General Mark Logic Developer Discussion
>>>>>> Cc: [EMAIL PROTECTED]
>>>>>> Subject: Re: [MarkLogic Dev General] Improving XPathPerformance
>>>>>> onSearchResults
>>>>>>
>>>>>> Actually that query does *not* require any special indexes. The
>>>>>> server always indexes all XML element values and element-attribute
>>>>>> values.
>>>>>>
>>>>>> You would only need an attribute range index for a fast "order by"
>>>>>> on keyword/@value, or for a cts:element-attribute-range-query term,
>>>>>> or for cts:element-attribute-values() and its associated functions.
>>>>>>
>>>>>> -- Mike
>>>>>>
>>>>>> Whitby, Rob, CMG wrote:
>>>>>>>
>>>>>>> If you put an attribute range index on keyword/@value you can do
>>>>>>> something like this:
>>>>>>>
>>>>>>> cts:search(
>>>>>>>  /doc/classifications/classification,
>>>>>>>  cts:element-attribute-value-query(xs:QName("keyword"),
>>>>>>>    xs:QName("value"), "something")
>>>>>>> )
>>>>>>>
>>>>>>> (untested!)
>>>>>>>
>>>>>>> Rob
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: [EMAIL PROTECTED]
>>>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of
>>>>>>> Steve
>>>>>>> Sent: 12 November 2008 14:41
>>>>>>> To: James Clippinger
>>>>>>> Cc: General Mark Logic Developer Discussion
>>>>>>> Subject: Re: [MarkLogic Dev General] Improving XPath Performance
>>>>>>> onSearchResults
>>>>>>>
>>>>>>> I should probably add that I'm trying to extract all
>>>>>>> classification values for the documents that have a specific
>>>>>>> keyword value.
>>>>>>>
>>>>>>> On Wed, Nov 12, 2008 at 2:40 PM, Steve <[EMAIL PROTECTED]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thanks for your response.
>>>>>>>>
>>>>>>>> I've tried your suggestion and it doesn't really help. Looking at
>>>>>>>> the profiling document, I can see that it's clearly the XPath on
>>>>>>>> the document results that is slowing the whole thing down. Are
>>>>>>>> there any other ways that I can improve this? I've included a
>>>>>>>> sample document (small), so you can see what I'm trying to achieve.
>>>>>>>>
>>>>>>>> <doc>
>>>>>>>>  <classifications>
>>>>>>>>   <classification value="123" />
>>>>>>>>   <classification value="324" />
>>>>>>>>  </classifications>
>>>>>>>>  <keywords>
>>>>>>>>   <keyword value="word1" />
>>>>>>>>   <keyword value="word2" />
>>>>>>>>
>>>>>>>>  </keywords>
>>>>>>>> </doc>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 12, 2008 at 2:24 PM, James Clippinger
>>>>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>> Steve, your query is doing some heavyweight filtering for the
>>>>>>>>> XPath because it is doing two steps:
>>>>>>>>>
>>>>>>>>> 1) Execute the cts:search(): generate a list of all documents
>>>>>>>>> matching the query in relevance order.
>>>>>>>>>
>>>>>>>>> 2) Execute the XPath: reorder the documents into document order
>>>>>>>>> and find only those with /doc/classifications/classification
>>>>>>>>> elements, returning those classification elements.
>>>>>>>>>
>>>>>>>>> Since you are using XPath and thus returning results in document
>>>>>>>>> order, you probably want to use cts:contains() in an XPath
>>>>>>>>> predicate rather than cts:search(). cts:contains() in a rooted
>>>>>>>>> XPath expression will use the search indexes when appropriate, so
>>>>>>>>> it's as fast as the equivalent cts:search() expression. Try this:
>>>>>>>>>
>>>>>>>>> let $search := cts:element-attribute-word-query(
>>>>>>>>>   fn:QName("", "keyword"), fn:QName("", "value"), "something")
>>>>>>>>> return
>>>>>>>>>   fn:collection()/doc[cts:contains(., $search)]/classifications/classification
>>>>>>>>>
>>>>>>>>> James
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: [EMAIL PROTECTED]
>>>>>>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of
>>>>>>>>>> Steve
>>>>>>>>>> Sent: Wednesday, November 12, 2008 8:54 AM
>>>>>>>>>> To: [email protected]
>>>>>>>>>> Subject: [MarkLogic Dev General] Improving XPath Performance on
>>>>>>>>>> SearchResults
>>>>>>>>>>
>>>>>>>>>> I've written a query which I use to search my data set and I am
>>>>>>>>>> able to get the results back very quickly. However, the results
>>>>>>>>>> that I get back show the complete document that the search
>>>>>>>>>> matched, whereas I want a particular node returned.
>>>>>>>>>> At the moment I'm doing this:
>>>>>>>>>>
>>>>>>>>>> let $search := cts:element-attribute-word-query(
>>>>>>>>>>   fn:QName("", "keyword"), fn:QName("", "value"), "something")
>>>>>>>>>> let $results :=
>>>>>>>>>>   cts:search(fn:collection(), $search)/doc/classifications/classification
>>>>>>>>>> return $results
>>>>>>>>>>
>>>>>>>>>> I've tried profiling this query and I've found that there is a
>>>>>>>>>> big lag filtering the $results of the search using XPath.
>>>>>>>>>> Is there any way, either through using a different query or
>>>>>>>>>> search notation, or by indexes etc., that I can speed this up?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance...
>>>>>>>>>> _______________________________________________
>>>>>>>>>> General mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> http://xqzone.com/mailman/listinfo/general
>>>>>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>