There are about 1,200 docs in a classification, each with roughly 500 keywords.
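To make the target concrete, here is a minimal sketch of the per-classification summary described further down the thread. It is an illustration under assumptions, not a tested or tuned solution: it assumes a server version that provides the map:map built-ins (map:map, map:put, map:get, map:keys), that every keyword element carries an integer count attribute, and the classification value "234" is just a placeholder.

```xquery
xquery version "1.0-ml";

(: Placeholder classification value; substitute the one being summarised. :)
let $class := "234"

(: Running totals, keyed by keyword value.
   Assumes the map:map built-ins are available on this server version. :)
let $totals := map:map()

(: Single pass over every keyword element in the classification's collection.
   map:put returns the empty sequence, so $_ is only there to force evaluation. :)
let $_ :=
  for $kw in fn:collection(fn:concat("collections/class/", $class))
               /doc/keywords/keyword
  let $key := fn:string($kw/@value)
  return
    map:put($totals, $key,
            (map:get($totals, $key), 0)[1] + xs:integer($kw/@count))

return
  <summary>
    <classification value="{$class}"/>
    <keywords>{
      for $key in map:keys($totals)
      order by $key
      return <keyword value="{$key}" count="{map:get($totals, $key)}"/>
    }</keywords>
  </summary>
```

The point of the shape is that it walks the roughly 600,000 keyword elements (1,200 docs times 500 keywords) once, instead of running one search per distinct keyword. It still has to read every keyword element, so it is not a substitute for an index-based answer.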
On Thu, Nov 13, 2008 at 3:54 PM, Frank Sanders <[EMAIL PROTECTED]> wrote:
> Steve,
>
> About how many documents are you actually talking about?
>
> -fs
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Steve
> Sent: Thursday, November 13, 2008 10:04 AM
> To: Whitby, Rob, CMG
> Cc: General Mark Logic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Improving XPathPerformance
> onSearchResults
>
> Ok, I'll try and explain a little bit better what I'm trying to do.
>
> I've got a set of documents all in the following format:
>
> <doc>
>   <classifications>
>     <classification value="234" />
>     <classification value="543" />
>   </classifications>
>   <keywords>
>     <keyword value="term1" count="23" />
>     <keyword value="term2" count="43" />
>   </keywords>
> </doc>
>
> Each document belongs to a set of collections. All belong to the
> collection "collections/docs". This has been used to separate this
> data from any other data of a different type that I might want to put
> into the database. The document also belongs to collections that are
> representative of their classifications. The idea of this was to
> enable me to quickly filter the documents based on classification, by
> choosing to execute a query against a classification. For example, in
> the case of the document above, it would belong to the following
> collections:
>
> "collections/docs"
> "collections/class/234"
> "collections/class/543"
>
> Now, what I want to do, is to create a summary document for a
> classification. This summary document needs to take the following
> format.
>
> <summary>
>   <classification value="234" />
>   <keywords>
>     <keyword value="term1" count="123" />
>     <keyword value="term2" count="231" />
>   </keywords>
> </summary>
>
> The count attribute is the sum of the count values from the documents
> that exist in the classification for that term.
>
> I've got a stack of original documents and a lot of classifications and
> need to generate a summary for each one of them. I need to create
> these summaries as quickly as possible because of the amount that I need
> to create. Can anyone help me out with this, or am I doing something
> wrong in my approach?
>
> Cheers
>
> On Thu, Nov 13, 2008 at 2:49 PM, Whitby, Rob, CMG
> <[EMAIL PROTECTED]> wrote:
>> The query I wrote loops through all the values in the index, and
>> cts:frequency returns how many fragments the value occurs in. I'm not
>> clear exactly what you're trying to do, but the start of your query can
>> use the index:
>>
>>> let $allKeywords := fn:collection()/doc/keywords/keyword/@value
>>> let $distinct := fn:distinct-values($allKeywords)
>>>
>>> for $k in $distinct
>>
>> can be replaced with:
>>
>> for $k in cts:element-attribute-values(xs:QName("keyword"),
>> xs:QName("value"))
>>
>> Hope this helps!
>>
>> -----Original Message-----
>> From: Steve [mailto:[EMAIL PROTECTED]
>> Sent: 13 November 2008 10:34
>> To: Whitby, Rob, CMG
>> Cc: General Mark Logic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Improving XPathPerformance
>> onSearchResults
>>
>> I've looked at the query and tried to execute it, but I get an error
>> telling me that the argument passed to cts:frequency(..) is not of type
>> item(). And looking at the query, won't that return the count of
>> <keyword> elements rather than the sum of the count attributes of each
>> distinct keyword element?
>>
>> On Thu, Nov 13, 2008 at 10:22 AM, Whitby, Rob, CMG
>> <[EMAIL PROTECTED]> wrote:
>>> This is where an index really is useful..
>>>
>>> for $keyword in cts:element-attribute-values(xs:QName("keyword"),
>>> xs:QName("value"))
>>> let $count := cts:frequency($keyword)
>>> order by $count descending
>>> return <keyword value="{$keyword}" count="{$count}"/>
>>>
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED]
>>> [mailto:[EMAIL PROTECTED] On Behalf Of Steve
>>> Sent: 13 November 2008 10:13
>>> To: Michael Blakeley
>>> Cc: General Mark Logic Developer Discussion
>>> Subject: Re: [MarkLogic Dev General] Improving XPathPerformance
>>> onSearchResults
>>>
>>> Sorry if I seem to keep moving the goalposts. I've been playing with
>>> some of the parameters, and experimenting with my XQuery and the best
>>> performance I can get is using the following query. Unfortunately, it
>>> still takes a considerable amount of time to execute. Profiling shows
>>> that the XPath $nodes/@count is taking the time.
>>>
>>> let $allKeywords := fn:collection()/doc/keywords/keyword/@value
>>> let $distinct := fn:distinct-values($allKeywords)
>>>
>>> for $k in $distinct
>>> return
>>>   let $search := cts:element-attribute-word-query(fn:QName("",
>>>     "keyword"), fn:QName("", "value"), $k)
>>>   let $results := cts:search(fn:collection()/doc/keywords/keyword,
>>>     $search)
>>>   let $nodes := [EMAIL PROTECTED] eq $k]
>>>   let $counts := $nodes/@count
>>>   return
>>>     <keyword value="{$k}" total="{fn:sum($counts)}" />
>>>
>>> On Thu, Nov 13, 2008 at 9:21 AM, Steve <[EMAIL PROTECTED]>
>>> wrote:
>>>> I've been applying what I've learnt so far from this thread, but I'm
>>>> having a bit of trouble getting good performance when I put it all
>>>> together.
>>>> The query I'm trying to execute in order to get the sum of a
>>>> count of keywords is below:
>>>>
>>>> let $allKeywords := fn:collection()/doc/keywords/keyword/@value
>>>> let $distinct := fn:distinct-values($allKeywords)
>>>>
>>>> for $k in $distinct
>>>> return
>>>>   let $search := cts:element-attribute-word-query(fn:QName("",
>>>>     "keyword"), fn:QName("", "value"), $k)
>>>>   let $results := cts:search(fn:collection()/doc/keywords/keyword,
>>>>     $search)/@count
>>>>   return
>>>>     fn:sum($results)
>>>>
>>>> Running profiling on the query shows me that it's the XPath stuff I
>>>> do on the search results that's holding everything up. Can anyone
>>>> advise how I can improve this?
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Nov 12, 2008 at 7:24 PM, Michael Blakeley
>>>> <[EMAIL PROTECTED]> wrote:
>>>>> To be fair, absorbing the architecture and indexing behavior of a
>>>>> modern RDBMS isn't trivial either. XML content adds another
>>>>> dimension, but I hope you find the performance guide at
>>>>> http://developer.marklogic.com/pubs/4.0/
>>>>> helpful. There are also useful bits of server architecture
>>>>> discussion in the dev and admin guides.
>>>>>
>>>>> In the general case I wouldn't expect adding a range index to
>>>>> greatly improve value query performance. The list cache is pretty
>>>>> efficient at keeping frequently-used terms in memory.
>>>>>
>>>>> Usually the range indexes are created for applications that need
>>>>> particular features: fast sorting on a node value, fast range
>>>>> queries, fast access to distinct values, etc.
>>>>>
>>>>> -- Mike
>>>>>
>>>>> Whitby, Rob, CMG wrote:
>>>>>>
>>>>>> Wow, I didn't realise that. It will improve performance though
>>>>>> right? On a large database I assume the index of all XML elements
>>>>>> and attributes can't be held in memory.
>>>>>>
>>>>>> Understanding how the functions relate to the indexes is probably
>>>>>> one of the areas I've found hardest with MarkLogic.
>>>>>>
>>>>>> Thanks
>>>>>> Rob
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [EMAIL PROTECTED]
>>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of
>>>>>> Michael Blakeley
>>>>>> Sent: 12 November 2008 16:55
>>>>>> To: General Mark Logic Developer Discussion
>>>>>> Cc: [EMAIL PROTECTED]
>>>>>> Subject: Re: [MarkLogic Dev General] Improving XPathPerformance
>>>>>> onSearchResults
>>>>>>
>>>>>> Actually that query does *not* require any special indexes. The
>>>>>> server always indexes all XML element values and element-attribute
>>>>>> values.
>>>>>>
>>>>>> You would only need an attribute range index for a fast "order by"
>>>>>> on keyword/@value, or for a cts:attribute-value-range-query term,
>>>>>> or for cts:element-attribute-values() and its associated functions.
>>>>>>
>>>>>> -- Mike
>>>>>>
>>>>>> Whitby, Rob, CMG wrote:
>>>>>>>
>>>>>>> If you put an attribute range index on keyword/@value you can do
>>>>>>> something like this:
>>>>>>>
>>>>>>> cts:search(
>>>>>>>   /doc/classifications/classification,
>>>>>>>   cts:element-attribute-value-query(xs:QName("keyword"),
>>>>>>>     xs:QName("value"), "something")
>>>>>>> )
>>>>>>>
>>>>>>> (untested!)
>>>>>>>
>>>>>>> Rob
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: [EMAIL PROTECTED]
>>>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of
>>>>>>> Steve
>>>>>>> Sent: 12 November 2008 14:41
>>>>>>> To: James Clippinger
>>>>>>> Cc: General Mark Logic Developer Discussion
>>>>>>> Subject: Re: [MarkLogic Dev General] Improving XPath Performance
>>>>>>> onSearchResults
>>>>>>>
>>>>>>> I should probably add that I'm trying to extract all
>>>>>>> classification values for the documents that have a specific
>>>>>>> keyword value.
>>>>>>>
>>>>>>> On Wed, Nov 12, 2008 at 2:40 PM, Steve <[EMAIL PROTECTED]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thanks for your response.
>>>>>>>>
>>>>>>>> I've tried your suggestion and it doesn't really help.
>>>>>>>> Looking at the
>>>>>>>> profiling document, I can see that it's clearly the XPath on the
>>>>>>>> document results that is slowing the whole thing down. Are there any
>>>>>>>> other ways that I can improve this? I've included a sample
>>>>>>>> document (small), so you can see what I'm trying to achieve.
>>>>>>>>
>>>>>>>> <doc>
>>>>>>>>   <classifications>
>>>>>>>>     <classification value="123" />
>>>>>>>>     <classification value="324" />
>>>>>>>>   </classifications>
>>>>>>>>   <keywords>
>>>>>>>>     <keyword value="word1" />
>>>>>>>>     <keyword value="word2" />
>>>>>>>>
>>>>>>>>   </keywords>
>>>>>>>> </doc>
>>>>>>>>
>>>>>>>> On Wed, Nov 12, 2008 at 2:24 PM, James Clippinger
>>>>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>> Steve, your query is doing some heavyweight filtering for the
>>>>>>>>> XPath because it is doing two steps:
>>>>>>>>>
>>>>>>>>> 1) Execute the cts:search(): generate a list of all documents
>>>>>>>>> matching the query in relevance order.
>>>>>>>>>
>>>>>>>>> 2) Execute the XPath: reorder the documents into document order
>>>>>>>>> and find only those with /doc/classifications/classification
>>>>>>>>> elements, returning those classification elements.
>>>>>>>>>
>>>>>>>>> Since you are using XPath and thus returning results in document
>>>>>>>>> order, you probably want to use cts:contains() in an XPath
>>>>>>>>> predicate rather than cts:search(). cts:contains() in a rooted
>>>>>>>>> XPath expression will use the search indexes when appropriate,
>>>>>>>>> so it's as fast as the equivalent
>>>>>>>>> cts:search() expression.
>>>>>>>>> Try this:
>>>>>>>>>
>>>>>>>>> let $search := cts:element-attribute-word-query(fn:QName("",
>>>>>>>>>   "keyword"), fn:QName("", "value"), "something")
>>>>>>>>> return
>>>>>>>>>   fn:collection()/doc[cts:contains(.,
>>>>>>>>>     $search)]/classifications/classification
>>>>>>>>>
>>>>>>>>> James
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: [EMAIL PROTECTED]
>>>>>>>>>> [mailto:[EMAIL PROTECTED] On Behalf Of
>>>>>>>>>> Steve
>>>>>>>>>> Sent: Wednesday, November 12, 2008 8:54 AM
>>>>>>>>>> To: [email protected]
>>>>>>>>>> Subject: [MarkLogic Dev General] Improving XPath Performance on
>>>>>>>>>> SearchResults
>>>>>>>>>>
>>>>>>>>>> I've written a query which I use to search my data set and I am
>>>>>>>>>> able to get the results back very quickly. However the results
>>>>>>>>>> that I get back show the complete document that the search
>>>>>>>>>> matched, whereas I want a particular node returned.
>>>>>>>>>> At the moment I'm doing this:
>>>>>>>>>>
>>>>>>>>>> let $search := cts:element-attribute-word-query(fn:QName("",
>>>>>>>>>>   "keyword"), fn:QName("", "value"), "something")
>>>>>>>>>> let $results := cts:search(fn:collection(),
>>>>>>>>>>   $search)/doc/classifications/classification
>>>>>>>>>> return $results
>>>>>>>>>>
>>>>>>>>>> I've tried profiling this query and I've found that there is a
>>>>>>>>>> big lag filtering the $results of the search using XPath.
>>>>>>>>>> Is there any way, either through using a different query or
>>>>>>>>>> search notation, or by indexes etc., that I can speed this up?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance...
>>>>>>>>>> _______________________________________________
>>>>>>>>>> General mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> http://xqzone.com/mailman/listinfo/general
