Hi, I think you'll find everything you need in Kelly's excellent answer.
;-)
Geert

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Laurens van den Oever
> Sent: Monday, 27 July 2009 18:24
> To: general
> Subject: RE: [MarkLogic Dev General] Sorting by the number of
> occurences of a paragraph
>
> Hi Geert,
>
> Thanks for your response, your input is certainly valuable.
> I'll let you know about the results.
>
> > Thirdly, you select top ten on the outside of the for-loop. If it is
> > possible to get that in the for expression of your for-loop, that
> > should speed things up much as well
>
> Is there a common pattern to do that? I need the top 10 items
> after the order by.
>
> Kind regards,
>
> Laurens van den Oever
> Xopus BV
>
> http://xopus.com <http://xopus.com/>
> +31 70 4452345
> KvK 27301795
>
> Date: Mon, 27 Jul 2009 16:26:00 +0200
> From: Geert Josten <[email protected]>
> Subject: RE: [MarkLogic Dev General] Sorting by the number of
> occurences of a paragraph
> To: General Mark Logic Developer Discussion
> <[email protected]>
> Message-ID:
> <0260356c6dfe754ba6fa48e659a14338269cae7...@helios.olympus.borgus.nl>
> Content-Type: text/plain; charset="Windows-1252"
>
> Hi Laurens,
>
> Have you looked into the cts:element-values and related
> functions? These are purely based on the MarkLogic Server
> indexes and are far quicker than calls to distinct-values.
>
> I'm not sure if it makes a difference, but you could also use
> cts:remainder instead of xdmp:estimate with a search as argument.
>
> Thirdly, you select top ten on the outside of the for-loop.
> If it is possible to get that in the for expression of your
> for-loop, that should speed things up a lot as well.
>
> Your statements about timings seem to indicate your
> performance is relying on caching within MarkLogic Server,
> but using only index-based functions makes caching unnecessary.
>
> HTH,
> Geert
>
> Drs. G.P.H. Josten
> Consultant
>
> http://www.daidalos.nl/
> Daidalos BV
> Source of Innovation
> Hoekeindsehof 1-4
> 2665 JZ Bleiswijk
> Tel.: +31 (0) 10 850 1200
> Fax: +31 (0) 10 850 1199
> KvK 27164984
>
> The information sent in or with this email message originates from
> Daidalos BV and is intended solely for the addressee. If you have
> received this message unintentionally, we ask that you delete it.
> No rights can be derived from this message.
>
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Laurens
> > van den Oever
> > Sent: Monday, 27 July 2009 16:11
> > To: [email protected]
> > Subject: [MarkLogic Dev General] Sorting by the number of occurences
> > of a paragraph
> >
> > Hi all,
> >
> > I'm pretty new to MarkLogic, so chances are that I've made some
> > trivial mistake here.
> >
> > I have roughly the following structure:
> >
> > <manual>
> >   <translation lang="..."><!-- no xml:lang due to legacy -->
> >     <!-- arbitrary nesting of other elements -->
> >     <paragraph>
> >
> > I have about 5000 manuals with on average 16 translations each,
> > bringing the total of distinct (!) paragraphs to 700000.
> > The goal is to stimulate content reuse from the authoring interface.
> > I want to show the authors about 10 paragraphs which contain a search
> > phrase and here it comes: ordered by the number of occurrences of that
> > paragraph in the collection.
> > I assume that a distinct paragraph only occurs once in a translation.
> >
> > I realize that I'm trying to achieve something close to impossible;
> > expecting fast results from a query that compares a large part of the
> > db against the whole db, but I'm amazed that I've come this far and
> > I'd like to see if I can get this to the next level.
> >
> > I started with the following query:
> >
> > (for $para in cts:search(//paragraph,
> >      cts:element-word-query(xs:QName("paragraph"), "search phrase"))
> >  let $count := xdmp:estimate(cts:search(//paragraph,
> >      cts:element-word-query(xs:QName("paragraph"), $para)))
> >  order by number($count) descending
> >  return
> >    <result count="{$count}">
> >      {$para}
> >    </result>
> > )[1 to 10]
> >
> > There are two problems with this approach:
> > 1. it is far too slow
> > 2. it returns multiple occurrences of the same content
> >
> > I've been able to improve performance with the following measures:
> > - Maximizing the number of initial search results.
> > - Refragmenting the database on <translation/> level.
> > - Making <paragraph/> the root of a field.
> > - Reducing the scope of the query to one language using a
> >   [...@lang="EN"] predicate, but that slowed things down.
> > - Simple scoring improved performance and accuracy, as relevance seems
> >   to contradict my quest for the most occurrences.
> >
> > To eliminate the multiple occurrences I've used fn:distinct-values,
> > but the downside is that it returns a string and I need the paragraph
> > element including all markup.
> > Now my new query is:
> >
> > (for $p in fn:distinct-values(
> >      cts:search(
> >        /manual/translation//paragraph,
> >        cts:field-word-query("paragraph", "search query"),
> >        ("score-simple"))[1 to 250])
> >  let $count := xdmp:estimate(
> >      cts:search(
> >        /manual/translation//paragraph,
> >        cts:field-word-query("paragraph", $p),
> >        ("score-simple")))
> >  order by number($count) descending
> >  return <result count="{$count}">{$p}</result>
> > )[1 to 10]
> >
> > This is often very fast, but can take far too long if I happen to hit
> > a batch of documents/fragments that weren't hit recently.
> >
> > Is there more I can do here?
> > Or is there a completely different approach that may yield better
> > results?
> > And how do I get mixed content results?
> >
> > Thanks for reading through all this!
> >
> > Kind regards,
> >
> > Laurens van den Oever
> > Xopus BV
> >
> > http://xopus.com <http://xopus.com/>
> > +31 70 4452345
> > KvK 27301795

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
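
For reference, the index-based approach suggested in the quoted reply (cts:element-values in place of fn:distinct-values, ordered by frequency inside the lexicon call) might look roughly like the sketch below. It assumes an xs:string element range index on paragraph, which the thread never confirms, and "search phrase" is the placeholder from the original query. The query argument of cts:element-values selects fragments, not values, and the database is fragmented at translation level, so the sketch filters the values again with fn:contains. That filter is only a substring approximation of the word query, and it means the top-10 cut has to come after the filter rather than through a "limit=N" option.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes an xs:string element range index on <paragraph>;
   that index is an assumption, not something the thread confirms. :)
let $phrase := "search phrase"
let $query := cts:element-word-query(xs:QName("paragraph"), $phrase)
return
(
  (: Values come from the range index, most frequent first. :)
  for $text in cts:element-values(
      xs:QName("paragraph"), (),
      ("frequency-order", "descending", "fragment-frequency"),
      $query)
  (: The query only selects fragments, so drop values from matching
     fragments that do not contain the phrase themselves. :)
  where fn:contains(fn:lower-case($text), fn:lower-case($phrase))
  return <result count="{cts:frequency($text)}">{$text}</result>
)[1 to 10]
```

With "fragment-frequency" and one fragment per translation, cts:frequency counts the translations a paragraph value occurs in, which matches the stated assumption that a distinct paragraph occurs only once per translation. The values are still strings, so mixed content is lost just as with fn:distinct-values; fetching one element per value, for instance with a cts:search using cts:element-value-query, would be a separate step.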
