RE: [MarkLogic Dev General] Sorting by the number of occurences of a paragraph

Geert Josten Mon, 27 Jul 2009 11:46:07 -0700

Hi,

I think you'll find everything you need in Kelly's excellent answer. ;-)

Geert
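
For readers of the archive, here is a minimal, untested sketch of the
index-based approach suggested in the quoted message below. It is not taken
from Kelly's answer, which is not reproduced here. It assumes that a string
element range index is configured on <paragraph> (the thread does not say
so) and that the index holds the string value of each paragraph, which may
not be the case for paragraphs with mixed content, depending on the server
version. The variable names are illustrative only.

```xquery
xquery version "1.0-ml";

(: Assumption: a string element range index exists on <paragraph>.
   The thread does not say that one is configured. :)
let $phrase := "search phrase"
(: Restrict the lexicon lookup to fragments with a matching paragraph. :)
let $query := cts:element-word-query(xs:QName("paragraph"), $phrase)
return
  (for $value in cts:element-values(
                   xs:QName("paragraph"), (),
                   ("frequency-order", "descending", "fragment-frequency"),
                   $query)
   (: A matching fragment can hold other paragraphs too, so keep only
      values containing the phrase. This is a plain substring test,
      cruder than a word query. :)
   where fn:contains(fn:lower-case($value), fn:lower-case($phrase))
   return <result count="{cts:frequency($value)}">{$value}</result>
  )[1 to 10]
```

Because the values already come back from the index ordered by frequency,
the loop needs no order by, and the [1 to 10] only trims the filtered list.
The counts are fragment counts, so they match the number of occurrences only
if the database is fragmented on <translation/> and a paragraph occurs at
most once per translation, as the original question assumes. The values are
strings, so the paragraph markup would still have to be fetched separately.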

> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Laurens van den Oever
> Sent: Monday, 27 July 2009 18:24
> To: general
> Subject: RE: [MarkLogic Dev General] Sorting by the number of 
> occurences of a paragraph
> 
> Hi Geert,
> 
> Thanks for your response, your input is certainly valuable. 
> I'll let you know about the results.
> 
> > Thirdly, you select top ten on the outside of the for-loop. If it is
> > possible to get that in the for expression of your for-loop, that
> > should speed things up much as well
> 
> Is there a common pattern to do that? I need the top 10 items 
> after the order by.
> 
> Kind regards,
> 
> 
> Laurens van den Oever
> Xopus BV
> 
> 
> http://xopus.com/
> +31 70 4452345
> KvK 27301795
> 
> Date: Mon, 27 Jul 2009 16:26:00 +0200
> From: Geert Josten <[email protected]>
> Subject: RE: [MarkLogic Dev General] Sorting by the number of
>        occurences of a paragraph
> To: General Mark Logic Developer Discussion
>        <[email protected]>
> Message-ID:
>        <0260356c6dfe754ba6fa48e659a14338269cae7...@helios.olympus.borgus.nl>
> Content-Type: text/plain; charset="Windows-1252"
> 
> 
> Hi Laurens,
> 
> Have you looked into cts:element-values and related
> functions? They are resolved purely from the MarkLogic Server
> indexes and are far quicker than calls to distinct-values.
> 
> I'm not sure whether it makes a difference, but you could also use
> cts:remainder instead of xdmp:estimate with a search as argument.
> 
> Thirdly, you select top ten on the outside of the for-loop. 
> If it is possible to get that in the for expression of your 
> for-loop, that should speed things up much as well.
> 
> Your statements about timings seem to indicate that your
> performance relies on caching within MarkLogic Server,
> but using only index-based functions makes caching unnecessary.
> 
> HTH,
> Geert
> 
> >
> 
> 
> Drs. G.P.H. Josten
> Consultant
> 
> 
> http://www.daidalos.nl/
> Daidalos BV
> Source of Innovation
> Hoekeindsehof 1-4
> 2665 JZ Bleiswijk
> Tel.: +31 (0) 10 850 1200
> Fax: +31 (0) 10 850 1199
> KvK 27164984
> The information sent in or with this email message originates
> from Daidalos BV and is intended solely for the addressee.
> If you have received this message unintentionally, we ask you
> to delete it. No rights can be derived from this message.
> 
> 
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Laurens
> > van den Oever
> > Sent: Monday, 27 July 2009 16:11
> > To: [email protected]
> > Subject: [MarkLogic Dev General] Sorting by the number of occurences
> > of a paragraph
> >
> > Hi all,
> >
> > I'm pretty new to MarkLogic, so chances are that I've made some 
> > trivial mistake here.
> >
> >
> > I have roughly the following structure:
> >
> > <manual>
> >   <translation lang="..."><!-- no xml:lang due to legacy -->
> >
> >     <!-- arbitrary nesting of other elements -->
> >       <paragraph>
> >
> > I have about 5000 manuals with on average 16 translations each,
> > bringing the total number of distinct (!) paragraphs to 700000.
> > The goal is to stimulate content reuse from the authoring interface.
> > I want to show the authors about 10 paragraphs which contain a search
> > phrase and, here it comes, ordered by the number of occurrences of that
> > paragraph in the collection.
> > I assume that a distinct paragraph only occurs once in a translation.
> >
> > I realize that I'm trying to achieve something close to impossible:
> > expecting fast results from a query that compares a large part of the
> > db against the whole db. But I'm amazed that I've come this far and
> > I'd like to see if I can get this to the next level.
> >
> > I started with the following query:
> >
> >  (for $para in cts:search(//paragraph,
> >       cts:element-word-query(xs:QName("paragraph"), "search phrase"))
> >   let $count := xdmp:estimate(cts:search(//paragraph,
> >       cts:element-word-query(xs:QName("paragraph"), $para)))
> >   order by number($count) descending
> >   return
> >   <result count="{$count}">
> >     {$para}
> >   </result>
> >   )[1 to 10]
> >
> > There are two problems with this approach:
> > 1. it is far too slow
> > 2. it returns multiple occurrences of the same content
> >
> > I've been able to improve performance with the following measures:
> > - Capping the number of initial search results.
> > - Refragmenting the database on <translation/> level.
> > - Making <paragraph/> the root of a field.
> > - Reducing the scope of the query to one language using a
> >   [...@lang="EN"] predicate, but that slowed things down.
> > - Switching to simple scoring, which improved performance and accuracy,
> >   as relevance seems to contradict my quest for the most occurrences.
> >
> > To eliminate the multiple occurrences I've used fn:distinct-values,
> > but the downside is that it returns a string and I need the paragraph
> > element including all markup.
> > Now my new query is:
> >
> >  (for $p in fn:distinct-values(
> >     cts:search(
> >       /manual/translation//paragraph,
> >       cts:field-word-query("paragraph", "search query"),
> >       ("score-simple"))[1 to 250])
> >   let $count := xdmp:estimate(
> >     cts:search(
> >       /manual/translation//paragraph,
> >       cts:field-word-query("paragraph", $p),
> >       ("score-simple")))
> >   order by number($count) descending
> >   return <result count="{$count}">{$p}</result>
> > )[1 to 10]
> >
> > This is often very fast, but can take far too long if I happen to hit
> > a batch of documents/fragments that weren't hit recently.
> >
> > Is there more I can do here?
> > Or is there a completely different approach that may yield better
> > results?
> > And how do I get mixed content results?
> >
> > Thanks for reading through all this!
> >
> > Kind regards,
> >
> > Laurens van den Oever
> > Xopus BV
> >
> > http://xopus.com/
> > +31 70 4452345
> > KvK 27301795
> >
> > 
> _______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
