[MarkLogic Dev General] Sorting by the number of occurrences of a paragraph

Laurens van den Oever Mon, 27 Jul 2009 07:11:40 -0700

Hi all,
I'm pretty new to MarkLogic, so chances are that I've made some trivial
mistake here.


I have roughly the following structure:

<manual>
  <translation lang="..."><!-- no xml:lang due to legacy -->
    <!-- arbitrary nesting of other elements -->
      <paragraph>

I have about 5000 manuals with on average 16 translations each, bringing the
total number of distinct (!) paragraphs to 700000.
The goal is to stimulate content reuse from the authoring interface.
I want to show the authors about 10 paragraphs that contain a search phrase
and, here it comes, ordered by the number of occurrences of that paragraph in
the collection.
I assume that a distinct paragraph occurs only once in a translation.

I realize that I'm trying to achieve something close to impossible:
expecting fast results from a query that compares a large part of the db
against the whole db. Still, I'm amazed that I've come this far and I'd like
to see if I can get this to the next level.

I started with the following query:

(for $para in cts:search(
    //paragraph,
    cts:element-word-query(xs:QName("paragraph"), "search phrase"))
  let $count := xdmp:estimate(
    cts:search(
      //paragraph,
      cts:element-word-query(xs:QName("paragraph"), $para)))
  order by number($count) descending
  return
    <result count="{$count}">
      {$para}
    </result>
)[1 to 10]

There are two problems with this approach:
1. it is far too slow
2. it returns multiple occurrences of the same content

I've been able to improve performance with the following measures:
- Limiting the number of initial search results.
- Refragmenting the database at the <translation/> level.
- Making <paragraph/> the root of a field.
- Switching to simple scoring, which improved both performance and accuracy,
as relevance ranking seems to work against my quest for the most occurrences.

I also tried reducing the scope of the query to one language using a
[...@lang="EN"] predicate, but that slowed things down.
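
On the language restriction, here is a minimal, untested sketch of one
alternative to the path predicate, assuming the database is fragmented on
<translation/> as described. It expresses the language as a
cts:element-attribute-value-query, so that it may be resolved from the indexes
together with the phrase, and then picks the matching paragraphs out of each
translation. Whether this is actually faster on this data is an assumption that
would need measuring.

```xquery
xquery version "1.0-ml";

(: Sketch only: restrict by language through the query, not a path predicate. :)
let $phrase := cts:element-word-query(xs:QName("paragraph"), "search phrase")
let $english :=
  cts:element-attribute-value-query(
    xs:QName("translation"), xs:QName("lang"), "EN")
(: Note: [1 to 250] limits the number of translations here, not paragraphs. :)
for $translation in
  cts:search(/manual/translation, cts:and-query(($english, $phrase)))[1 to 250]
(: A translation can hold many paragraphs; keep only those with the phrase. :)
return $translation//paragraph[cts:contains(., cts:word-query("search phrase"))]
```

The search targets <translation/> rather than <paragraph/> because the lang
attribute lives on the translation, so a paragraph on its own cannot satisfy
the attribute part of the query.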

To eliminate the multiple occurrences I've used fn:distinct-values, but the
downside is that it returns a string and I need the paragraph element
including all markup.
Now my new query is:

 (for $p in fn:distinct-values(
    cts:search(
      /manual/translation//paragraph,
      cts:field-word-query("paragraph", "search query"),
      ("score-simple"))[1 to 250])
  let $count := xdmp:estimate(
    cts:search(
      /manual/translation//paragraph,
      cts:field-word-query("paragraph", $p),
      ("score-simple")))
  order by number($count) descending
  return <result count="{$count}">{$p}</result>
)[1 to 10]

This is often very fast, but can take far too long if I happen to hit a
batch of documents/fragments that weren't hit recently.

Is there more I can do here?
Or is there a completely different approach that may yield better results?
And how do I get mixed content results?
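
For the mixed-content question, a minimal, untested sketch of one possible
direction: keep the first 250 hits as elements, de-duplicate on their string
values, and return the first element that carries each distinct text. The
variable names ($hits, $text, $para) are illustrative only, and the counting
step is the same xdmp:estimate call as in the query above.

```xquery
xquery version "1.0-ml";

(: Sketch only: hold on to the hits as elements instead of strings. :)
let $hits :=
  cts:search(
    /manual/translation//paragraph,
    cts:field-word-query("paragraph", "search query"),
    ("score-simple"))[1 to 250]
return
  (for $text in fn:distinct-values(for $h in $hits return fn:string($h))
   (: First hit whose string value equals this distinct text, markup intact. :)
   let $para := $hits[fn:string(.) eq $text][1]
   let $count := xdmp:estimate(
     cts:search(
       /manual/translation//paragraph,
       cts:field-word-query("paragraph", $text),
       ("score-simple")))
   order by $count descending
   return <result count="{$count}">{$para}</result>
  )[1 to 10]
```

Two paragraphs with the same text but different inline markup would collapse
into one result here, and only the markup of the first hit would be shown.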

Thanks for reading through all this!

Kind regards,

Laurens van den Oever
Xopus BV

http://xopus.com
+31 70 4452345
KvK 27301795

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general