[MarkLogic Dev General] RE: Sorting by the number of occurences of a paragraph

Kelly Stirman Mon, 27 Jul 2009 10:34:18 -0700

Hi Laurens,

If I follow your design correctly, what I would do is the following:


1) Iterate over all your paragraphs and use xdmp:md5() to generate a hash
   value for each one.
2) Add this hash value as an attribute to each paragraph, e.g.
   <paragraph hash-id="abc123">hello world</paragraph>
3) Create a string range index, in the codepoint collation, on the
   paragraph/@hash-id attribute.
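
Steps 1 and 2 could be sketched roughly as follows. This is only an illustration, not a tested migration script: it assumes the hash should cover the serialized paragraph including its markup (hence xdmp:quote(); hashing fn:string($p) instead would ignore markup), and it assumes the whole update fits in one transaction, which is unlikely for 700000 paragraphs, so in practice you would run it in batches.

```xquery
xquery version "1.0-ml";

(: Sketch: stamp every paragraph that has no hash-id yet with an MD5 hash
   of its serialized content. Assumption: markup is part of a paragraph's
   identity, so the node is serialized with xdmp:quote() before hashing. :)
for $p in fn:collection()//paragraph[fn:not(@hash-id)]
return
  xdmp:node-insert-child(
    $p,
    attribute hash-id { xdmp:md5(xdmp:quote($p)) })
```

Whichever input you choose for the hash, use the same one everywhere, since two paragraphs only count as the same paragraph when their hash values are equal.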

Then to return paragraphs in frequency order, you can call
cts:element-attribute-values(xs:QName("paragraph"), xs:QName("hash-id"), (),
("frequency-order", "item-frequency")). The "frequency-order" option sorts the
values by frequency, and "item-frequency" counts every occurrence rather than
every fragment. You can filter this list with any search expression by passing
a cts:query as an additional argument (see below).

This approach allows you to quickly get the hash-id in frequency order, with or 
without a cts:query. You'll then need to go get a paragraph that matches the 
hash-id. Because there may be many, you can simply grab the first.


let $q := cts:word-query("search phrase")
for $id in cts:element-attribute-values(
    xs:QName("paragraph"), xs:QName("hash-id"), (),
    ("frequency-order", "item-frequency"), $q)
return
  element result {
    attribute count { cts:frequency($id) },
    (//paragraph[@hash-id eq $id])[1]
  }

Finally, before doing any of this, I would get rid of your fragmentation. You 
probably don't need fields, but we can continue to talk about how they might be 
useful for this task. I also don't think you need to limit to a specific 
language, but that shouldn't slow things down if you want to use it (be sure to 
look over our developer guide on using languages, and your server license *may* 
come into play on this subject).

This should be very fast - well under a second as long as there aren't too many 
paragraphs being returned. Getting the hash-ids will be resolved out of the 
indexes, whereas each paragraph returned will incur a disk i/o. 100 or so 
results should be sub-second.

Kelly


Message: 4
Date: Mon, 27 Jul 2009 16:11:16 +0200
From: Laurens van den Oever <[email protected]>
Subject: [MarkLogic Dev General] Sorting by the number of occurences of a paragraph
To: [email protected]

Hi all,
I'm pretty new to MarkLogic, so chances are that I've made some trivial
mistake here.

I have roughly the following structure:

<manual>
  <translation lang="..."><!-- no xml:lang due to legacy -->
    <!-- arbitrary nesting of other elements -->
      <paragraph>

I have about 5000 manuals with on average 16 translations each, bringing the
total of distinct (!) paragraphs to 700000.
The goal is to stimulate content reuse from the authoring interface.
I want to show the authors about 10 paragraphs which contain a search phrase
and, here it comes, ordered by the number of occurrences of that paragraph in
the collection.
I assume that a distinct paragraph only occurs once in a translation.

I realize that I'm trying to achieve something close to impossible;
expecting fast results from a query that compares a large part of the db
against the whole db, but I'm amazed that I've come this far and I'd like to
see if I can get this to the next level.

I started with the following query:

 (for $para in cts:search(//paragraph,
cts:element-word-query(xs:QName("paragraph"), "search phrase"))
  let $count := xdmp:estimate(cts:search(//paragraph,
cts:element-word-query(xs:QName("paragraph"), $para)))
  order by number($count) descending
  return
  <result count="{$count}">
    {$para}
  </result>
  )[1 to 10]

There are two problems with this approach:
1. it is far too slow
2. it returns multiple occurrences of the same content

I've been able to improve performance with the following measures:
- Capping the number of initial search results.
- Refragmenting the database on <translation/> level.
- Making <paragraph/> the root of a field.
- Reducing the scope of the query to one language using a [...@lang="EN"]
predicate, but that slowed things down.
- Using simple scoring, which improved performance and accuracy, as relevance
seems to contradict my quest for the most occurrences.

To eliminate the multiple occurrences I've used fn:distinct-values, but the
downside is that it returns a string and I need the paragraph element
including all markup.
Now my new query is:

 (for $p in fn:distinct-values(
    cts:search(
      /manual/translation//paragraph,
      cts:field-word-query("paragraph", "search query"),
      ("score-simple"))[1 to 250])
  let $count := xdmp:estimate(
    cts:search(
      /manual/translation//paragraph,
      cts:field-word-query("paragraph", $p),
      ("score-simple")))
  order by number($count) descending
  return <result count="{$count}">{$p}</result>
)[1 to 10]

This is often very fast, but can take far too long if I happen to hit a
batch of documents/fragments that weren't hit recently.

Is there more I can do here?
Or is there a completely different approach that may yield better results?
And how do I get mixed content results?

Thanks for reading through all this!

Kind regards,

Laurens van den Oever
Xopus BV

http://xopus.com
+31 70 4452345
KvK 27301795