Hi all,
I'm pretty new to MarkLogic, so chances are that I've made some trivial
mistake here.
I have roughly the following structure:
<manual>
  <translation lang="..."><!-- no xml:lang due to legacy -->
    <!-- arbitrary nesting of other elements -->
    <paragraph>...</paragraph>
  </translation>
</manual>
I have about 5000 manuals with 16 translations each on average, bringing the
total number of distinct (!) paragraphs to 700,000.
The goal is to stimulate content reuse from the authoring interface.
I want to show the authors about 10 paragraphs that contain a search phrase
and, here it comes, ordered by the number of occurrences of that paragraph in
the collection.
I assume that a distinct paragraph only occurs once in a translation.
I realize that I'm trying to achieve something close to impossible:
expecting fast results from a query that compares a large part of the db
against the whole db. Still, I'm amazed that I've come this far, and I'd like
to see if I can take this to the next level.
I started with the following query:
(for $para in cts:search(//paragraph,
    cts:element-word-query(xs:QName("paragraph"), "search phrase"))
 (: estimate how many fragments contain this paragraph's text :)
 let $count := xdmp:estimate(cts:search(//paragraph,
    cts:element-word-query(xs:QName("paragraph"), $para)))
 order by number($count) descending
 return
   <result count="{$count}">
     {$para}
   </result>
)[1 to 10]
There are two problems with this approach:
1. it is far too slow
2. it returns multiple occurrences of the same content
I've been able to improve performance with the following measures:
- Capping the number of initial search results.
- Refragmenting the database at the <translation/> level.
- Making <paragraph/> the root of a field.
- Switching to simple scoring, which improved both performance and accuracy,
since relevance ranking seems to work against my quest for the most
occurrences.
I also tried restricting the query to one language with a [...@lang="EN"]
predicate, but that slowed things down.
To eliminate the multiple occurrences I've used fn:distinct-values, but the
downside is that it returns strings, and I need the paragraph elements
including all markup.
Now my new query is:
(for $p in fn:distinct-values(
    cts:search(
      /manual/translation//paragraph,
      cts:field-word-query("paragraph", "search query"),
      ("score-simple"))[1 to 250])
 (: estimate how many fragments contain this distinct paragraph text :)
 let $count := xdmp:estimate(
    cts:search(
      /manual/translation//paragraph,
      cts:field-word-query("paragraph", $p),
      ("score-simple")))
 order by number($count) descending
 return <result count="{$count}">{$p}</result>
)[1 to 10]
This is often very fast, but can take far too long if I happen to hit a
batch of documents/fragments that weren't hit recently.
Is there more I can do here?
Or is there a completely different approach that may yield better results?
And how do I get mixed content results?
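For the mixed content part, one thing I'm considering is to deduplicate on the
string value but return the first element that has that value. This is only an
untested sketch: the variable names ($hits, $text) are made up for
illustration, and I'm assuming the field and options behave as in my query
above.

```xquery
(: Untested sketch: dedupe on string value, but keep the element itself. :)
let $hits :=
  cts:search(
    /manual/translation//paragraph,
    cts:field-word-query("paragraph", "search query"),
    ("score-simple"))[1 to 250]
return
  (for $text in fn:distinct-values(for $h in $hits return fn:string($h))
   (: first hit whose text equals this distinct value, inline markup intact :)
   let $p := ($hits[fn:string(.) eq $text])[1]
   (: same estimate as before, driven by the text instead of the element :)
   let $count := xdmp:estimate(
     cts:search(
       /manual/translation//paragraph,
       cts:field-word-query("paragraph", $text),
       ("score-simple")))
   order by $count descending
   return <result count="{$count}">{$p}</result>
  )[1 to 10]
```

This scans the (at most 250) hits once per distinct text, which I expect to be
cheap next to the estimates, but it does nothing about the slow cold-cache case.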
Thanks for reading through all this!
Kind regards,
Laurens van den Oever
Xopus BV
http://xopus.com
+31 70 4452345
KvK 27301795