Hi Kelly,

Thank you for your excellent response. Your solution seems to do exactly what I need.

I have removed my fragmentation and field, set the hash-id attribute on all 4M paragraphs and added the attribute range index. Unfortunately I then got an exception saying that no element-attribute range index exists for the given element/attribute QNames. I couldn't find anything wrong with my settings or with the localnames/namespaces. I assume that the problem was caused by messing with the reindexing settings while refragmenting/reindexing. Is that possible? I've now removed the index and am waiting for the reindexing to complete. After that I will add the index again.

> I also don't think you need to limit to a specific language, but that
> shouldn't slow things down if you want to use it

The query-trace showed that the extra predicate needed to be filtered, while the rest of the XPath could be resolved from the indexes. I had the feeling that removing it resulted in better performance, but I've not done any thorough testing and I had made other changes as well. I will let you know when I have the final results.

Kind regards,
Laurens van den Oever
Xopus BV
http://xopus.com
+31 70 4452345
KvK 27301795

Date: Mon, 27 Jul 2009 10:34:34 -0700
From: Kelly Stirman <[email protected]>
Subject: [MarkLogic Dev General] RE: Sorting by the number of occurences of a paragraph
To: "[email protected]" <[email protected]>

Hi Laurens,

If I follow your design correctly, what I would do is the following:

1) iterate over all your paragraphs and use xdmp:md5() to generate a hash value
2) add this hash value as an attribute to each paragraph, e.g. <paragraph hash-id="abc123">hello world</paragraph>
3) create a string range index in the codepoint collation on the paragraph/@hash-id attribute

Then to return paragraphs in frequency order, you can call cts:element-attribute-values(xs:QName("paragraph"),xs:QName("hash-id"),(),"item-frequency"). You can filter this list with any search expression by adding a cts:query as a further argument (see below).
This approach allows you to quickly get the hash-id in frequency order, with or without a cts:query. You'll then need to go get a paragraph that matches the hash-id. Because there may be many, you can simply grab the first.

  let $q := "search phrase"
  for $id in cts:element-attribute-values(xs:QName("paragraph"),xs:QName("hash-id"),(),"item-frequency",$q)
  return element result {attribute count {cts:frequency($id)},(//paragra...@hash-id eq $id])[1]}

Finally, before doing any of this, I would get rid of your fragmentation. You probably don't need fields, but we can continue to talk about how they might be useful for this task. I also don't think you need to limit to a specific language, but that shouldn't slow things down if you want to use it (be sure to look over our developer guide on using languages, and your server license *may* come into play on this subject).

This should be very fast - well under a second as long as there aren't too many paragraphs being returned. Getting the hash-ids will be resolved out of the indexes, whereas each paragraph returned will incur a disk i/o. 100 or so results should be sub-second.

Kelly

Message: 4
Date: Mon, 27 Jul 2009 16:11:16 +0200
From: Laurens van den Oever <[email protected]>
Subject: [MarkLogic Dev General] Sorting by the number of occurences of a paragraph
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

Hi all,

I'm pretty new to MarkLogic, so chances are that I've made some trivial mistake here. I have roughly the following structure:

  <manual>
    <translation lang="..."><!-- no xml:lang due to legacy -->
      <!-- arbitrary nesting of other elements -->
        <paragraph>

I have about 5000 manuals with on average 16 translations each, bringing the total of distinct (!) paragraphs to 700000. The goal is to stimulate content reuse from the authoring interface.
I want to show the authors about 10 paragraphs that contain a search phrase and, here it comes, ordered by the number of occurrences of that paragraph in the collection. I assume that a distinct paragraph occurs only once in a translation.

I realize that I'm trying to achieve something close to impossible: expecting fast results from a query that compares a large part of the db against the whole db. But I'm amazed that I've come this far and I'd like to see if I can get this to the next level.

I started with the following query:

  (for $para in cts:search(//paragraph,
       cts:element-word-query(xs:QName("paragraph"), "search phrase"))
   let $count := xdmp:estimate(cts:search(//paragraph,
       cts:element-word-query(xs:QName("paragraph"), $para)))
   order by number($count) descending
   return
     <result count="{$count}">
       {$para}
     </result>
  )[1 to 10]

There are two problems with this approach:

1. it is far too slow
2. it returns multiple occurrences of the same content

I've been able to improve performance with the following measures:

- Maximizing the number of initial search results.
- Refragmenting the database on <translation/> level.
- Making <paragraph/> the root of a field.
- Reducing the scope of the query to one language using a [...@lang="EN"] predicate, although that slowed things down.
- Using simple scoring, which improved performance and accuracy, as relevance seems to contradict my quest for the most occurrences.

To eliminate the multiple occurrences I've used fn:distinct-values, but the downside is that it returns a string and I need the paragraph element including all markup.
Now my new query is:

  (for $p in fn:distinct-values(
       cts:search(
         /manual/translation//paragraph,
         cts:field-word-query("paragraph", "search query"),
         ("score-simple"))[1 to 250])
   let $count := xdmp:estimate(
       cts:search(
         /manual/translation//paragraph,
         cts:field-word-query("paragraph", $p),
         ("score-simple")))
   order by number($count) descending
   return <result count="{$count}">{$p}</result>
  )[1 to 10]

This is often very fast, but can take far too long if I happen to hit a batch of documents/fragments that weren't hit recently.

Is there more I can do here? Or is there a completely different approach that may yield better results? And how do I get mixed content results?

Thanks for reading through all this!

Kind regards,
Laurens van den Oever
Xopus BV
http://xopus.com
+31 70 4452345
KvK 27301795
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
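
The hash-and-count idea from Kelly's three steps can be modelled outside MarkLogic to see why it answers both of Laurens's questions: duplicates collapse onto one hash, and the first element seen for each hash keeps its markup. The sketch below is plain Python, not MarkLogic code. hashlib.md5 stands in for xdmp:md5(), an in-memory Counter stands in for the range index, a substring test is a crude stand-in for a cts:query, and the names paragraph_hash, top_paragraphs and corpus are hypothetical.

```python
import hashlib
from collections import Counter


def paragraph_hash(markup: str) -> str:
    """MD5 hex digest of the paragraph markup (stand-in for xdmp:md5())."""
    return hashlib.md5(markup.encode("utf-8")).hexdigest()


def top_paragraphs(paragraphs, phrase, limit=10):
    """Return (count, markup) pairs for paragraphs containing `phrase`,
    most frequent first, with one representative per distinct paragraph."""
    counts = Counter()   # hash -> number of occurrences in the collection
    first_seen = {}      # hash -> first paragraph with that hash, markup intact
    for markup in paragraphs:
        h = paragraph_hash(markup)
        counts[h] += 1
        first_seen.setdefault(h, markup)

    # Assumption: a substring test replaces the real search expression.
    matches = [h for h, markup in first_seen.items() if phrase in markup]
    matches.sort(key=lambda h: counts[h], reverse=True)
    return [(counts[h], first_seen[h]) for h in matches[:limit]]


# Made-up sample data, for illustration only.
corpus = [
    "<paragraph>Press the <b>red</b> button.</paragraph>",
    "<paragraph>Press the green button.</paragraph>",
    "<paragraph>Press the <b>red</b> button.</paragraph>",
    "<paragraph>Close the lid.</paragraph>",
    "<paragraph>Press the <b>red</b> button.</paragraph>",
    "<paragraph>Press the green button.</paragraph>",
]

results = top_paragraphs(corpus, "button")
for count, markup in results:
    print(count, markup)
```

Each distinct paragraph appears once in the result with its inline markup preserved, which is what fn:distinct-values could not provide because it returns strings. In MarkLogic the counting would presumably come from the range index rather than from a pass over every paragraph, so the cost model of this sketch does not carry over.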
