Hi Tim,

There are a couple of things going on here.

1) Lexicon counts from cts:element-value-match(): if you want the total number of occurrences of the value, add the "item-frequency" option to the options. "fragment-frequency" is the default, which counts once per fragment, no matter how many matches there are in that fragment.

2) Collations: you are comparing apples and oranges with cts:element-value-match() vs. cts:element-value-query(). The match uses the collation you pass in its options, while the query uses the default collation of the query. If you want them to be the same, declare the collation in the query prolog as follows:

    declare default collation "http://marklogic.com/collation//S1/AS/T00BB";

The reason the two collations match different things is that the collation you specified is diacritic/case/punctuation/space-insensitive. For example:

    xquery version "1.0-ml";
    declare default collation "http://marklogic.com/collation/";
    (: two spaces between the words :)
    "Some  Title" eq "Some Title"

returns false, but this:

    xquery version "1.0-ml";
    declare default collation "http://marklogic.com/collation//S1/AS/T00BB";
    (: two spaces between the words :)
    "Some  Title" eq "Some Title"

returns true.

Clear as mud?

-Danny

From: [email protected] [mailto:[email protected]] On Behalf Of Tim Meagher
Sent: Thursday, June 04, 2009 4:56 AM
To: 'General Mark Logic Developer Discussion'
Subject: [MarkLogic Dev General] Collations, lexicon frequency counts, and cst:search counts

Hi folks,

I have developed an analysis tool that uses a lexicon of book titles to obtain quick frequency counts, runs lexicon searches with cts:element-value-match(), and then runs a cts:search() with cts:element-value-query() for each resulting book title in the lexicon. The problem is that the frequency count for each book title does not match the number of occurrences of that title found by cts:search(). The first book title lexicon was configured with the collation "http://marklogic.com/collation//S1/AS/T00BB".
I wrote the following XQuery to demonstrate the problem:

    xquery version "1.0-ml";

    let $title := "Some Title"
    let $matches :=
      cts:element-value-match(xs:QName("BookTitle"), $title,
        ("collation=http://marklogic.com/collation//S1/AS/T00BB",
         (:"collation=http://marklogic.com/collation/",:)
         "case-insensitive", "diacritic-insensitive",
         "item-order", "ascending"))
    let $occurrences := count($matches)
    for $match in $matches
    let $results :=
      cts:search(xdmp:directory("/DirectoryURI/", "infinity")/Record,
        cts:element-value-query(xs:QName("BookTitle"), $title,
          ("case-insensitive", "punctuation-insensitive",
           "diacritic-insensitive")))
    let $count := count($results)
    let $remainder := cts:remainder($results[1])
    return
      element match {
        attribute frequency {cts:frequency($match)},
        attribute count {$count},
        attribute remainder {$remainder},
        $match
      }

Here are the results:

    <match frequency="981" count="1003" remainder="1003">Some Title</match>

I built another lexicon using the root collation, and after it had reindexed I obtained the following results (using the root collation in the cts:element-value-match() options):

    <match frequency="20" count="1003" remainder="1003">Some Title</match>

Wow, what a difference the collation makes! I'm a little perplexed as to how to make the frequency count match the actual number of occurrences, and how to adjust the collation and lexicon search so that they yield the same number of results as cts:search() with cts:element-value-query(). I have reviewed the collation concepts at http://userguide.icu-project.org/collation/concepts but I can't quite determine what to do to ensure that the counts line up.

Tim Meagher - AAOM Consulting
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
