Hi Tim,

There are a couple of things going on here.


1)      As for the lexicon counts from cts:element-value-match(), if you want the 
total number of occurrences of the value, add the "item-frequency" option to 
the options.  "fragment-frequency" is the default, which counts a value once 
per fragment, no matter how many times it occurs in that fragment.
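
As a rough sketch of what that looks like (untested against your data; the 
BookTitle lexicon, the title and the collation are simply copied from your 
query, and I have dropped the other options to keep it short):

```xquery
xquery version "1.0-ml";

(: Sketch only: "item-frequency" replaces the default "fragment-frequency",
   so cts:frequency() reports every occurrence of the value rather than
   one per fragment. :)
for $match in cts:element-value-match(xs:QName("BookTitle"), "Some Title",
    (
        "collation=http://marklogic.com/collation//S1/AS/T00BB",
        "item-frequency"
    ))
return element match {
    attribute frequency { cts:frequency($match) },
    $match
}
```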



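2)      As for the collations, you are comparing apples and oranges: 
cts:element-value-match() is using the collation given in its options, while 
cts:element-value-query() is using the default collation of the query.  If you 
want them to be the same, declare the collation in the query prolog as follows: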



declare default collation "http://marklogic.com/collation//S1/AS/T00BB";



The two collations match different things because the collation you specified 
is diacritic/case/punctuation/space-insensitive, while the root collation is 
not.  For example:



xquery version "1.0-ml";



declare default collation "http://marklogic.com/collation/";



(: two spaces between the words  :)

"Some Title" eq "Some  Title"



returns false, but this:



xquery version "1.0-ml";



declare default collation "http://marklogic.com/collation//S1/AS/T00BB";



(: two spaces between the words  :)

"Some Title" eq "Some  Title"



returns true.
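
Putting both points together, a reworked version of your query might look 
something like the sketch below.  Treat it as a starting point rather than a 
tested fix: the directory URI, element name and title are copied from your 
query, I have dropped the extra sensitivity options for brevity, and passing 
$match (rather than $title) to cts:element-value-query() is my assumption about 
what you intended.  Also keep in mind that an "item-frequency" count is a 
number of occurrences while count() over the search is a number of Record 
nodes, so the two may still differ if a Record holds the same title more than 
once.

```xquery
xquery version "1.0-ml";

(: Same collation for the prolog and for the lexicon lookup. :)
declare default collation "http://marklogic.com/collation//S1/AS/T00BB";

let $title := "Some Title"
for $match in cts:element-value-match(xs:QName("BookTitle"), $title,
    (
        "collation=http://marklogic.com/collation//S1/AS/T00BB",
        "item-frequency"
    ))
(: Assumption: search for each lexicon value rather than the original $title. :)
let $results := cts:search(
    xdmp:directory("/DirectoryURI/", "infinity")/Record,
    cts:element-value-query(xs:QName("BookTitle"), $match))
return element match {
    attribute frequency { cts:frequency($match) },
    attribute count { count($results) },
    $match
}
```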

Clear as mud?

-Danny

From: [email protected] 
[mailto:[email protected]] On Behalf Of Tim Meagher
Sent: Thursday, June 04, 2009 4:56 AM
To: 'General Mark Logic Developer Discussion'
Subject: [MarkLogic Dev General] Collations, lexicon frequency counts, and 
cst:search counts


Hi folks,



I have developed an analysis tool in which I am using a lexicon for book titles 
to obtain quick frequency counts, to perform lexicon searches with 
cts:element-value-match(), and then to perform a cts:search() using the 
cts:element-value-query() function for each resulting book title in the 
lexicon.  The problem is that the frequency counts for each book title do not 
match the counts of occurrences of that book title in the cts:search().  The 
first book title lexicon was configured with the collation 
"http://marklogic.com/collation//S1/AS/T00BB".  I wrote the following XQuery to 
demonstrate the problem:



xquery version "1.0-ml";

let $title := "Some Title"

let $matches := cts:element-value-match(xs:QName("BookTitle"), $title,
    (
        "collation=http://marklogic.com/collation//S1/AS/T00BB",
        (:"collation=http://marklogic.com/collation/",:)
        "case-insensitive" , "diacritic-insensitive", "item-order", "ascending"
    ))

let $occurrences := count($matches)

for  $match in $matches

    let $results := cts:search(xdmp:directory("/DirectoryURI/","infinity")/Record,
                cts:element-value-query(xs:QName("BookTitle"), $title,
                    ("case-insensitive", "punctuation-insensitive", "diacritic-insensitive")))
    let $count := count($results)

    let $remainder := cts:remainder($results[1])

    return element match {
        attribute frequency {cts:frequency($match)},
        attribute count {$count},
        attribute remainder {$remainder},
        $match
   }

Here are the results:

<match frequency="981" count="1003" remainder="1003">Some Title</match>

I built another lexicon using the root collation and after it had reindexed, I 
obtained the following results (using the root collation in the 
cts:element-value-match() lexicon search function options):



<match frequency="20" count="1003" remainder="1003">Some Title</match>

Wow, what a difference the collation makes!  I'm a little perplexed as to how 
to "make" the frequency count match up with the actual number of occurrences 
and how to adjust the collation and lexicon search so that it yields the same 
number of results as cts:search() with cts:element-value-query().  I have 
reviewed the collation concepts at 
http://userguide.icu-project.org/collation/concepts but I can't quite determine 
what to do to ensure that the counts line up.



Tim Meagher - AAOM Consulting


_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
