Hi Danny,

 

I did include "item-frequency" in my actual code, but neglected to
include it in the example.  I have now added it and set the default
collation, but I still get a mismatch.  Thanks for getting back to me.

 

Tim

 

  _____  

From: [email protected]
[mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: Thursday, June 04, 2009 12:19 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] Collations, lexicon frequency
counts,and cst:search counts

 

Hi Tim,

 

There are a couple of things going on here.  

 

1)      As for the lexicon counts from cts:element-value-match(), if you
want the total number of occurrences of the value, add "item-frequency" to
the options.  "fragment-frequency" is the default, which counts a value once
per fragment, no matter how many times it matches in that fragment.

 

2)      As for the collations, you are comparing apples and oranges with
cts:element-value-match() vs. cts:element-value-query().  If you want them
to behave the same, declare the collation in the query prolog as follows:

 

declare default collation "http://marklogic.com/collation//S1/AS/T00BB";

 

The reason the collations match different things is that the collation you
specified is diacritic/case/punctuation/space-insensitive.  For example:

 

xquery version "1.0-ml";

 

declare default collation "http://marklogic.com/collation/";

 

(: two spaces between the words  :)

"Some Title" eq "Some  Title"

 

returns false, but this:

 

xquery version "1.0-ml";

 

declare default collation "http://marklogic.com/collation//S1/AS/T00BB";

 

(: two spaces between the words  :)

"Some Title" eq "Some  Title"

 

returns true.
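
Putting the two points together, a query along the following lines may be a
reasonable starting point.  It is only a sketch: it reuses the BookTitle
element and the /DirectoryURI/ directory from Tim's original query, assumes
the BookTitle lexicon was built with the same collation that the prolog
declares, and searches on each $match rather than on $title, which is a
choice made here purely for illustration.

```xquery
xquery version "1.0-ml";

(: Assumption: the BookTitle lexicon uses this same collation. :)
declare default collation "http://marklogic.com/collation//S1/AS/T00BB";

let $title := "Some Title"
for $match in cts:element-value-match(xs:QName("BookTitle"), $title,
    (: "item-frequency" counts every occurrence, not one per fragment :)
    ("item-frequency", "item-order", "ascending"))
(: Illustrative choice: search on each lexicon value rather than $title :)
let $count := count(
    cts:search(xdmp:directory("/DirectoryURI/", "infinity")/Record,
        cts:element-value-query(xs:QName("BookTitle"), $match)))
return element match {
    attribute frequency { cts:frequency($match) },
    attribute count { $count },
    $match
}
```

Note that count() here counts matching Record nodes, so if a single Record
holds the same title more than once, the item frequency may still come out
larger than the search count.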

 

Clear as mud?

 

-Danny

                                                             

From: [email protected]
[mailto:[email protected]] On Behalf Of Tim Meagher
Sent: Thursday, June 04, 2009 4:56 AM
To: 'General Mark Logic Developer Discussion'
Subject: [MarkLogic Dev General] Collations, lexicon frequency counts, and
cst:search counts

 

Hi folks,

 

I have developed an analysis tool in which I am using a lexicon for book
titles to obtain quick frequency counts, to perform lexicon searches with
cts:element-value-match(), and then to perform a cts:search() using the
cts:element-value-query() function for each resulting book title in the
lexicon.  The problem is that the frequency counts for each book title do
not match the counts of occurrences of that book title in the cts:search().
The first book title lexicon was configured with the collation
"http://marklogic.com/collation//S1/AS/T00BB".  I wrote the following xquery
to demonstrate the problem:

 

xquery version "1.0-ml";

let $title := "Some Title"

let $matches := cts:element-value-match(xs:QName("BookTitle"), $title,
    (
        "collation=http://marklogic.com/collation//S1/AS/T00BB",
        (:"collation=http://marklogic.com/collation/",:)
        "case-insensitive", "diacritic-insensitive", "item-order", "ascending"
    ))

let $occurrences := count($matches)

for  $match in $matches
    
    let $results := cts:search(
        xdmp:directory("/DirectoryURI/", "infinity")/Record,
        cts:element-value-query(xs:QName("BookTitle"), $title,
            ("case-insensitive", "punctuation-insensitive", "diacritic-insensitive")))
    let $count := count($results)
    
    let $remainder := cts:remainder($results[1])
    
    return element match {
        attribute frequency {cts:frequency($match)},
        attribute count {$count},
        attribute remainder {$remainder},
        $match
   }

Here are the results:


<match frequency="981" count="1003" remainder="1003">Some Title</match>

I built another lexicon using the root collation and after it had reindexed,
I obtained the following results (using the root collation in the
cts:element-value-match() lexicon search function options):

 

<match frequency="20" count="1003" remainder="1003">Some Title</match>

Wow, what a difference the collation makes!  I'm a little perplexed as to
how to "make" the frequency count match up with the actual number of
occurrences and how to adjust the collation and lexicon search so that it
yields the same number of results as cts:search() with
cts:element-value-query().  I have reviewed the collation concepts at
http://userguide.icu-project.org/collation/concepts but I can't quite
determine what to do to ensure that the counts line up.

 

Tim Meagher - AAOM Consulting

 

