What I finally realized is that cts:element-value-query() was performing
a stemmed search.  That seems to be the primary cause of the count mismatches:
the title "Some Titles" stems to the same form as "Some Title", so it added
to the search counts.
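
For anyone hitting the same thing, here is a minimal sketch of the search with
stemming turned off.  It reuses the element name and directory URI from my
example further down the thread; the "unstemmed" option is the only change.
I am assuming the database's index settings allow unstemmed searches, so
treat this as a starting point rather than a verified fix.

```xquery
xquery version "1.0-ml";

(: Sketch only: same value query as in the original example, plus "unstemmed". :)
let $title := "Some Title"
let $query :=
  cts:element-value-query(xs:QName("BookTitle"), $title,
    ("case-insensitive", "punctuation-insensitive",
     "diacritic-insensitive", "unstemmed"))
return
  (: Count the Record elements whose BookTitle matches without stemming. :)
  count(cts:search(
    xdmp:directory("/DirectoryURI/", "infinity")/Record, $query))
```

With stemming off, "Some Titles" should no longer be counted as a match for
"Some Title".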

 

  _____  

From: [email protected]
[mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: Thursday, June 04, 2009 1:27 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] Collations, lexicon frequency
counts,and cst:search counts

 

Hi Tim,

 

The other thing to keep in mind here is that cts:element-value-query and
cts:element-value-match have subtle differences in their semantics.  The
lexicon APIs (cts:element-values and cts:element-value-match) return exact
string matches per the collation.  The cts:element-value-query constructor
matches elements whose value matches the phrase specified, and applies search
features such as stemming, which are meant to match what you mean rather than
exactly what you typed.

 

For the lexicon APIs, the equivalent cts:query constructors are the
range-query constructors like cts:element-range-query.   Try changing your
cts:query in the cts:search to a cts:element-range-query and see what
happens. 
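
A rough sketch of that change, assuming the BookTitle element has a string
range index configured with the same collation as the lexicon (the element
name, directory URI, and collation are taken from Tim's example, not from a
tested setup):

```xquery
xquery version "1.0-ml";

(: Sketch only: a range query compares values under the collation of the
   range index, the same way the lexicon does, with no stemming applied. :)
let $title := "Some Title"
let $query :=
  cts:element-range-query(xs:QName("BookTitle"), "=", $title,
    "collation=http://marklogic.com/collation//S1/AS/T00BB")
return
  count(cts:search(
    xdmp:directory("/DirectoryURI/", "infinity")/Record, $query))
```

Because the comparison goes through the range index, the count here should
track the lexicon values much more closely than the value query does.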

 

-Danny

 

From: [email protected]
[mailto:[email protected]] On Behalf Of Tim Meagher
Sent: Thursday, June 04, 2009 10:05 AM
To: 'General Mark Logic Developer Discussion'
Subject: RE: [MarkLogic Dev General] Collations, lexicon frequency counts,
and cst:search counts

 

Hi Danny,

 

I did include "item-frequency" in my actual code, but neglected to
include it in the example.  I added it and set the default collation, but I
still get a mismatch.  Thanks for getting back to me.

 

Tim

 

  _____  

From: [email protected]
[mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: Thursday, June 04, 2009 12:19 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] Collations, lexicon frequency
counts,and cst:search counts

 

Hi Tim,

 

There are a couple of things going on here.  

 

1)      As for the lexicon counts from the cts:element-value-match API, if you
want the total number of occurrences of the value, add the "item-frequency"
option to the options.  "fragment-frequency" is the default, which counts a
value once per fragment, no matter how many times it occurs in that fragment.

 

2)      As for the collations, you are comparing apples and oranges with the
e-v-match vs the e-v-query.  If you want them to be the same, declare the
collation in the query prolog as follows:

 

declare default collation "http://marklogic.com/collation//S1/AS/T00BB";
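
To illustrate point 1, here is a small sketch of the lexicon lookup with
"item-frequency" added.  The element name and title come from the example in
Tim's message; the collation is passed explicitly so the sketch does not
depend on the prolog.

```xquery
xquery version "1.0-ml";

(: Sketch only: report how often each matching lexicon value occurs.
   "item-frequency" counts every occurrence; the default,
   "fragment-frequency", counts each fragment once. :)
for $match in cts:element-value-match(xs:QName("BookTitle"), "Some Title",
    ("collation=http://marklogic.com/collation//S1/AS/T00BB",
     "item-frequency", "ascending"))
return
  element match {
    attribute frequency { cts:frequency($match) },
    $match
  }
```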

 

The reason the collations match different things is that the collation you
specified is diacritic/case/punctuation/space-insensitive.  For example:

 

xquery version "1.0-ml";

 

declare default collation "http://marklogic.com/collation/";

 

(: two spaces between the words  :)

"Some Title" eq "Some  Title"

 

returns false, but this:

 

xquery version "1.0-ml";

 

declare default collation "http://marklogic.com/collation//S1/AS/T00BB";

 

(: two spaces between the words  :)

"Some Title" eq "Some  Title"

 

returns true.

 

Clear as mud?

 

-Danny

                                                             

From: [email protected]
[mailto:[email protected]] On Behalf Of Tim Meagher
Sent: Thursday, June 04, 2009 4:56 AM
To: 'General Mark Logic Developer Discussion'
Subject: [MarkLogic Dev General] Collations, lexicon frequency counts, and
cst:search counts

 

Hi folks,

 

I have developed an analysis tool in which I am using a lexicon for book
titles to obtain quick frequency counts, to perform lexicon searches with
cts:element-value-match(), and then to perform a cts:search() using the
cts:element-value-query() function for each resulting book title in the
lexicon.  The problem is that the frequency counts for each book title do
not match the counts of occurrences of that book title in the cts:search().
The first book title lexicon was configured with the collation
"http://marklogic.com/collation//S1/AS/T00BB".  I wrote the following xquery
to demonstrate the problem:

 

xquery version "1.0-ml";

let $title := "Some Title"

(: Look up the title in the BookTitle value lexicon. :)
let $matches := cts:element-value-match(xs:QName("BookTitle"), $title,
    (
        "collation=http://marklogic.com/collation//S1/AS/T00BB",
        (:"collation=http://marklogic.com/collation/",:)
        "case-insensitive", "diacritic-insensitive", "item-order",
        "ascending"
    ))

let $occurences := count($matches)

for $match in $matches

    (: Search for the same title with a value query. :)
    let $results :=
        cts:search(xdmp:directory("/DirectoryURI/", "infinity")/Record,
            cts:element-value-query(xs:QName("BookTitle"), $title,
                ("case-insensitive", "punctuation-insensitive",
                 "diacritic-insensitive")))
    let $count := count($results)

    let $remainder := cts:remainder($results[1])

    return element match {
        attribute frequency {cts:frequency($match)},
        attribute count {$count},
        attribute remainder {$remainder},
        $match
    }

Here are the results:


<match frequency="981" count="1003" remainder="1003">Some Title</match>

I built another lexicon using the root collation and after it had reindexed,
I obtained the following results (using the root collation in the
cts:element-value-match() lexicon search function options):

 

<match frequency="20" count="1003" remainder="1003">Some Title</match>

Wow, what a difference the collation makes!  I'm a little perplexed as to
how to "make" the frequency count match up with the actual number of
occurrences and how to adjust the collation and lexicon search so that it
yields the same number of results as cts:search() with
cts:element-value-query().  I have reviewed the collation concepts at
http://userguide.icu-project.org/collation/concepts but I can't quite
determine what to do to ensure that the counts line up.

 

Tim Meagher - AAOM Consulting

 
