RE: [MarkLogic Dev General] Most Frequently Occurring Words

Danny Sokolsky Tue, 11 Nov 2008 10:57:31 -0800

You can write some XQuery to iterate through every word in the document
and count the occurrences of each unique word.  Below is an example of
how you might do this.  It uses a word lexicon to get the unique words
in the document quickly.  Then, for each unique word, it has to look
through every word in the document and count the matches.  These
comparisons are the expensive part, as there are a lot of them: the
number of unique words times the total number of words.  If the
document is relatively small (say less than a few thousand words), it
will perform reasonably well.  As the document gets bigger, the number
of comparisons grows rapidly, and it does not take long before you have
to make hundreds of millions of comparisons (for example, 5,000 unique
words in a 50,000-word document is 250 million comparisons).


Depending on what you are trying to accomplish with this information,
there might be a better way.  For example, cts:search will find you the
most relevant documents based on your search query (cts:query).

Here is an example of how to brute-force iterate through a document and
count occurrences of each word, then grab the top 20.  It requires that
you have a word lexicon configured for your database.  For each unique
word in the document (obtained via the word lexicon call), it iterates
through every word in the document and counts the matches.

xquery version "1.0-ml";

let $uri := "/mydocs/foo.xml"
let $unique-words := cts:words("", (), 
       cts:document-query($uri))
let $unique-count := fn:count($unique-words)
let $all-words := 
   for $token in cts:tokenize(doc($uri))
   return
     typeswitch ($token)
     case $token as cts:punctuation return ()
     case $token as cts:word return $token
     case $token as cts:space return ()
     default return ()
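(: mutable counter, reset for each unique word via xdmp:set :)
let $set-var := 0
let $unsorted :=
  for $z in $unique-words
  return (
    xdmp:set($set-var, 0),
    (: compare this unique word against every word in the document :)
    for $word in $all-words
    return (
      if ($z eq $word)
      then xdmp:set($set-var, $set-var + 1)
      else () ),
    (: emit a <word> element only if the word actually occurred :)
    if ($set-var eq 0) then () else
      <word>
        <text>{$z}</text>
        <count>{$set-var}</count>
      </word>
  )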
let $top-20 :=
  (for $sort in $unsorted
  order by xs:integer($sort//count/text()) descending
  return 
  $sort)[1 to 20]
return
<results>
  <unique-count>{$unique-count}</unique-count>
  <all-count>{fn:count($all-words)}</all-count>
  <words>{$top-20}</words>
</results>
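
If the nested loop above becomes too slow, one possible alternative is
to count in a single pass, keeping a running count per word in a map.
The following is only a sketch, not something from the query above: it
assumes your MarkLogic release provides the map: built-in functions
(map:map, map:get, map:put, map:keys), and the names $counts and $_ are
just illustrative.  It counts words exactly as cts:tokenize returns them
(so "The" and "the" are counted separately) and does not use the word
lexicon, so its counts may differ slightly from the lexicon-based query.

```xquery
xquery version "1.0-ml";

(: Sketch only: assumes the map: built-ins are available on your server. :)
let $uri := "/mydocs/foo.xml"
let $counts := map:map()
(: One pass over the tokens; each word bumps its own counter in the map. :)
let $_ :=
  for $token in cts:tokenize(fn:string(fn:doc($uri)))
  return
    typeswitch ($token)
    case cts:word return
      (: map:get returns () for a word not seen yet, so fall back to 0 :)
      map:put($counts, $token, 1 + (map:get($counts, $token), 0)[1])
    default return ()
let $top-20 :=
  (for $word in map:keys($counts)
   let $count := map:get($counts, $word)
   order by $count descending
   return
     <word>
       <text>{$word}</text>
       <count>{$count}</count>
     </word>)[1 to 20]
return
<results>
  <unique-count>{fn:count(map:keys($counts))}</unique-count>
  <words>{$top-20}</words>
</results>
```

With this approach the work grows with the total number of words in the
document rather than with the number of unique words times the total
number of words.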

-Danny

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
Sakthikumar, Vasu
Sent: Monday, November 10, 2008 10:29 AM
To: 'General Mark Logic Developer Discussion'
Cc: Masciovecchio, Thomas
Subject: [MarkLogic Dev General] Most Frequently Occurring Words

How do I find the top 20 most frequently occurring words in an xml
document?

Thanks

Vasu


_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general