Danny,

It looks like this one is also suffering from the use of predicates. A
predicate is evaluated against every item in the sequence, so putting one
inside a for loop gives you O(n^2) behavior. This is one of the few places
where I would suggest procedural programming: add the values to a map:map
structure. Here's some sample code that runs in linear time:


(: Append $value to the sequence already stored under $key. :)
declare function local:categorize($map, $key, $value) {
  let $new-seq := (map:get($map, $key), $value)
  return map:put($map, $key, $new-seq)
};

let $categorized := map:map()
(: Sample data: 1000 items, each randomly labelled "one" or "zero". :)
let $items :=
  for $i in 1 to 1000
  return <item idx="{$i}">{if (xdmp:random(1) eq 1) then "one" else "zero"}</item>
(: One pass over the items; the map is updated in place. :)
let $add-all :=
  for $item in $items
  return local:categorize($categorized, $item/text(), $item)
return
  <groups>
    <ones>{map:get($categorized, "one")}</ones>
    <zeros>{map:get($categorized, "zero")}</zeros>
  </groups>

Use map:keys($map) to get all "groups."
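
As a rough sketch of how map:keys could drive the group totals asked about in the original question: the version below builds the same kind of map and then emits one element per key with its count. The <group> element and its key and total attributes are made up for illustration, and fn:string($item) is used as the key here instead of $item/text(); treat it as a starting point under those assumptions, not a tested solution.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample data: 1000 items randomly labelled "one" or "zero". :)
let $items :=
  for $i in 1 to 1000
  return <item idx="{$i}">{if (xdmp:random(1) eq 1) then "one" else "zero"}</item>
let $categorized := map:map()
(: Single pass: append each item to the sequence stored under its key. :)
let $add-all :=
  for $item in $items
  let $key := fn:string($item)
  return map:put($categorized, $key, (map:get($categorized, $key), $item))
(: One <group> per distinct key, with the number of items in that group. :)
return
  <groups>{
    for $key in map:keys($categorized)
    return <group key="{$key}" total="{fn:count(map:get($categorized, $key))}"/>
  }</groups>
```

Because the keys come from the map itself, the group names do not have to be known in advance, which is closer to what a SQL Group By does.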

Yours,
Damon

________________________________
From: [email protected] 
[[email protected]] On Behalf Of Danny Sinang 
[[email protected]]
Sent: Friday, June 22, 2012 9:08 PM
To: general
Subject: [MarkLogic Dev General] Faster way to simulate SQL Group By in ML 
xquery

Hi,

I'm trying to simulate a SQL SELECT ... Group By functionality via an ML xquery 
script.

So I wrote a searchLogs function that uses search:search to return a list of 
logs that match some given filters. So far, it runs very fast.

For the Group By part, I was able to write something that works, but it's
turning out to be very slow. When searchLogs returns 2,500 rows, it takes 1
minute or so to generate group totals.

The code is a bit complex as it handles multi-column Group Bys recursively, but
the basic logic goes like this:

let $results := local:searchLogs($searchQuery)
let $userIds := fn:distinct-values($results/userId)
for $userId in $userIds
let $userLogs := $results[userId = $userId]
let $userTotal := fn:count($userLogs)
return (
  <userId>{$userId}</userId>,
  <userTotal>{$userTotal}</userTotal>,
  <book>{
    let $bookIds := fn:distinct-values($userLogs/bookId)
    for $bookId in $bookIds
    let $bookLogs := $userLogs[bookId = $bookId]
    let $bookTotal := fn:count($bookLogs)
    return (
      <bookId>{$bookId}</bookId>,
      <bookTotal>{$bookTotal}</bookTotal>
    )
  }</book>
)

I did some crude timings via xdmp:log() and saw that the red lines above eat up 
like 30 milliseconds for each user / book row. With thousands of rows to be 
processed, the delays all add up and become noticeable.

Can anybody here suggest a way to speed this thing up dramatically?

If not, I'm thinking of inserting the raw results into an SQL table and letting 
SQL do the group totals.

Regards,
Danny

_______________________________________________
General mailing list
[email protected]
http://community.marklogic.com/mailman/listinfo/general
