Danny,
It looks like this one is also suffering from use of predicates. They get
evaluated for every item in the sequence, so you get O(n^2) behavior when they
are in a for loop. This is one of the few places I would suggest using
procedural programming by adding values to a map:map structure. Here's some
sample code that runs in linear time:
declare function local:categorize($map, $key, $value) {
let $new-seq := (map:get($map, $key), $value)
return map:put($map, $key, $new-seq)
};
let $categorized := map:map()
let $items := for $i in 1 to 1000 return <item idx="{$i}">{if (xdmp:random(1)
eq 1) then "one" else "zero"}</item>
let $add-all :=
for $item in $items
return local:categorize($categorized, $item/text(), $item)
return
<groups>
<ones>{map:get($categorized, "one")}</ones>
<zeros>{map:get($categorized, "zero")}</zeros>
</groups>
Use map:keys($map) to get all "groups."
Yours,
Damon
________________________________
From: [email protected]
[[email protected]] On Behalf Of Danny Sinang
[[email protected]]
Sent: Friday, June 22, 2012 9:08 PM
To: general
Subject: [MarkLogic Dev General] Faster way to simulate SQL Group By in ML
xquery
Hi,
I'm trying to simulate a SQL SELECT ... Group By functionality via an ML xquery
script.
So I wrote a searchLogs function that uses search:search to return a list of
logs that match some given filters. So far, it runs very fast.
For the Group By part, I was able to write something that works but it's
turning out to be very slow. When searchLog returns 2,500 rows, it takes 1
minute or so to generate group totals.
The code is a bit complex as it handles multi-column groupBy's recursively. But
the basic logic goes like this :
let $results := local:seachLogs($searchQuery)
let $userIds := fn:distinct-values($results/userId)
for $userId in $userIds
let $userLogs := $results[userId=$userId]
let $userTotal := fn:count($userLogs)
return
<userId>{$userId}</userId>
<userTotal>{$userTotal}</userTotal>
<book> {
let $bookIds := fn:distinct-values($userLogs)
for $bookId in $bookIds
let $bookLogs := $userLogs[bookId=$bookId]
let $bookTotal := fn:count($bookLogs)
return
<bookId>{$bookId}</bookId>
<bookTotal>{$bookTotal}</bookTotal>
} </book>
I did some crude timings via xdmp:log() and saw that the red lines above eat up
like 30 milliseconds for each user / book row. With thousands of rows to be
processed, the delays all add up and become noticeable.
Can anybody here suggest a way to speed this thing up dramatically ?
If not, I'm thinking of inserting the raw results into an SQL table and letting
SQL do the group totals.
Regards,
Danny
_______________________________________________
General mailing list
[email protected]
http://community.marklogic.com/mailman/listinfo/general