Danny,
I'm glad that helped.
I suspect that iterating through the $logs sequence is actually streaming them
out of the DB, and you are seeing the time for document access off the disks
rather than categorization, or your machine may be swapping.
If the physical quantity of log data is large you'll be limited by various I/O
bottlenecks like the disks and network, and that may kick in now. I can
categorize 25,000 simple, generated docs via:
let $logs :=
for $i in 10 to 25000
return <log-entry id="{$i}"><type>{xdmp:random(20)}</type></log-entry>
in about 5 seconds, so I think it may be the document access.
What's your actual requirement? Nobody can view 250,000 docs on the screen at
once, so I assume they want rollups or averages? You can use range indexes or
random sampling to accomplish that without moving so much data across the
network.
Yours,
Damon
From: [email protected]
[mailto:[email protected]] On Behalf Of Danny Sinang
Sent: Monday, June 25, 2012 1:14 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Faster way to simulate SQL Group By in ML
xquery
Hi Damon,
Thanks. My script is generating group totals a lot faster now.
However, local:categorize() takes 3 seconds to load 25,000 XML docs into the
map.
let $groupMap := map:map()
let $_ := for $res in $logs
return local:categorize($groupMap, $res/book/bookId, $res)
This is just 3 days' worth of user logs.
We're expecting our customer to query a month's worth of logs at a time.
Is there an even faster solution ?
Regards,
Danny
On Fri, Jun 22, 2012 at 11:45 PM, Damon Feldman
<[email protected]<mailto:[email protected]>> wrote:
Danny,
It looks like this one is also suffering from use of predicates. They get
evaluated for every item in the sequence, so you get O(n^2) behavior when they
are in a for loop. This is one of the few places I would suggest using
procedural programming by adding values to a map:map structure. Here's some
sample code that runs in linear time:
declare function local:categorize($map, $key, $value) {
let $new-seq := (map:get($map, $key), $value)
return map:put($map, $key, $new-seq)
};
let $categorized := map:map()
let $items := for $i in 1 to 1000 return <item idx="{$i}">{if (xdmp:random(1)
eq 1) then "one" else "zero"}</item>
let $add-all :=
for $item in $items
return local:categorize($categorized, $item/text(), $item)
return
<groups>
<ones>{map:get($categorized, "one")}</ones>
<zeros>{map:get($categorized, "zero")}</zeros>
</groups>
Use map:keys($map) to get all "groups."
Yours,
Damon
________________________________
From:
[email protected]<mailto:[email protected]>
[[email protected]<mailto:[email protected]>]
On Behalf Of Danny Sinang [[email protected]<mailto:[email protected]>]
Sent: Friday, June 22, 2012 9:08 PM
To: general
Subject: [MarkLogic Dev General] Faster way to simulate SQL Group By in ML
xquery
Hi,
I'm trying to simulate a SQL SELECT ... Group By functionality via an ML xquery
script.
So I wrote a searchLogs function that uses search:search to return a list of
logs that match some given filters. So far, it runs very fast.
For the Group By part, I was able to write something that works but it's
turning out to be very slow. When searchLog returns 2,500 rows, it takes 1
minute or so to generate group totals.
The code is a bit complex as it handles multi-column groupBy's recursively. But
the basic logic goes like this :
let $results := local:seachLogs($searchQuery)
let $userIds := fn:distinct-values($results/userId)
for $userId in $userIds
let $userLogs := $results[userId=$userId]
let $userTotal := fn:count($userLogs)
return
<userId>{$userId}</userId>
<userTotal>{$userTotal}</userTotal>
<book> {
let $bookIds := fn:distinct-values($userLogs)
for $bookId in $bookIds
let $bookLogs := $userLogs[bookId=$bookId]
let $bookTotal := fn:count($bookLogs)
return
<bookId>{$bookId}</bookId>
<bookTotal>{$bookTotal}</bookTotal>
} </book>
I did some crude timings via xdmp:log() and saw that the red lines above eat up
like 30 milliseconds for each user / book row. With thousands of rows to be
processed, the delays all add up and become noticeable.
Can anybody here suggest a way to speed this thing up dramatically ?
If not, I'm thinking of inserting the raw results into an SQL table and letting
SQL do the group totals.
Regards,
Danny
_______________________________________________
General mailing list
[email protected]<mailto:[email protected]>
http://community.marklogic.com/mailman/listinfo/general
_______________________________________________
General mailing list
[email protected]
http://community.marklogic.com/mailman/listinfo/general