Re: [MarkLogic Dev General] Faster way to simulate SQL Group By in ML xquery

Damon Feldman Mon, 25 Jun 2012 15:04:52 -0700

Danny,

I'm glad that helped.


I suspect that iterating through the $logs sequence is actually streaming them 
out of the DB, and you are seeing the time for document access off the disks 
rather than categorization, or your machine may be swapping.

If the physical quantity of log data is large, you'll be limited by I/O 
bottlenecks such as the disks and network, and that may be what you are 
hitting now. I can categorize 25,000 simple, generated docs via:

let $logs :=
  for $i in 1 to 25000
  return <log-entry id="{$i}"><type>{xdmp:random(20)}</type></log-entry>

in about 5 seconds, so I think it may be the document access.

What's your actual requirement? Nobody can view 250,000 docs on the screen at 
once, so I assume they want rollups or averages? You can use range indexes or 
random sampling to accomplish that without moving so much data across the 
network.
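
As a rough sketch of the range-index approach mentioned above: the query below reads group totals straight from an index instead of loading documents. It assumes a string element range index has been configured on bookId, and the userId filter value "u1" is purely illustrative; neither appears in the original thread.

```xquery
xquery version "1.0-ml";

(: Assumption: an element range index (type string) exists on bookId.
   Without that index, cts:element-values raises an error. :)
for $bookId in cts:element-values(
    xs:QName("bookId"),
    (),
    ("frequency-order", "descending"),
    (: Hypothetical filter: restrict the totals to one user's logs. :)
    cts:element-value-query(xs:QName("userId"), "u1"))
return <book id="{$bookId}" total="{cts:frequency($bookId)}"/>
```

By default cts:frequency counts fragments, so the totals match log counts only if each log entry lives in its own document or fragment.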

Yours,
Damon


From: [email protected] On Behalf Of Danny Sinang
Sent: Monday, June 25, 2012 1:14 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Faster way to simulate SQL Group By in ML 
xquery

Hi Damon,

Thanks. My script is generating group totals a lot faster now.

However, local:categorize() takes 3 seconds to load 25,000 XML docs into the 
map.

       let $groupMap := map:map()
       let $_ := for $res in $logs
                 return local:categorize($groupMap, $res/book/bookId, $res)

This is just 3 days' worth of user logs.

We're expecting our customer to query a month's worth of logs at a time.

Is there an even faster solution?

Regards,
Danny

On Fri, Jun 22, 2012 at 11:45 PM, Damon Feldman <[email protected]> wrote:

Danny,

It looks like this one is also suffering from use of predicates. They get 
evaluated for every item in the sequence, so you get O(n^2) behavior when they 
are in a for loop. This is one of the few places I would suggest using 
procedural programming by adding values to a map:map structure. Here's some 
sample code that runs in linear time:


declare function local:categorize($map, $key, $value) {
  let $new-seq := (map:get($map, $key), $value)
  return map:put($map, $key, $new-seq)
};

let $categorized := map:map()
let $items :=
  for $i in 1 to 1000
  return <item idx="{$i}">{if (xdmp:random(1) eq 1) then "one" else "zero"}</item>
let $add-all :=
  for $item in $items
  return local:categorize($categorized, $item/text(), $item)
return
  <groups>
    <ones>{map:get($categorized, "one")}</ones>
    <zeros>{map:get($categorized, "zero")}</zeros>
  </groups>

Use map:keys($map) to get all "groups."
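
For group totals alone, a variation on the same idea is to store a running count per key rather than the items themselves, then walk map:keys to emit the groups. The following is a minimal sketch, not the code from this thread: the sample log elements and the "|" separator used to build a two-column key are assumptions, and the separator must not occur in the ids.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample data standing in for the search results. :)
let $logs := (
  <log><userId>u1</userId><bookId>b1</bookId></log>,
  <log><userId>u1</userId><bookId>b2</bookId></log>,
  <log><userId>u1</userId><bookId>b1</bookId></log>,
  <log><userId>u2</userId><bookId>b1</bookId></log>
)
let $totals := map:map()
let $_ :=
  for $log in $logs
  (: Composite key "userId|bookId" simulates a two-column GROUP BY. :)
  let $key := fn:concat($log/userId, "|", $log/bookId)
  (: Start from 0 when the key has not been seen yet, then add 1. :)
  return map:put($totals, $key, (map:get($totals, $key), 0)[1] + 1)
for $key in map:keys($totals)
order by $key
return
  <group userId="{fn:substring-before($key, '|')}"
         bookId="{fn:substring-after($key, '|')}"
         total="{map:get($totals, $key)}"/>
```

Each log is touched once, so the pass stays linear, and the map holds one integer per group instead of a sequence of documents.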

Yours,
Damon

________________________________
From: [email protected] On Behalf Of Danny Sinang [[email protected]]
Sent: Friday, June 22, 2012 9:08 PM
To: general
Subject: [MarkLogic Dev General] Faster way to simulate SQL Group By in ML 
xquery

Hi,

I'm trying to simulate a SQL SELECT ... Group By functionality via an ML xquery 
script.

So I wrote a searchLogs function that uses search:search to return a list of 
logs that match some given filters. So far, it runs very fast.

For the Group By part, I was able to write something that works but it's 
turning out to be very slow. When searchLogs returns 2,500 rows, it takes 1 
minute or so to generate group totals.

The code is a bit complex as it handles multi-column groupBy's recursively. But 
the basic logic goes like this :

let $results := local:searchLogs($searchQuery)
let $userIds := fn:distinct-values($results/userId)
for $userId in $userIds
let $userLogs := $results[userId=$userId]
let $userTotal := fn:count($userLogs)
return (
            <userId>{$userId}</userId>,
            <userTotal>{$userTotal}</userTotal>,
            <book> {
            let $bookIds := fn:distinct-values($userLogs/bookId)
            for $bookId in $bookIds
            let $bookLogs := $userLogs[bookId=$bookId]
            let $bookTotal := fn:count($bookLogs)
            return (
            <bookId>{$bookId}</bookId>,
            <bookTotal>{$bookTotal}</bookTotal>
            )
            } </book>
)

I did some crude timings via xdmp:log() and saw that the red lines above eat up 
like 30 milliseconds for each user / book row. With thousands of rows to be 
processed, the delays all add up and become noticeable.

Can anybody here suggest a way to speed this thing up dramatically?

If not, I'm thinking of inserting the raw results into an SQL table and letting 
SQL do the group totals.

Regards,
Danny


_______________________________________________
General mailing list
[email protected]
http://community.marklogic.com/mailman/listinfo/general
