I ran some profiles as well, and I think your profile results may have been a 
bit misleading. I suspect that walking over the uris and summing frequencies is 
one of the major slow parts. I did manage to create a UDF, though (my first 
one). It is relatively generic, and should allow summing frequencies of any 
element or attribute. You can download it from here:

https://github.com/grtjn/doc-count-udf

Get/clone it, run `make` (preferably on the target environment), and follow the 
instructions to install it. After that you can run:

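xquery version "1.0-ml";

(: call the UDF; the result is presumably a map of uri -> summed frequency :)
let $uris := cts:aggregate(
  "gjosten/doc-count",
  "doc-count",
  (
    cts:uri-reference(),
    cts:element-attribute-reference(xs:QName("file"), xs:QName("size"))
  )
)
(: unary minus inverts the map, giving count -> uri(s) :)
let $counts := -$uris
(: map keys are strings, so cast to int to sort numerically :)
let $top-keys :=
  for $key in map:keys($counts)
  order by xs:int($key) descending
  return $key
(: list "uri - count" pairs, highest counts first, and keep the first 10 :)
return (
  for $key in $top-keys
  for $value in map:get($counts, $key)
  return $value || " - " || $key
)[1 to 10]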

I tested with 1k docs: my earlier tuples approach took 14 sec on that set, 
while this one takes less than 1 sec.

Cheers,
Geert

From: Johan Mörén <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Sunday, June 28, 2015 at 12:07 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Find the document(s) with max occurrences 
of an element-attribute reference

Thanks again for looking into this Geert!

I tried a mix of your approach (minus the -$uris part) and mine and got better 
results. But that will not give me the ability to sort the whole database based 
on occurrence; it only gives me the document(s) with the maximum number of 
occurrences. I tried this query in production, where we have 1.4 million 
documents and roughly 25 million file elements in total, and got the result 
back in about 3 minutes. So it was definitely an improvement, but it will not 
scale over time. Thanks for looking down the UDF path. Hopefully this could 
lead to a more general and useful approach.

Cheers,
Johan

On Sat, Jun 27, 2015 at 8:06 PM Geert Josten <[email protected]> wrote:
My approach was similar, but tried to sum all frequencies per uri. 
Unfortunately, that approach gets slower with more documents and more distinct 
file sizes. Adding a simple count attribute or element somewhere in the file 
would greatly simplify the run-time calculation, and that is what I would 
normally recommend. For the sake of completeness I’ll give it some more thought 
to see if there are ways to improve on the 3 minutes. A UDF might be useful; I 
would have to try that.
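
As an illustration of the count-attribute idea, here is a minimal sketch. It 
is not taken from the thread: the attribute name file-count, its placement on 
a doc root element, and the int range index on doc/@file-count are all 
assumptions made for the example. The value would be written at ingest time, 
for instance from fn:count($doc//file). A URI lexicon is also assumed.

```xquery
xquery version "1.0-ml";

(: Hypothetical setup: every document root looks like <doc file-count="3">...
   and an int element-attribute range index exists on doc/@file-count. :)
let $doc-name := xs:QName("doc")
let $count-name := xs:QName("file-count")
(: the 10 highest distinct counts, read straight from the range index :)
for $n in cts:element-attribute-values(
  $doc-name, $count-name, (), ("descending", "limit=10"))
return
  (: all documents that carry exactly this count :)
  for $uri in cts:uris((), (),
    cts:element-attribute-range-query($doc-name, $count-name, "=", $n))
  return $uri || " -> " || $n
```

Note that this lists the documents for the ten highest distinct counts, so it 
can return more than ten uris when several documents share a count.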

Cheers,
Geert

From: Johan Mörén <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Saturday, June 27, 2015 at 1:23 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Find the document(s) with max occurrences 
of an element-attribute reference

Hi Christopher

I tried your approach but still without success. I think the reason might be 
that your example uses a fixed value for size ("yes"). Since frequency is 
based on the value, you get the right results in that case.
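
To make the point concrete: with varying size values, each (uri, size) tuple 
has its own frequency, so the per-document total has to be summed across 
tuples. The following is only a sketch of that summing step, not code from the 
thread. It assumes a URI lexicon, a string range index on file/@size with the 
codepoint collation, and a MarkLogic version in which cts:value-tuples returns 
each tuple as a json:array.

```xquery
xquery version "1.0-ml";

(: Assumed index: string range index on file/@size, codepoint collation. :)
let $size-ref := cts:element-attribute-reference(
  xs:QName("file"), xs:QName("size"),
  "collation=http://marklogic.com/collation/codepoint")
let $totals := map:map()
(: add each tuple's frequency to the running total for its uri :)
let $_ :=
  for $tuple in cts:value-tuples(
    (cts:uri-reference(), $size-ref), "item-frequency")
  let $uri := json:array-values($tuple)[1]
  let $so-far := (map:get($totals, $uri), 0)[1]
  return map:put($totals, $uri, $so-far + cts:frequency($tuple))
(: sort uris by their summed frequency and keep the top 10 :)
return (
  for $uri in map:keys($totals)
  order by map:get($totals, $uri) descending
  return $uri || " -> " || map:get($totals, $uri)
)[1 to 10]
```

Because this walks every tuple, it should be expected to slow down as the 
number of documents and distinct size values grows.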

Regards,
Johan



On Sat, Jun 27, 2015 at 12:34 AM Christopher Hamlin <[email protected]> wrote:
Hi Johan,

Maybe I'm not clear on what you want.

I just tried something.  I created documents in a database using

xquery version "1.0-ml";
for $i in 1 to 100
let $doc := <doc>{(1 to $i)!<file size='yes'/>}</doc>
let $uri := '/'||$i||'.xml'
return xdmp:document-insert ($uri, $doc)

so for example

/1.xml =>

<doc>
<file size="yes"/>
</doc>

and

/2.xml =>

<doc>
<file size="yes"/>
<file size="yes"/>
</doc>

and so on.

With a file/@size element-attribute range index, the query

xquery version '1.0-ml';
let $uris := cts:uri-reference()
let $ea := cts:element-attribute-reference (xs:QName ('file'),
    xs:QName ('size'),
    'collation=http://marklogic.com/collation/codepoint')
return
    for $tuple in cts:value-tuples(($uris, $ea),
        ('item-frequency','frequency-order','descending','limit=3'))
    return fn:concat ($tuple[1], ' -> ', cts:frequency ($tuple))

returns

/100.xml -> 100
/99.xml -> 99
/98.xml -> 98
/97.xml -> 97
/96.xml -> 96
/95.xml -> 95
/94.xml -> 94
/93.xml -> 93
/92.xml -> 92
/91.xml -> 91

Is this close to what you want?

Regards,

Chris

On Fri, Jun 26, 2015 at 12:41 PM, Johan Mörén <[email protected]> wrote:
> Hi Christopher!
>
> I'm not sure where you want me to use these options. I tried adding
> them to cts:value-tuples(), but that did not return the expected result.
>
> Like this:
>
> ...
> for $tuple in
>     cts:value-tuples(
>       (
>         cts:uri-reference(),
>         $sizeRef
>       ),
>       ("frequency-order","descending","limit=10")
>
>     )
> ...
>
> Regards,
> Johan
>
> On Fri, Jun 26, 2015 at 5:58 PM Christopher Hamlin <[email protected]> wrote:
>>
>> If you just want something like the top ten, I think there is possibly a
>> more direct way.
>>
>> Can you try returning frequency-order, descending, limit=10? Are those
>> options you can use?
>>
>> _______________________________________________
>> General mailing list
>> [email protected]<mailto:[email protected]>
>> Manage your subscription at:
>> http://developer.marklogic.com/mailman/listinfo/general