Yes, I added “item-frequency” to my cts:element-values() call and now all the numbers appear to be correct.
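For anyone following along, here is a minimal sketch of the difference between counting the distinct lexicon values in one bucket and summing their frequencies. The collection name and the bucket bounds are placeholders, and the prof namespace binding is an assumption about how the profiler output is stored; adjust all three to your own setup.

```xquery
xquery version "1.0-ml";

(: Assumed binding for the profiler namespace; change it if your data differs. :)
declare namespace prof = "http://marklogic.com/xdmp/profile";

(: Hypothetical collection name and bucket bounds, for illustration only. :)
let $collection := "my-profile-results"
let $lower-bound := xs:dayTimeDuration("PT0.02S")
let $upper-bound := xs:dayTimeDuration("PT0.03S")

(: The lexicon call returns each distinct value once; with "item-frequency",
   cts:frequency() on a value reports how many times that value occurs. :)
let $durations :=
  cts:element-values(xs:QName("prof:overall-elapsed"), (),
    ("descending", "item-frequency"),
    cts:collection-query($collection))

let $in-bucket := $durations[. ge $lower-bound][. lt $upper-bound]
return
  <bucket>
    <!-- number of distinct values in the bucket -->
    <distinct-values>{count($in-bucket)}</distinct-values>
    <!-- frequency-weighted total for the bucket -->
    <occurrences>{sum(for $dur in $in-bucket return cts:frequency($dur))}</occurrences>
  </bucket>
```

The first number is what count() over the lexicon result reports; the second is the per-bucket total that sums to the overall count.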
I haven’t circled back to my original issue with the ML-provided search buckets not being the right size but if I have time I’ll see if the issue was failing to specify item-frequency at some point.

Cheers,

E.

-----

On 8/23/17, 2:37 AM, "[email protected] on behalf of Geert Josten" <[email protected] on behalf of [email protected]> wrote:

Hi Eliot,

Keep in mind that you pass in item-frequency in cts:element-values, but the default for range constraints is likely fragment-frequency. Did you pass in an item-frequency facet-option in there too?

Kind regards,
Geert

On 8/22/17, 10:47 PM, "[email protected] on behalf of Eliot Kimber" <[email protected] on behalf of [email protected]> wrote:

> If I sum the counts of each bucket calculated using cts:frequency() it matches the total calculated using the initial result from the element-values() query, so I guess the 10,000 count is a side effect of some internal lexicon implementation magic.
>
> Cheers,
>
> E.
>
> --
> Eliot Kimber
> http://contrext.com
>
> On 8/22/17, 3:25 PM, "[email protected] on behalf of Eliot Kimber" <[email protected] on behalf of [email protected]> wrote:
>
>     I think this is again my weak understanding of lexicons and frequency counting.
>
>     If I change my code to sum the frequencies of the durations in each range then I get more sensible numbers, e.g.:
>
>     let $count := sum(for $dur in $durations[. lt $upper-bound][. ge $lower-bound] return cts:frequency($dur))
>
>     Having updated get-enrichment-durations() to:
>
>     cts:element-values(xs:QName("prof:overall-elapsed"), (), ("descending", "item-frequency"),
>         cts:collection-query($collection))
>
>     It still seems odd that the pure lexicon check returns exactly 10,000 *values*--that still seems suspect, but then using those 10,000 values to calculate the total frequency does result in a more likely number. I guess I can do some brute-force querying to see if it’s accurate.
>
>     Cheers,
>
>     Eliot
>     --
>     Eliot Kimber
>     http://contrext.com
>
>     On 8/22/17, 2:52 PM, "[email protected] on behalf of Eliot Kimber" <[email protected] on behalf of [email protected]> wrote:
>
>         Using ML 8.0-3.2
>
>         As part of my profiling application I run a large number of profiles, storing the profiler results back to the database. I’m then extracting the times from the profiling data to create histograms and do other analysis.
>
>         My first attempt to do this with buckets ran into the problem that the index-based buckets were not returning accurate numbers, so I reimplemented it to construct the buckets manually from a list of the actual duration values.
>
>         My code is:
>
>         let $durations as xs:dayTimeDuration* := epf:get-enrichment-durations($collection)
>         let $search-range := epf:construct-search-range()
>         let $facets :=
>             for $bucket in $search-range/search:bucket
>             let $upper-bound := if ($bucket/@lt) then xs:dayTimeDuration($bucket/@lt) else xs:dayTimeDuration("PT0S")
>             let $lower-bound := xs:dayTimeDuration($bucket/@ge)
>             let $count := count($durations[. lt $upper-bound][. ge $lower-bound])
>             return if ($count gt 0)
>                 then <search:facet-value name="bucket-{$upper-bound}" count="{$count}">{epf:format-day-time-duration($upper-bound)}</search:facet-value>
>                 else ()
>
>         The get-enrichment-durations() function does this:
>
>         cts:element-values(xs:QName("prof:overall-elapsed"), (), "descending",
>             cts:collection-query($collection))
>
>         This works nicely and seems to provide correct numbers except when the number of durations within a particular set of bounds exceeds 10,000, at which point count() returns 10,000, which is an impossible number: the chance of there being exactly 10,000 instances within a given range is basically zero. But I’m getting 10,000 twice, which is absolutely impossible.
>
>         Here are the results I get from running this in the query console:
>
>         <result>
>           <count-durations>75778</count-durations>
>           <facets>
>             <search:facet-value name="bucket-PT0.01S" count="3" xmlns:search="http://marklogic.com/appservices/search">0.01 seconds</search:facet-value>
>             <search:facet-value name="bucket-PT0.02S" count="7280" xmlns:search="http://marklogic.com/appservices/search">0.02 seconds</search:facet-value>
>             <search:facet-value name="bucket-PT0.03S" count="10000" xmlns:search="http://marklogic.com/appservices/search">0.03 seconds</search:facet-value>
>             <search:facet-value name="bucket-PT0.04S" count="10000" xmlns:search="http://marklogic.com/appservices/search">0.04 seconds</search:facet-value>
>             <search:facet-value name="bucket-PT0.05S" count="9984" xmlns:search="http://marklogic.com/appservices/search">0.05 seconds</search:facet-value>
>             …
>           </facets>
>         </result>
>
>         There are 75,778 actual duration values and the count values for the 3rd and 4th ranges are exactly 10,000.
>
>         If I change the let $count := expression to only test the upper or lower bound then I get numbers greater than 10,000. I also tried changing the order of the predicates and using a single predicate with “and”. The problem only seems to be related to using both predicates when the resulting sequence would have more than 10K items.
>
>         Is there an explanation for why count() gives me exactly 10,000 in this case?
>
>         Is there a workaround for this behavior?
>
>         The search range I’m constructing is normal ML-defined markup for defining a search range, e.g.:
>
>         <search:range type="xs:dayTimeDuration" facet="true" xmlns:search="http://marklogic.com/appservices/search">
>           <search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001 Second</search:bucket>
>           <search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002 Second</search:bucket>
>           <search:bucket ge="PT0.002S" lt="PT0.003S" name="300th">0.003 Second</search:bucket>
>           <search:bucket ge="PT0.003S" lt="PT0.004S" name="400th">0.004 Second</search:bucket>
>           <search:bucket ge="PT0.004S" lt="PT0.005S" name="500th">0.005 Second</search:bucket>
>           …
>         </search:range>
>
>         Thanks,
>
>         Eliot
>         --
>         Eliot Kimber
>         http://contrext.com
>
>         _______________________________________________
>         General mailing list
>         [email protected]
>         Manage your subscription at:
>         http://developer.marklogic.com/mailman/listinfo/general
