Hi Eliot,

Keep in mind that you pass in item-frequency in cts:element-values, but
the default for range constraints is likely fragment-frequency. Did you
pass in an item-frequency facet-option in there too?
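
For reference, a minimal sketch of where such a facet-option would sit in
the search options. The constraint name "elapsed" is hypothetical, and the
namespace URI shown for the prof: prefix is an assumption, so adjust both
to match your setup. The only line that matters here is the facet-option.

```xml
<search:options xmlns:search="http://marklogic.com/appservices/search">
  <!-- "elapsed" is a hypothetical constraint name -->
  <search:constraint name="elapsed">
    <search:range type="xs:dayTimeDuration" facet="true">
      <!-- ns is assumed to be the URI bound to the prof: prefix -->
      <search:element ns="http://marklogic.com/xdmp/profile"
                      name="overall-elapsed"/>
      <!-- ask the lexicon for item counts rather than fragment counts -->
      <search:facet-option>item-frequency</search:facet-option>
      <search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001 Second</search:bucket>
      <search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002 Second</search:bucket>
    </search:range>
  </search:constraint>
</search:options>
```

If the option is picked up, the bucket counts from the range constraint
should be comparable with the cts:frequency() sums from your
item-frequency cts:element-values call.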

Kind regards,
Geert

On 8/22/17, 10:47 PM, "[email protected] on behalf
of Eliot Kimber" <[email protected] on behalf of
[email protected]> wrote:

>If I sum the counts of each bucket calculated using cts:frequency() it
>matches the total calculated using the initial result from the
>element-values() query, so I guess the 10,000 count is a side effect of
>some internal lexicon implementation magic.
>
>Cheers,
>
>E.
>
>--
>Eliot Kimber
>http://contrext.com
> 
>
>
>On 8/22/17, 3:25 PM, "[email protected] on behalf
>of Eliot Kimber" <[email protected] on behalf of
>[email protected]> wrote:
>
>    I think this is again my weak understanding of lexicons and frequency
>counting. 
>    
>    If I change my code to sum the frequencies of the durations in each
>range then I get more sensible numbers, e.g.:
>    
>    let $count := sum(for $dur in $durations[. lt $upper-bound][. ge
>$lower-bound] return cts:frequency($dur))
>    
>    Having updated get-enrichment-durations() to:
>    
>    cts:element-values(xs:QName("prof:overall-elapsed"), (),
>("descending", "item-frequency"),
>                             cts:collection-query($collection))
>    
>    It still seems odd that the pure lexicon check returns exactly 10,000
>*values*--that still seems suspect, but then using those 10,000 values to
>calculate the total frequency does result in a more likely number. I
>guess I can do some brute-force querying to see if it's accurate.
>    
>    Cheers,
>    
>    Eliot
>    --
>    Eliot Kimber
>    http://contrext.com
>     
>    
>    
>    On 8/22/17, 2:52 PM, "[email protected] on
>behalf of Eliot Kimber" <[email protected] on
>behalf of [email protected]> wrote:
>    
>        Using ML 8.0-3.2
>        
>        As part of my profiling application I run a large number of
>profiles, storing the profiler results back to the database. I'm then
>extracting the times from the profiling data to create histograms and do
>other analysis.
>        
>        My first attempt to do this with buckets ran into the problem
>that the index-based buckets were not returning accurate numbers, so I
>reimplemented it to construct the buckets manually from a list of the
>actual duration values.
>        
>        My code is:
>        
>        let $durations as xs:dayTimeDuration* :=
>epf:get-enrichment-durations($collection)
>        let $search-range := epf:construct-search-range()
>        let $facets :=
>            for $bucket in $search-range/search:bucket
>            let $upper-bound := if ($bucket/@lt) then
>xs:dayTimeDuration($bucket/@lt) else xs:dayTimeDuration("PT0S")
>            let $lower-bound := xs:dayTimeDuration($bucket/@ge)
>            let $count := count($durations[. lt $upper-bound][. ge
>$lower-bound]) 
>            return if ($count gt 0)
>                   then <search:facet-value name="bucket-{$upper-bound}"
>count="{$count}">{epf:format-day-time-duration($upper-bound)}</search:facet-value>
>                   else ()
>        
>        The get-enrichment-durations() function does this:
>        
>          cts:element-values(xs:QName("prof:overall-elapsed"), (),
>"descending",
>                             cts:collection-query($collection))
>        
>        This works nicely and seems to provide correct numbers except
>when the number of durations within a particular set of bounds exceeds
>10,000, at which point count() returns 10,000, which is an impossible
>number: the chance of there being exactly 10,000 instances within a given
>range is basically zero. But I'm getting 10,000 twice, which is
>absolutely impossible.
>        
>        Here are the results I get from running this in the query console:
>        
>        <result>
>        <count-durations>75778</count-durations>
>        <facets>
>        <search:facet-value name="bucket-PT0.01S" count="3"
>xmlns:search="http://marklogic.com/appservices/search">0.01
>seconds</search:facet-value>
>        <search:facet-value name="bucket-PT0.02S" count="7280"
>xmlns:search="http://marklogic.com/appservices/search">0.02
>seconds</search:facet-value>
>        <search:facet-value name="bucket-PT0.03S" count="10000"
>xmlns:search="http://marklogic.com/appservices/search">0.03
>seconds</search:facet-value>
>        <search:facet-value name="bucket-PT0.04S" count="10000"
>xmlns:search="http://marklogic.com/appservices/search">0.04
>seconds</search:facet-value>
>        <search:facet-value name="bucket-PT0.05S" count="9984"
>xmlns:search="http://marklogic.com/appservices/search">0.05
>seconds</search:facet-value>
>         ...
>        </facets>
>        </result>
>        
>        There are 75,778 actual duration values, and the count values for
>the 3rd and 4th ranges are both exactly 10,000.
>        
>        If I change the let $count := expression to only test the upper
>or lower bound then I get numbers greater than 10,000. I also tried
>changing the order of the predicates and using a single predicate with
>"and". The problem only seems to be related to using both predicates when
>the resulting sequence would have more than 10K items.
>        
>        Is there an explanation for why count() gives me exactly 10,000
>in this case?
>        
>        Is there a workaround for this behavior?
>        
>        The search range I'm constructing is normal ML-defined markup for
>defining a search range, e.g.:
>        
>        <search:range type="xs:dayTimeDuration" facet="true"
>xmlns:search="http://marklogic.com/appservices/search">
>        <search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001
>Second</search:bucket>
>        <search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002
>Second</search:bucket>
>        <search:bucket ge="PT0.002S" lt="PT0.003S" name="300th">0.003
>Second</search:bucket>
>        <search:bucket ge="PT0.003S" lt="PT0.004S" name="400th">0.004
>Second</search:bucket>
>        <search:bucket ge="PT0.004S" lt="PT0.005S" name="500th">0.005
>Second</search:bucket>
>        ...
>        </search:range>
>        
>        Thanks,
>        
>        Eliot
>        --
>        Eliot Kimber
>        http://contrext.com
>         
>        
>        
>        
>        _______________________________________________
>        General mailing list
>        [email protected]
>        Manage your subscription at:
>        http://developer.marklogic.com/mailman/listinfo/general
>        