Re: [MarkLogic Dev General] Getting Impossible Value from count()--why?

Eliot Kimber Wed, 23 Aug 2017 07:35:17 -0700

Yes, I added “item-frequency” to my cts:element-values() call and now all the 
numbers appear to be correct.
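
For anyone reading this thread later, here is a minimal sketch of the difference for a single bucket. It assumes a dayTimeDuration element range index on prof:overall-elapsed. The collection name and bucket bounds are hypothetical, and the prof namespace URI is an assumption about the profiler report namespace. The lexicon call returns each distinct value once, so count() reports distinct values, while summing cts:frequency() reports occurrences.

```xquery
xquery version "1.0-ml";

(: Assumption: namespace URI of the profiler report elements. :)
declare namespace prof = "http://marklogic.com/xdmp/profile";

(: Hypothetical collection name and bucket bounds, for illustration only. :)
let $collection := "my-profiles"
let $lower-bound := xs:dayTimeDuration("PT0.02S")
let $upper-bound := xs:dayTimeDuration("PT0.03S")

(: Distinct values from the lexicon; "item-frequency" makes
   cts:frequency() count occurrences rather than fragments. :)
let $durations :=
  cts:element-values(xs:QName("prof:overall-elapsed"), (),
                     ("descending", "item-frequency"),
                     cts:collection-query($collection))
let $in-range := $durations[. ge $lower-bound][. lt $upper-bound]
return
  <bucket
    distinct-values="{count($in-range)}"
    occurrences="{sum(for $dur in $in-range return cts:frequency($dur))}"/>
```

If the durations are recorded to the microsecond (an assumption, not something confirmed in this thread), a bucket 0.01 seconds wide can hold at most 10,000 distinct values, which would be consistent with the capped counts in the original message.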


I haven’t circled back to my original issue with the ML-provided search buckets 
not being the right size, but if I have time I’ll check whether the cause was 
a missing item-frequency option.
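
For the search-bucket case, the following is a sketch of where an item-frequency facet option would go in the Search API options. This is untested against the original data. The constraint name, bucket, and element namespace URI are assumptions for illustration only.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";

(: Assumptions: a dayTimeDuration element range index exists on
   overall-elapsed, and the namespace URI below is the profiler's.
   The constraint name and the single bucket are hypothetical. :)
let $options :=
  <search:options>
    <search:constraint name="elapsed">
      <search:range type="xs:dayTimeDuration" facet="true">
        <search:element ns="http://marklogic.com/xdmp/profile"
                        name="overall-elapsed"/>
        <search:facet-option>item-frequency</search:facet-option>
        <search:bucket ge="PT0.02S" lt="PT0.03S"
                       name="bucket-PT0.03S">0.03 seconds</search:bucket>
      </search:range>
    </search:constraint>
    <search:return-results>false</search:return-results>
    <search:return-facets>true</search:return-facets>
  </search:options>
(: An empty query string matches everything; only the facets are returned. :)
return search:search("", $options)/search:facet
```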

Cheers,

E.
-----
On 8/23/17, 2:37 AM, "[email protected] on behalf of 
Geert Josten" <[email protected] on behalf of 
[email protected]> wrote:

    Hi Eliot,
    
    Keep in mind that you pass in item-frequency in cts:element-values, but
    the default for range constraints is likely fragment-frequency. Did you
    pass in an item-frequency facet-option in there too?
    
    Kind regards,
    Geert
    
    On 8/22/17, 10:47 PM, "[email protected] on behalf
    of Eliot Kimber" <[email protected] on behalf of
    [email protected]> wrote:
    
    >If I sum the counts of each bucket calculated using cts:frequency() it
    >matches the total calculated using the initial result from the
    >element-values() query, so I guess the 10,000 count is a side effect of
    >some internal lexicon implementation magic.
    >
    >Cheers,
    >
    >E.
    >
    >--
    >Eliot Kimber
    >http://contrext.com
    > 
    >
    >
    >On 8/22/17, 3:25 PM, "[email protected] on behalf
    >of Eliot Kimber" <[email protected] on behalf of
    >[email protected]> wrote:
    >
    >    I think this is again my weak understanding of lexicons and frequency
    >counting. 
    >    
    >    If I change my code to sum the frequencies of the durations in each
    >range then I get more sensible numbers, e.g.:
    >    
    >    let $count := sum(for $dur in $durations[. lt $upper-bound][. ge $lower-bound]
    >                      return cts:frequency($dur))
    >    
    >    Having updated get-enrichment-durations() to:
    >    
    >    cts:element-values(xs:QName("prof:overall-elapsed"), (),
    >("descending", "item-frequency"),
    >                             cts:collection-query($collection))
    >    
    >    It still seems odd that the pure lexicon check returns exactly 10,000
    >*values*--that still seems suspect, but then using those 10,000 values to
    >calculate the total frequency does result in a more likely number. I
    >guess I can do some brute-force querying to see if it's accurate.
    >    
    >    Cheers,
    >    
    >    Eliot
    >    --
    >    Eliot Kimber
    >    http://contrext.com
    >     
    >    
    >    
    >    On 8/22/17, 2:52 PM, "[email protected] on
    >behalf of Eliot Kimber" <[email protected] on
    >behalf of [email protected]> wrote:
    >    
    >        Using ML 8.0-3.2
    >        
    >        As part of my profiling application I run a large number of
    >profiles, storing the profiler results back to the database. I'm then
    >extracting the times from the profiling data to create histograms and do
    >other analysis.
    >        
    >        My first attempt to do this with buckets ran into the problem
    >that the index-based buckets were not returning accurate numbers, so I
    >reimplemented it to construct the buckets manually from a list of the
    >actual duration values.
    >        
    >        My code is:
    >        
    >        let $durations as xs:dayTimeDuration* :=
    >            epf:get-enrichment-durations($collection)
    >        let $search-range := epf:construct-search-range()
    >        let $facets :=
    >            for $bucket in $search-range/search:bucket
    >            let $upper-bound :=
    >                if ($bucket/@lt)
    >                then xs:dayTimeDuration($bucket/@lt)
    >                else xs:dayTimeDuration("PT0S")
    >            let $lower-bound := xs:dayTimeDuration($bucket/@ge)
    >            let $count := count($durations[. lt $upper-bound][. ge $lower-bound])
    >            return
    >                if ($count gt 0)
    >                then <search:facet-value name="bucket-{$upper-bound}" count="{$count}">{epf:format-day-time-duration($upper-bound)}</search:facet-value>
    >                else ()
    >        
    >        The get-enrichment-durations() function does this:
    >        
    >          cts:element-values(xs:QName("prof:overall-elapsed"), (),
    >                             "descending",
    >                             cts:collection-query($collection))
    >        
    >        This works nicely and seems to provide correct numbers except
    >when the number of durations within a particular set of bounds exceeds
    >10,000, at which point count() returns 10,000, which is an impossible
    >number: the chance of there being exactly 10,000 instances within a given
    >range is basically zero. But I'm getting 10,000 twice, which is
    >absolutely impossible.
    >        
    >        Here are the results I get from running this in the query console:
    >        
    >        <result>
    >        <count-durations>75778</count-durations>
    >        <facets>
    >        <search:facet-value name="bucket-PT0.01S" count="3"
    >            xmlns:search="http://marklogic.com/appservices/search">0.01 seconds</search:facet-value>
    >        <search:facet-value name="bucket-PT0.02S" count="7280"
    >            xmlns:search="http://marklogic.com/appservices/search">0.02 seconds</search:facet-value>
    >        <search:facet-value name="bucket-PT0.03S" count="10000"
    >            xmlns:search="http://marklogic.com/appservices/search">0.03 seconds</search:facet-value>
    >        <search:facet-value name="bucket-PT0.04S" count="10000"
    >            xmlns:search="http://marklogic.com/appservices/search">0.04 seconds</search:facet-value>
    >        <search:facet-value name="bucket-PT0.05S" count="9984"
    >            xmlns:search="http://marklogic.com/appservices/search">0.05 seconds</search:facet-value>
    >        ...
    >        </facets>
    >        </result>
    >        
    >        There are 75,778 actual duration values and the count values for
    >the 3rd and 4th ranges are exactly 10,000.
    >        
    >        If I change the let $count := expression to only test the upper
    >or lower bound then I get numbers greater than 10,000. I also tried
    >changing the order of the predicates and using a single predicate with
    >"and". The problem only seems to be related to using both predicates when
    >the resulting sequence would have more than 10K items.
    >        
    >        Is there an explanation for why count() gives me exactly 10,000
    >in this case?
    >        
    >        Is there a workaround for this behavior?
    >        
    >        The search range I'm constructing is normal ML-defined markup for
    >defining a search range, e.g.:
    >        
    >        <search:range type="xs:dayTimeDuration" facet="true"
    >            xmlns:search="http://marklogic.com/appservices/search">
    >        <search:bucket ge="PT0.000S" lt="PT0.001S" name="100th">0.001 Second</search:bucket>
    >        <search:bucket ge="PT0.001S" lt="PT0.002S" name="200th">0.002 Second</search:bucket>
    >        <search:bucket ge="PT0.002S" lt="PT0.003S" name="300th">0.003 Second</search:bucket>
    >        <search:bucket ge="PT0.003S" lt="PT0.004S" name="400th">0.004 Second</search:bucket>
    >        <search:bucket ge="PT0.004S" lt="PT0.005S" name="500th">0.005 Second</search:bucket>
    >        ...
    >        </search:range>
    >        
    >        Thanks,
    >        
    >        Eliot
    >        --
    >        Eliot Kimber
    >        http://contrext.com
    >         
    >        
    >        
    >        
    >        _______________________________________________
    >        General mailing list
    >        [email protected]
    >        Manage your subscription at:
    >        http://developer.marklogic.com/mailman/listinfo/general
    >        