So the key algorithmic inefficient is the inner join on the generated
series. Worst case scenario this compares every range to every date in the
series, which for m ranges and n dates yields O(m*n) operations. The
analysts in my shop currently write queries like this for billions of
records against thousands of dates and then go and take 8 hour coffee
breaks.

However, by realizing that the bounds on the ranges have a linear ordering
one can speed this up to 0(m) using windowing functions on common table
expressions.

So what I am proposing is formalizing this optimization into a class of
data types, that will hide the implementation details.



On Fri, Oct 12, 2012 at 1:48 AM, Heikki Linnakangas <hlinnakan...@vmware.com
> wrote:

> On 11.10.2012 07:37, Aaron Sheldon wrote:
>
>> This would allow for a succinct syntax to do calculations such as
>> finding the daily unique patient count given the intervals of their
>> attendance in particular programs; a computation I encounter
>> routinely as a statistician for a health services provider.
>>
>
> Hmm. It's easy to get the count of unique patients on a particular date
> with something like:
>
> select count(distinct patient) from attendance where interval &&
> '2012-10-12'::date
>
> I guess what you're after is to get that count for a range of days, in one
> query, so that the result looks something like this:
>
>    date    |  patients
> -----------+------------
> 2012-10-05 |       20
> 2012-10-06 |       24
> 2012-10-07 |       30
> 2012-10-08 |       29
>
> The way I think of that problem is that you need to join the dates you're
> interested in with the attendance table.
>
> select date, count (distinct patientid)
> from attendance
> inner join (
>   select '2012-10-04'::date + a AS date from generate_series(1,20) a
> ) dates on interval @> date
> group by date;
>     date    | count
> ------------+-------
>  2012-10-05 |    11
>  2012-10-06 |    27
>  2012-10-07 |    47
>  2012-10-08 |    63
>  2012-10-09 |    83
>  2012-10-10 |    95
>  2012-10-11 |    80
>  2012-10-12 |    60
>  2012-10-13 |    35
>  2012-10-14 |    13
> (10 rows)
>
> I created the test table for that with:
>
> create table attendance (patientid int4 , interval daterange)
> insert into attendance select id, daterange('2012-10-05'::date +
> (random()*5)::int4, '2012-10-10'::date + (random()*5)::int4) from
> generate_series(1,100) id;
>
>
> So, I think the current range types already cover that use case pretty
> well. I can't imagine how the proposed measure theoretic concepts would
> make that simpler. Can you give some more complicated problem, perhaps,
> that the proposed measure theoretic concepts would make simpler than the
> current tools?
>
> - Heikki
>



-- 
Aaron Sheldon

#67 Westedge Townhouses
5019 46 Ave, SW
Calgary AB, T3E 6R1

(h) 1.403.453.6316
(c) 1.403.710.9357
aaron.shel...@gmail.com

Reply via email to