So the key algorithmic inefficient is the inner join on the generated series. Worst case scenario this compares every range to every date in the series, which for m ranges and n dates yields O(m*n) operations. The analysts in my shop currently write queries like this for billions of records against thousands of dates and then go and take 8 hour coffee breaks.
However, by realizing that the bounds on the ranges have a linear ordering one can speed this up to 0(m) using windowing functions on common table expressions. So what I am proposing is formalizing this optimization into a class of data types, that will hide the implementation details. On Fri, Oct 12, 2012 at 1:48 AM, Heikki Linnakangas <hlinnakan...@vmware.com > wrote: > On 11.10.2012 07:37, Aaron Sheldon wrote: > >> This would allow for a succinct syntax to do calculations such as >> finding the daily unique patient count given the intervals of their >> attendance in particular programs; a computation I encounter >> routinely as a statistician for a health services provider. >> > > Hmm. It's easy to get the count of unique patients on a particular date > with something like: > > select count(distinct patient) from attendance where interval && > '2012-10-12'::date > > I guess what you're after is to get that count for a range of days, in one > query, so that the result looks something like this: > > date | patients > -----------+------------ > 2012-10-05 | 20 > 2012-10-06 | 24 > 2012-10-07 | 30 > 2012-10-08 | 29 > > The way I think of that problem is that you need to join the dates you're > interested in with the attendance table. > > select date, count (distinct patientid) > from attendance > inner join ( > select '2012-10-04'::date + a AS date from generate_series(1,20) a > ) dates on interval @> date > group by date; > date | count > ------------+------- > 2012-10-05 | 11 > 2012-10-06 | 27 > 2012-10-07 | 47 > 2012-10-08 | 63 > 2012-10-09 | 83 > 2012-10-10 | 95 > 2012-10-11 | 80 > 2012-10-12 | 60 > 2012-10-13 | 35 > 2012-10-14 | 13 > (10 rows) > > I created the test table for that with: > > create table attendance (patientid int4 , interval daterange) > insert into attendance select id, daterange('2012-10-05'::date + > (random()*5)::int4, '2012-10-10'::date + (random()*5)::int4) from > generate_series(1,100) id; > > > So, I think the current range types already cover that use case pretty > well. I can't imagine how the proposed measure theoretic concepts would > make that simpler. Can you give some more complicated problem, perhaps, > that the proposed measure theoretic concepts would make simpler than the > current tools? > > - Heikki > -- Aaron Sheldon #67 Westedge Townhouses 5019 46 Ave, SW Calgary AB, T3E 6R1 (h) 1.403.453.6316 (c) 1.403.710.9357 aaron.shel...@gmail.com