Re: Derby SQL ideas needed

2010-09-19 Thread Alexey Serba
> SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS
> activitycount, t3.windowstart AS starttime, t3.windowend AS endtime FROM
> (...) t3
Do you have a primary key in your t3 table?

> In Postgresql, what this does is to return the FIRST entire row matching each
> distinct idbucket result.
FIRST based on which sort?

Let's say you want to return the FIRST row based on the t3.windowstart
column, and that you have a primary key in the t3 table. Then I believe
your query can be rewritten in either of the following ways:

1. Using subqueries
SELECT
   bucket, primary_key, windowstart, etc
FROM
   table AS t1
WHERE
   windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
t2.bucket = t1.bucket )

2. Using joins instead of subqueries ( in case Derby doesn't support
subqueries - not sure about that )
SELECT
   t1.bucket, t1.primary_key, windowstart, etc
FROM
   table AS t1
   LEFT OUTER JOIN table AS t2 ON ( t1.bucket = t2.bucket AND
t2.windowstart > t1.windowstart )
WHERE
   t2.primary_key IS NULL
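To make the two rewrites concrete, here is a tiny self-contained sketch using SQLite through Python's stdlib (not Derby; the table, data, and primary key are made up for illustration) showing that the correlated-max subquery and the anti-join pick the same single row per bucket:

```python
import sqlite3

# Toy stand-in for the t3 resultset: one row per (bucket, window),
# with a synthetic primary key.  All names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t3 (pk INTEGER PRIMARY KEY, bucket TEXT, windowstart INTEGER);
INSERT INTO t3 (bucket, windowstart) VALUES
  ('a', 10), ('a', 30), ('a', 20),
  ('b', 5),  ('b', 7);
""")

# Rewrite 1: correlated scalar subquery picking max(windowstart) per bucket.
subquery = conn.execute("""
SELECT bucket, pk, windowstart
FROM t3 AS t1
WHERE windowstart = (SELECT max(windowstart) FROM t3 AS t2
                     WHERE t2.bucket = t1.bucket)
ORDER BY bucket
""").fetchall()

# Rewrite 2: anti-join -- keep only rows for which no "later" row in the
# same bucket exists (t2.pk IS NULL after the outer join).
antijoin = conn.execute("""
SELECT t1.bucket, t1.pk, t1.windowstart
FROM t3 AS t1
LEFT OUTER JOIN t3 AS t2
  ON t1.bucket = t2.bucket AND t2.windowstart > t1.windowstart
WHERE t2.pk IS NULL
ORDER BY t1.bucket
""").fetchall()

print(subquery)   # [('a', 2, 30), ('b', 5, 7)]
print(subquery == antijoin)
```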

HTH,
Alex

On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright daddy...@gmail.com wrote:
> Hi Folks,
>
> For two of the report queries, ACF uses the following Postgresql
> construct, which sadly seems to have no Derby equivalent:
>
> SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
> AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
> FROM (...) t3
>
> In Postgresql, what this does is to return the FIRST entire row
> matching each distinct idbucket result.  If Derby had a FIRST()
> aggregate function, it would be the equivalent of:
>
> SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS
> activitycount, FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend)
> AS endtime FROM (...) t3 GROUP BY t3.bucket
>
> Unfortunately, Derby has no such aggregate function.  Furthermore, it
> would not be ideal if I were to do the work myself in ACF, because
> this is a resultset that needs to be paged through with offset and
> length, for presentation to the user and sorting, so it gets wrapped
> in another SELECT ... FROM (...) ORDER BY ... OFFSET ... LIMIT ...
> that does that part.
>
> Does anyone have any ideas and/or Derby contacts?  I'd really like the
> quick-start example to have a functional set of reports.
>
> Karl



Re: Derby SQL ideas needed

2010-09-19 Thread Karl Wright
> FIRST based on which sort?

First based on the existing sort, which is crucial, because the sort
is by bucket ASC, activitycount DESC.  I'm looking for the row with
the highest activitycount, per bucket.
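In other words, DISTINCT ON over an already-sorted input is just "take the first row of each group." A rough Python illustration of those semantics (the rows and column layout are invented, not ManifoldCF code):

```python
from itertools import groupby

# Rows already sorted by bucket ASC, activitycount DESC -- the order the
# inner query produces.  DISTINCT ON (bucket) keeps the first row per
# bucket, i.e. the row with the highest activitycount.
rows = [
    ("a", 90, 100, 160),   # (bucket, activitycount, windowstart, windowend)
    ("a", 40, 40, 100),
    ("b", 75, 0, 60),
    ("b", 75, 60, 120),    # tie: the row earlier in sort order wins
]

first_per_bucket = [next(group) for _, group in groupby(rows, key=lambda r: r[0])]
print(first_per_bucket)   # [('a', 90, 100, 160), ('b', 75, 0, 60)]
```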

The other thing is that we cannot afford to use the same table
twice, as it is actually an extremely expensive query in its own
right, with multiple joins, select distinct's, etc. under the covers.
I'd be happy to post it but it may shock you. ;-)

Karl


Re: Derby SQL ideas needed

2010-09-19 Thread Alexey Serba
> The other thing is that we cannot afford to use the same table
> twice, as it is actually an extremely expensive query in its own
> right, with multiple joins, select distinct's, etc. under the covers.
Even if you create indexes on the bucket and activitycount columns? It
might be that the query plans for these two queries (the DISTINCT ON
hack versus the subquery max / subquery ORDER BY LIMIT / join rewrites)
would be the same.
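Whether the plans really converge is easy to check on any engine that exposes EXPLAIN. A sketch with SQLite (stdlib; schema and data invented, and Derby's optimizer will of course behave differently) that creates the index, verifies the two formulations agree, and prints the plan for inspection:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t3 (bucket TEXT, activitycount INTEGER);
INSERT INTO t3 VALUES ('a', 1), ('a', 9), ('b', 3), ('b', 2);
CREATE INDEX idx_bucket_count ON t3 (bucket, activitycount);
""")

max_groupby = """
SELECT bucket, max(activitycount) FROM t3 GROUP BY bucket ORDER BY bucket
"""
correlated = """
SELECT DISTINCT bucket, activitycount FROM t3 AS t1
WHERE activitycount = (SELECT max(activitycount) FROM t3 AS t2
                       WHERE t2.bucket = t1.bucket)
ORDER BY bucket
"""

# Both formulations return the same per-bucket maxima ...
a = conn.execute(max_groupby).fetchall()
b = conn.execute(correlated).fetchall()
print(a, a == b)

# ... and EXPLAIN QUERY PLAN shows whether the (bucket, activitycount)
# index is used (engine-specific output; inspect it, don't rely on it).
for row in conn.execute("EXPLAIN QUERY PLAN " + max_groupby):
    print(row)
```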

> I'd be happy to post it but it may shock you. ;-)
The way I indent SQL queries should say that I'm not afraid of
multipage queries :)



Re: Derby SQL ideas needed

2010-09-19 Thread Karl Wright
Here you go:

    // The query we will generate here looks like this:
    // SELECT *
    //   FROM
    //     (SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.bytecount AS bytecount,
    //                t3.windowstart AS starttime, t3.windowend AS endtime
    //        FROM (SELECT * FROM (SELECT t0.bucket AS bucket, t0.starttime AS windowstart, t0.starttime + interval AS windowend,
    //                   SUM(t1.datasize * ((case when t0.starttime + interval < t1.endtime then t0.starttime + interval else t1.endtime end) -
    //                     (case when t0.starttime > t1.starttime then t0.starttime else t1.starttime end))
    //                      / (t1.endtime - t1.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from 'bucketregexp') AS bucket, starttime FROM repohistory WHERE criteria) t0, repohistory t1
    //                   WHERE t0.bucket = substring(t1.entityid from 'bucket_regexp')
    //                      AND t1.starttime < t0.starttime + interval AND t1.endtime > t0.starttime
    //                      AND criteria on t1
    //                   GROUP BY bucket, windowstart, windowend
    //              UNION SELECT t0a.bucket AS bucket, t0a.endtime - interval AS windowstart, t0a.endtime AS windowend,
    //                   SUM(t1a.datasize * ((case when t0a.endtime < t1a.endtime then t0a.endtime else t1a.endtime end) -
    //                     (case when t0a.endtime - interval > t1a.starttime then t0a.endtime - interval else t1a.starttime end))
    //                      / (t1a.endtime - t1a.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from 'bucketregexp') AS bucket, endtime FROM repohistory WHERE criteria) t0a, repohistory t1a
    //                   WHERE t0a.bucket = substring(t1a.entityid from 'bucket_regexp')
    //                      AND t1a.starttime < t0a.endtime AND t1a.endtime > t0a.endtime - interval
    //                      AND criteria on t1a
    //                   GROUP BY bucket, windowstart, windowend) t2
    //              ORDER BY bucket ASC, bytecount DESC) t3) t4
    // ORDER BY xxx LIMIT yyy OFFSET zzz;

I have low confidence that ANY planner would be able to locate the
common part of a 2x larger query and not do it twice.

Karl



On Sun, Sep 19, 2010 at 12:05 PM, Alexey Serba ase...@gmail.com wrote:
>> The other thing is that we cannot afford to use the same table
>> twice, as it is actually an extremely expensive query in its own
>> right, with multiple joins, select distinct's, etc. under the covers.
> Even if you create indexes on bucket and activitycount columns? It
> might be that the query plans for these two queries (with distinct
> on hack and subquery max/subquery order limit/join) would be the
> same.
>
>> I'd be happy to post it but it may shock you. ;-)
> The way I indent SQL queries should say that I'm not afraid of
> multipage queries :)


Re: Derby SQL ideas needed

2010-09-19 Thread Karl Wright
Looking at your proposal:

SELECT
   bucket, primary_key, windowstart, etc
FROM
   table AS t1
WHERE
   windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
bucket = t1.bucket )

... we'd be looking actually for something more like this:


SELECT
   t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
FROM
   (xxx) t1
WHERE
   t1.bytecount=( SELECT max(t2.bytecount) FROM (xxx) t2 WHERE
t2.bucket = t1.bucket )

... although I've never seen the =(SELECT...) structure before.
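The `col = (SELECT ...)` form is a standard correlated scalar subquery. One caveat worth knowing (illustrated below with SQLite and toy data; not the real report schema) is that, unlike DISTINCT ON, it returns every row tied for the maximum, so a bucket can yield more than one row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (bucket TEXT, bytecount INTEGER, windowstart INTEGER);
-- bucket 'a' has two windows tied at bytecount 50
INSERT INTO t1 VALUES ('a', 50, 0), ('a', 50, 10), ('b', 7, 0);
""")

# The correlated scalar subquery keeps ALL rows equal to the per-bucket
# max, so the tie in bucket 'a' produces two rows, not one.
rows = conn.execute("""
SELECT bucket, bytecount, windowstart FROM t1
WHERE bytecount = (SELECT max(t2.bytecount) FROM t1 AS t2
                   WHERE t2.bucket = t1.bucket)
ORDER BY bucket, windowstart
""").fetchall()
print(rows)   # three rows: both tied 'a' windows survive, plus 'b'
```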

Karl


Re: Derby SQL ideas needed

2010-09-19 Thread Alexey Serba
You can also try ORDER BY bytecount DESC LIMIT 1 instead of the
aggregate function max, i.e.

SELECT
   t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
FROM
   (xxx) t1
WHERE
   t1.bytecount=( SELECT t2.bytecount FROM (xxx) t2 WHERE t2.bucket =
t1.bucket ORDER BY t2.bytecount DESC LIMIT 1 )
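The two spellings should be equivalent for this purpose. A quick check (SQLite sketch with made-up data; Derby's support for LIMIT-style syntax in subqueries would need verifying separately) that max() and ORDER BY ... DESC LIMIT 1 select the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (bucket TEXT, bytecount INTEGER, windowstart INTEGER);
INSERT INTO t VALUES ('a', 5, 0), ('a', 9, 10), ('b', 4, 0);
""")

# Per-bucket best row via the aggregate max() scalar subquery.
via_max = conn.execute("""
SELECT bucket, bytecount, windowstart FROM t AS t1
WHERE bytecount = (SELECT max(t2.bytecount) FROM t AS t2
                   WHERE t2.bucket = t1.bucket)
ORDER BY bucket
""").fetchall()

# Same thing via ORDER BY ... DESC LIMIT 1 inside the scalar subquery.
via_limit = conn.execute("""
SELECT bucket, bytecount, windowstart FROM t AS t1
WHERE bytecount = (SELECT t2.bytecount FROM t AS t2
                   WHERE t2.bucket = t1.bucket
                   ORDER BY t2.bytecount DESC LIMIT 1)
ORDER BY bucket
""").fetchall()

print(via_max == via_limit)   # True
```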

On Sun, Sep 19, 2010 at 9:07 PM, Karl Wright daddy...@gmail.com wrote:
> ... we'd be looking actually for something more like this:
>
> SELECT
>    t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
> FROM
>    (xxx) t1
> WHERE
>    t1.bytecount=( SELECT max(t2.bytecount) FROM (xxx) t2 WHERE
> t2.bucket = t1.bucket )

Re: Derby SQL ideas needed

2010-09-19 Thread Alexey Serba
And all of this is with only the single table repohistory, right? Is
this some kind of complex analytics/stats?

On Sun, Sep 19, 2010 at 8:48 PM, Karl Wright daddy...@gmail.com wrote:
> Here you go:
> [...]

Re: Derby SQL ideas needed

2010-09-19 Thread Karl Wright
Yes.  This is for the Max Activity and Max Bandwidth reports.
Karl

On Sun, Sep 19, 2010 at 2:13 PM, Alexey Serba ase...@gmail.com wrote:
 And all of this is only with single table repohistory, right? Is this
 some kind of complex analytics/stats?

 On Sun, Sep 19, 2010 at 8:48 PM, Karl Wright daddy...@gmail.com wrote:
 Here you go:

    // The query we will generate here looks like this:
    // SELECT *
    //   FROM
    //     (SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket,
 t3.bytecount AS bytecount,
    //                               t3.windowstart AS starttime,
 t3.windowend AS endtime
    //        FROM (SELECT * FROM (SELECT t0.bucket AS bucket,
 t0.starttime AS windowstart, t0.starttime + interval AS windowend,
    //                   SUM(t1.datasize * ((case when t0.starttime +
 interval  t1.endtime then t0.starttime + interval else t1.endtime
 end) -
    //                     (case when t0.starttimet1.starttime then
 t0.starttime else t1.starttime end))
    //                      / (t1.endtime - t1.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
 'bucketregexp') AS bucket, starttime FROM repohistory WHERE
 criteria) t0, repohistory t1
    //                   WHERE t0.bucket=substring(t1.entityid from
 'bucket_regexp')
    //                      AND t1.starttime  t0.starttime +
 interval AND t1.endtime  t0.starttime
    //                      AND criteria on t1
    //                          GROUP BY bucket,windowstart,windowend
    //              UNION SELECT t0a.bucket AS bucket, t0a.endtime -
 interval AS windowstart, t0a.endtime AS windowend,
    //                   SUM(t1a.datasize * ((case when t0a.endtime 
 t1a.endtime then t0a.endtime else t1a.endtime end) -
    //                     (case when t0a.endtime - interval 
 t1a.starttime then t0a.endtime - interval else t1a.starttime end))
    //                      / (t1a.endtime - t1a.starttime)) AS bytecount
    //                   FROM (SELECT DISTINCT substring(entityid from
 'bucketregexp') AS bucket, endtime FROM repohistory WHERE
 criteria) t0a, repohistory t1a
    //                   WHERE t0a.bucket=substring(t1a.entityid from
 'bucket_regexp')
    //                      AND (t1a.starttime  t0a.endtime AND
 t1a.endtime  t0a.endtime - interval
    //                      AND criteria on t1a
    //                          GROUP BY bucket,windowstart,windowend) t2
    //                              ORDER BY bucket ASC,bytecount
 DESC) t3) t4 ORDER BY xxx LIMIT yyy OFFSET zzz;

 I have low confidence that ANY planner would be able to locate the
 common part of a 2x larger query and not do it twice.

 Karl



 On Sun, Sep 19, 2010 at 12:05 PM, Alexey Serba ase...@gmail.com wrote:
 The other thing is that we cannot afford to use the same table
 twice, as it is actually an extremely expensive query in its own
 right, with multiple joins, select distinct's, etc. under the covers.
 Even if you create indexes on bucket and activitycount columns? It
 might be that the query plans for these two queries (with distinct
 on hack and subquery max/subquery order limit/join) would be the
 same.

 I'd be happy to post it but it may shock you. ;-)
 The way I indent SQL queries should say that I'm not afraid of
 multipage queries :)


 Karl





 On Sun, Sep 19, 2010 at 11:32 AM, Alexey Serba ase...@gmail.com wrote:
 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS 
 activitycount, t3.windowstart AS starttime, t3.windowend AS endtime FROM 
 (...) t3
 Do you have primary key in your t3 table?

 In Postgresql, what this does is to return the FIRST entire row matching 
 each distinct idbucket result.
 FIRST based on which sort?

 Lets say you want to return FIRST row based on t3.windowstart column
 and you have primary key in t3 table. Then I believe your query can be
 rewritten in the following ways:

 1. Using subqueries
 SELECT
    bucket, primary_key, windowstart, etc
 FROM
    table AS t1
 WHERE
    windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE
 bucket = t1.bucket )

 2. Using joins instead of subqueries ( in case Derby doesn't support
 subqueries - not sure about that )
 SELECT
    t1.bucket, t1.primary_key, windowstart, etc
 FROM
    table AS t1
    LEFT OUTER JOIN table AS t2 ON ( t1.bucket = t2.bucket AND
 t2.windowstart > t1.windowstart )
 WHERE
    t2.primary_key IS NULL
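 A quick way to check that the two rewrites agree is to run both against a
 small in-memory table. The sketch below uses Python's sqlite3 with a
 hypothetical t3 table (pk, bucket, activitycount, windowstart, windowend);
 the table name, columns, and sample rows are illustrative stand-ins, not
 the actual ACF schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t3 (pk INTEGER PRIMARY KEY, bucket TEXT,
                 activitycount INTEGER, windowstart INTEGER, windowend INTEGER);
INSERT INTO t3 VALUES
  (1, 'a', 10, 100, 200),
  (2, 'a', 20, 300, 400),  -- latest window for bucket 'a'
  (3, 'b',  5, 150, 250);  -- only (hence latest) row for bucket 'b'
""")

# Rewrite 1: correlated subquery picking the max windowstart per bucket.
subquery = """
SELECT bucket, pk, activitycount, windowstart, windowend
FROM t3 AS t1
WHERE windowstart = (SELECT MAX(windowstart) FROM t3 AS t2
                     WHERE t2.bucket = t1.bucket)
ORDER BY bucket
"""

# Rewrite 2: self anti-join -- keep only rows for which no later window
# exists in the same bucket (the LEFT JOIN finds nothing, so t2.pk IS NULL).
antijoin = """
SELECT t1.bucket, t1.pk, t1.activitycount, t1.windowstart, t1.windowend
FROM t3 AS t1
LEFT OUTER JOIN t3 AS t2
  ON t1.bucket = t2.bucket AND t2.windowstart > t1.windowstart
WHERE t2.pk IS NULL
ORDER BY t1.bucket
"""

rows1 = conn.execute(subquery).fetchall()
rows2 = conn.execute(antijoin).fetchall()
assert rows1 == rows2 == [('a', 2, 20, 300, 400), ('b', 3, 5, 150, 250)]
```

 Both queries keep exactly one row per bucket, so either form should be
 usable wherever the engine lacks DISTINCT ON.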

 HTH,
 Alex

 On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
 FROM (...) t3

 In Postgresql, what this does is to return the FIRST entire row
 matching each distinct idbucket result.  If Derby had a FIRST()
 aggregate function, it would be the equivalent of:

 SELECT 

Re: Derby SQL ideas needed

2010-09-18 Thread Karl Wright
For what it's worth, defining a Derby function seems like the only way
to do it.  These seem to call arbitrary java that can accept a query
as an argument and return a resultset as the result.  But in order to
write such a thing I will need the ability to call Derby at a java
level, I think, rather than through JDBC.  Still looking for a good
example from somebody who has done something similar.

Karl

On Sat, Sep 18, 2010 at 6:28 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
 FROM (...) t3

 In Postgresql, what this does is to return the FIRST entire row
 matching each distinct idbucket result.  If Derby had a FIRST()
 aggregate function, it would be the equivalent of:

 SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS
 activitycount, FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend)
 AS endtime FROM (...) t3 GROUP BY t3.bucket
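 For illustration, the per-group FIRST() semantics described above can be
 sketched outside SQL in a few lines; this is roughly the reduction ACF
 would have to perform in application code. The rows and column layout
 below are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# (bucket, activitycount, windowstart, windowend), pre-sorted by bucket --
# groupby only merges adjacent keys, so the sort order matters.
rows = [
    ('a', 20, 300, 400),
    ('a', 10, 100, 200),
    ('b',  5, 150, 250),
]

# FIRST() per group: take the first row of each bucket's run.
first_per_bucket = [next(group) for _, group in groupby(rows, key=itemgetter(0))]
assert first_per_bucket == [('a', 20, 300, 400), ('b', 5, 150, 250)]
```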

 Unfortunately, Derby has no such aggregate function.  Furthermore, it
 would not be ideal if I were to do the work myself in ACF, because
 this is a resultset that needs to be paged through with offset and
 length, for presentation to the user and sorting, so it gets wrapped
 in another SELECT ... FROM (...) ORDER BY ... OFFSET ... LIMIT ...
 that does that part.

 Does anyone have any ideas and/or Derby contacts?  I'd really like the
 quick-start example to have a functional set of reports.

 Karl



Re: Derby SQL ideas needed

2010-09-18 Thread Karl Wright
The Derby table-result function syntax requires all output columns to
be declared as part of the function definition, and more importantly
it does not seem to allow calls into Derby itself to get results.  So
this would not seem to be a viable option for that reason.

Back to square 1, I guess.  Derby doesn't seem to allow any way to
declare aggregate functions either, so I couldn't declare a FIRST()
aggregate method as proposed below.  Simple arithmetic functions seem
like they would work, but that's not helpful here.

Karl



On Sat, Sep 18, 2010 at 6:45 AM, Karl Wright daddy...@gmail.com wrote:
 For what it's worth, defining a Derby function seems like the only way
 to do it.  These seem to call arbitrary java that can accept a query
 as an argument and return a resultset as the result.  But in order to
 write such a thing I will need the ability to call Derby at a java
 level, I think, rather than through JDBC.  Still looking for a good
 example from somebody who has done something similar.

 Karl

 On Sat, Sep 18, 2010 at 6:28 AM, Karl Wright daddy...@gmail.com wrote:
 Hi Folks,

 For two of the report queries, ACF uses the following Postgresql
 construct, which sadly seems to have no Derby equivalent:

 SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
 AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
 FROM (...) t3

 In Postgresql, what this does is to return the FIRST entire row
 matching each distinct idbucket result.  If Derby had a FIRST()
 aggregate function, it would be the equivalent of:

 SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS
 activitycount, FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend)
 AS endtime FROM (...) t3 GROUP BY t3.bucket

 Unfortunately, Derby has no such aggregate function.  Furthermore, it
 would not be ideal if I were to do the work myself in ACF, because
 this is a resultset that needs to be paged through with offset and
 length, for presentation to the user and sorting, so it gets wrapped
 in another SELECT ... FROM (...) ORDER BY ... OFFSET ... LIMIT ...
 that does that part.

 Does anyone have any ideas and/or Derby contacts?  I'd really like the
 quick-start example to have a functional set of reports.

 Karl