Re: Derby SQL ideas needed
> SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS activitycount,
>   t3.windowstart AS starttime, t3.windowend AS endtime FROM (...) t3

Do you have a primary key in your t3 table?

> In Postgresql, what this does is to return the FIRST entire row
> matching each distinct idbucket result.

FIRST based on which sort? Let's say you want to return the FIRST row based on
the t3.windowstart column, and you have a primary key in the t3 table. Then I
believe your query can be rewritten in the following ways:

1. Using subqueries:

SELECT bucket, primary_key, windowstart, etc
FROM table AS t1
WHERE windowstart = (
  SELECT max(windowstart) FROM table AS t2 WHERE t2.bucket = t1.bucket
)

2. Using joins instead of subqueries (in case Derby doesn't support
subqueries - not sure about that):

SELECT t1.bucket, t1.primary_key, windowstart, etc
FROM table AS t1
LEFT OUTER JOIN table AS t2
  ON (t1.bucket = t2.bucket AND t2.windowstart > t1.windowstart)
WHERE t2.primary_key IS NULL

HTH,
Alex

On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright daddy...@gmail.com wrote:

Hi Folks,

For two of the report queries, ACF uses the following Postgresql construct,
which sadly seems to have no Derby equivalent:

SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket,
  t3.activitycount AS activitycount,
  t3.windowstart AS starttime, t3.windowend AS endtime
FROM (...) t3

In Postgresql, what this does is to return the FIRST entire row matching each
distinct idbucket result. If Derby had a FIRST() aggregate function, it would
be the equivalent of:

SELECT t3.bucket AS idbucket,
  FIRST(t3.activitycount) AS activitycount,
  FIRST(t3.windowstart) AS starttime,
  FIRST(t3.windowend) AS endtime
FROM (...) t3
GROUP BY t3.bucket

Unfortunately, Derby has no such aggregate function. Furthermore, it would not
be ideal if I were to do the work myself in ACF, because this is a resultset
that needs to be paged through with offset and length, for presentation to the
user and sorting, so it gets wrapped in another SELECT ... FROM (...)
ORDER BY ... OFFSET ... LIMIT ... that does that part.

Does anyone have any ideas and/or Derby contacts? I'd really like the
quick-start example to have a functional set of reports.

Karl
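Alexey's option 1 is easy to try on a throwaway table. The sketch below uses Python's bundled sqlite3 purely as a stand-in engine (the SQL is plain ANSI; table, column names, and data are all invented for illustration): for each bucket it keeps the row whose windowstart equals the per-bucket maximum, which is what DISTINCT ON would return under an ORDER BY bucket, windowstart DESC.

```python
import sqlite3

# Invented stand-in for the (...) t3 derived table; sqlite3 is used only
# because it ships with Python.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t3 (id INTEGER PRIMARY KEY, bucket TEXT,
                 activitycount INTEGER, windowstart INTEGER, windowend INTEGER);
INSERT INTO t3 (bucket, activitycount, windowstart, windowend) VALUES
  ('a', 5, 10, 20), ('a', 9, 20, 30), ('b', 2, 10, 20), ('b', 7, 30, 40);
""")

# Option 1: correlated MAX subquery -- one row per bucket, the row whose
# windowstart is the per-bucket maximum.
rows = conn.execute("""
SELECT t1.bucket, t1.activitycount, t1.windowstart, t1.windowend
FROM t3 AS t1
WHERE t1.windowstart = (SELECT MAX(t2.windowstart)
                        FROM t3 AS t2 WHERE t2.bucket = t1.bucket)
ORDER BY t1.bucket
""").fetchall()
print(rows)  # [('a', 9, 20, 30), ('b', 7, 30, 40)]
```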
Re: Derby SQL ideas needed
On Sun, Sep 19, 2010 at 11:32 AM, Alexey Serba ase...@gmail.com wrote:

> FIRST based on which sort?

First based on the existing sort, which is crucial, because the sort is by
bucket ASC, activitycount DESC. I'm looking for the row with the highest
activitycount, per bucket.

The other thing is that we cannot afford to use the same table twice, as it is
actually an extremely expensive query in its own right, with multiple joins,
select distinct's, etc. under the covers. I'd be happy to post it but it may
shock you. ;-)

Karl
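Alexey's option 2 (the LEFT OUTER JOIN anti-join) can be adapted to the sort Karl describes, keeping per bucket the row with the highest activitycount. A minimal sketch on invented data, with sqlite3 standing in for Derby:

```python
import sqlite3

# Invented schema/data; the pattern itself is engine-agnostic SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t3 (id INTEGER PRIMARY KEY, bucket TEXT,
                 activitycount INTEGER, windowstart INTEGER, windowend INTEGER);
INSERT INTO t3 (bucket, activitycount, windowstart, windowend) VALUES
  ('a', 5, 10, 20), ('a', 9, 20, 30), ('b', 2, 10, 20), ('b', 7, 30, 40);
""")

# A row survives only if no other row in the same bucket has a larger
# activitycount: the join finds such a row, and IS NULL keeps the maxima.
# With ties on activitycount this would return several rows per bucket, so
# a real version would also compare the primary key as a tiebreaker.
rows = conn.execute("""
SELECT t1.bucket, t1.activitycount, t1.windowstart, t1.windowend
FROM t3 AS t1
LEFT OUTER JOIN t3 AS t2
  ON t1.bucket = t2.bucket AND t2.activitycount > t1.activitycount
WHERE t2.id IS NULL
ORDER BY t1.bucket
""").fetchall()
print(rows)  # [('a', 9, 20, 30), ('b', 7, 30, 40)]
```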
Re: Derby SQL ideas needed
> The other thing is that we cannot afford to use the same table twice, as it
> is actually an extremely expensive query in its own right, with multiple
> joins, select distinct's, etc. under the covers.

Even if you create indexes on the bucket and activitycount columns? It might
be that the query plans for these two queries (with the distinct-on hack and
the subquery max / subquery order-limit / join variants) would be the same.

> I'd be happy to post it but it may shock you. ;-)

The way I indent SQL queries should say that I'm not afraid of multipage
queries :)
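Alexey's index suggestion can be sketched on the same toy shape: with a composite index on (bucket, activitycount), the correlated MAX subquery can be answered largely from the index. sqlite3 stands in for Derby here, and the EXPLAIN QUERY PLAN output is printed only for inspection; names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t3 (id INTEGER PRIMARY KEY, bucket TEXT, activitycount INTEGER);
CREATE INDEX t3_bucket_act ON t3 (bucket, activitycount);
INSERT INTO t3 (bucket, activitycount) VALUES ('a', 5), ('a', 9), ('b', 7);
""")

query = """
SELECT t1.bucket, t1.activitycount FROM t3 AS t1
WHERE t1.activitycount = (SELECT MAX(t2.activitycount)
                          FROM t3 AS t2 WHERE t2.bucket = t1.bucket)
ORDER BY t1.bucket
"""
# Plan inspection: the subquery should be able to use t3_bucket_act
# (exact plan text varies by engine and version, so it is only printed).
for step in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(step)
rows = conn.execute(query).fetchall()
print(rows)  # [('a', 9), ('b', 7)]
```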
Re: Derby SQL ideas needed
Here you go:

// The query we will generate here looks like this:
//
// SELECT *
// FROM
//   (SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.bytecount AS bytecount,
//       t3.windowstart AS starttime, t3.windowend AS endtime
//   FROM (SELECT * FROM
//     (SELECT t0.bucket AS bucket, t0.starttime AS windowstart, t0.starttime + interval AS windowend,
//         SUM(t1.datasize * ((case when t0.starttime + interval < t1.endtime then t0.starttime + interval else t1.endtime end) -
//           (case when t0.starttime > t1.starttime then t0.starttime else t1.starttime end))
//           / (t1.endtime - t1.starttime)) AS bytecount
//     FROM (SELECT DISTINCT substring(entityid from 'bucketregexp') AS bucket, starttime FROM repohistory WHERE criteria) t0, repohistory t1
//     WHERE t0.bucket = substring(t1.entityid from 'bucket_regexp')
//       AND t1.starttime < t0.starttime + interval AND t1.endtime > t0.starttime
//       AND criteria on t1
//     GROUP BY bucket, windowstart, windowend
//     UNION SELECT t0a.bucket AS bucket, t0a.endtime - interval AS windowstart, t0a.endtime AS windowend,
//         SUM(t1a.datasize * ((case when t0a.endtime < t1a.endtime then t0a.endtime else t1a.endtime end) -
//           (case when t0a.endtime - interval > t1a.starttime then t0a.endtime - interval else t1a.starttime end))
//           / (t1a.endtime - t1a.starttime)) AS bytecount
//     FROM (SELECT DISTINCT substring(entityid from 'bucketregexp') AS bucket, endtime FROM repohistory WHERE criteria) t0a, repohistory t1a
//     WHERE t0a.bucket = substring(t1a.entityid from 'bucket_regexp')
//       AND t1a.starttime < t0a.endtime AND t1a.endtime > t0a.endtime - interval
//       AND criteria on t1a
//     GROUP BY bucket, windowstart, windowend) t2
//   ORDER BY bucket ASC, bytecount DESC) t3) t4
// ORDER BY xxx LIMIT yyy OFFSET zzz;

I have low confidence that ANY planner would be able to locate the common part
of a 2x larger query and not do it twice.

Karl

On Sun, Sep 19, 2010 at 12:05 PM, Alexey Serba ase...@gmail.com wrote:
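The heart of the query above is the SUM term, which charges each repohistory row's datasize to a time window in proportion to how much of [starttime, endtime) overlaps the window: datasize * (min(window_end, endtime) - max(window_start, starttime)) / (endtime - starttime). A toy illustration of just that term, with hypothetical numbers and sqlite3 as the stand-in engine:

```python
import sqlite3

# Two invented events: one covering [0, 10) with 100 bytes, one covering
# [5, 15) with 50 bytes.  The window [5, 10) should be charged half of each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repohistory (datasize REAL, starttime INTEGER, endtime INTEGER);
INSERT INTO repohistory VALUES (100.0, 0, 10), (50.0, 5, 15);
""")

wstart, wend = 5, 10  # one window: [5, 10)
(bytecount,) = conn.execute("""
SELECT SUM(datasize *
           ((CASE WHEN ? < endtime THEN ? ELSE endtime END) -
            (CASE WHEN ? > starttime THEN ? ELSE starttime END))
           / (endtime - starttime))
FROM repohistory
WHERE starttime < ? AND endtime > ?
""", (wend, wend, wstart, wstart, wend, wstart)).fetchone()
print(bytecount)  # 100*(10-5)/10 + 50*(10-5)/10 = 50 + 25 = 75.0
```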
Re: Derby SQL ideas needed
Looking at your proposal:

> SELECT bucket, primary_key, windowstart, etc
> FROM table AS t1
> WHERE windowstart=( SELECT max(windowstart) FROM table AS t2 WHERE bucket = t1.bucket )

... we'd actually be looking for something more like this:

SELECT t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
FROM (xxx) t1
WHERE t1.bytecount = (
  SELECT max(t2.bytecount) FROM (xxx) t2 WHERE t2.bucket = t1.bucket
)

... although I've never seen the =(SELECT...) structure before.

Karl

On Sun, Sep 19, 2010 at 12:48 PM, Karl Wright daddy...@gmail.com wrote:
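The =(SELECT...) form is a standard scalar subquery: the inner SELECT yields one value per outer row, and that value is compared with =. A sketch of the shape Karl describes, where the (xxx) derived table has to be spelled out twice, once as t1 and once inside the subquery (which is exactly why the base query would run twice). Schema and data are invented; sqlite3 stands in for Derby:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (bucket TEXT, windowstart INTEGER, datasize INTEGER);
INSERT INTO events VALUES ('a', 1, 3), ('a', 1, 2), ('a', 2, 4), ('b', 1, 10);
""")

# Stand-in for the expensive (xxx) derived table: an aggregate per
# (bucket, windowstart).
derived = """SELECT bucket, windowstart, SUM(datasize) AS bytecount
             FROM events GROUP BY bucket, windowstart"""

# Correlated scalar subquery: keep the derived row whose bytecount equals
# the per-bucket maximum.  Note (xxx) appears twice in the statement.
rows = conn.execute(f"""
SELECT t1.bucket, t1.windowstart, t1.bytecount
FROM ({derived}) t1
WHERE t1.bytecount = (SELECT MAX(t2.bytecount)
                      FROM ({derived}) t2
                      WHERE t2.bucket = t1.bucket)
ORDER BY t1.bucket
""").fetchall()
print(rows)  # [('a', 1, 5), ('b', 1, 10)]
```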
Re: Derby SQL ideas needed
You can also try ORDER BY bytecount DESC LIMIT 1 instead of the aggregate
function max, i.e.

SELECT t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
FROM (xxx) t1
WHERE t1.bytecount = (
  SELECT t2.bytecount FROM (xxx) t2 WHERE t2.bucket = t1.bucket
  ORDER BY t2.bytecount DESC LIMIT 1
)

On Sun, Sep 19, 2010 at 9:07 PM, Karl Wright daddy...@gmail.com wrote:
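Both spellings of the scalar subquery, MAX() and ORDER BY ... DESC LIMIT 1, can be checked side by side on invented data; in sqlite3 (the stand-in engine here) they select the same rows. One caveat to note: recent Derby versions, I believe, spell the row-limit clause FETCH FIRST 1 ROWS ONLY rather than LIMIT 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t3 (bucket TEXT, bytecount INTEGER, windowstart INTEGER, windowend INTEGER);
INSERT INTO t3 VALUES ('a', 5, 10, 20), ('a', 9, 20, 30), ('b', 7, 30, 40);
""")

# Variant 1: aggregate MAX in the scalar subquery.
q_max = """
SELECT t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
FROM t3 AS t1
WHERE t1.bytecount = (SELECT MAX(t2.bytecount) FROM t3 AS t2
                      WHERE t2.bucket = t1.bucket)
ORDER BY t1.bucket"""

# Variant 2: ORDER BY ... DESC LIMIT 1 in the scalar subquery.
q_limit = """
SELECT t1.bucket, t1.bytecount, t1.windowstart, t1.windowend
FROM t3 AS t1
WHERE t1.bytecount = (SELECT t2.bytecount FROM t3 AS t2
                      WHERE t2.bucket = t1.bucket
                      ORDER BY t2.bytecount DESC LIMIT 1)
ORDER BY t1.bucket"""

rows_max = conn.execute(q_max).fetchall()
rows_limit = conn.execute(q_limit).fetchall()
print(rows_limit)  # [('a', 9, 20, 30), ('b', 7, 30, 40)]
```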
Re: Derby SQL ideas needed
And all of this is only with the single table repohistory, right? Is this some
kind of complex analytics/stats?

On Sun, Sep 19, 2010 at 8:48 PM, Karl Wright daddy...@gmail.com wrote:
Re: Derby SQL ideas needed
Yes. This is for the Max Activity and Max Bandwidth reports.

Karl

On Sun, Sep 19, 2010 at 2:13 PM, Alexey Serba ase...@gmail.com wrote:

> And all of this is only with single table repohistory, right? Is this
> some kind of complex analytics/stats?
Re: Derby SQL ideas needed
For what it's worth, defining a Derby function seems like the only way to do
it. These seem to call arbitrary java that can accept a query as an argument
and return a resultset as the result. But in order to write such a thing I
will need the ability to call Derby at a java level, I think, rather than
through JDBC.

Still looking for a good example from somebody who has done something similar.

Karl

On Sat, Sep 18, 2010 at 6:28 AM, Karl Wright daddy...@gmail.com wrote:
Re: Derby SQL ideas needed
The Derby table-result function syntax requires all output columns to be
declared as part of the function definition, and more importantly it does not
seem to allow calls into Derby itself to get results. So this would not seem
to be a viable option, for that reason. Back to square 1, I guess.

Derby doesn't seem to allow any way to declare aggregate functions either, so
I couldn't declare a FIRST() aggregate method as proposed below. Simple
arithmetic functions seem like they would work, but that's not helpful here.

Karl

On Sat, Sep 18, 2010 at 6:45 AM, Karl Wright daddy...@gmail.com wrote:
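The paging wrapper described in the thread, the dedup query as a derived table with ORDER BY / offset / length applied outside, can be sketched end to end. Names and data are invented, sqlite3 is the stand-in engine, and its LIMIT/OFFSET corresponds to what Derby would, I believe, write as OFFSET n ROWS FETCH NEXT m ROWS ONLY:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t3 (id INTEGER PRIMARY KEY, bucket TEXT, activitycount INTEGER);
INSERT INTO t3 (bucket, activitycount) VALUES
  ('a', 5), ('a', 9), ('b', 7), ('c', 3), ('c', 8);
""")

# Inner query: one row per bucket (per-bucket maximum activitycount).
# Outer query: sort for presentation and page through with LIMIT/OFFSET.
page = conn.execute("""
SELECT idbucket, activitycount FROM (
  SELECT t1.bucket AS idbucket, t1.activitycount AS activitycount
  FROM t3 AS t1
  WHERE t1.activitycount = (SELECT MAX(t2.activitycount)
                            FROM t3 AS t2 WHERE t2.bucket = t1.bucket)
) t4
ORDER BY activitycount DESC LIMIT 2 OFFSET 1
""").fetchall()
print(page)  # maxima are a:9, c:8, b:7; page of size 2 at offset 1
```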