Yes.  This is for the Max Activity and Max Bandwidth reports.

Karl
On Sun, Sep 19, 2010 at 2:13 PM, Alexey Serba <ase...@gmail.com> wrote:
> And all of this is only with the single table repohistory, right? Is this
> some kind of complex analytics/stats?
>
> On Sun, Sep 19, 2010 at 8:48 PM, Karl Wright <daddy...@gmail.com> wrote:
>> Here you go:
>>
>> // The query we will generate here looks like this:
>> // SELECT *
>> // FROM
>> //   (SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.bytecount AS bytecount,
>> //       t3.windowstart AS starttime, t3.windowend AS endtime
>> //   FROM (SELECT * FROM
>> //     (SELECT t0.bucket AS bucket, t0.starttime AS windowstart, t0.starttime + <interval> AS windowend,
>> //         SUM(t1.datasize * ((case when t0.starttime + <interval> < t1.endtime then t0.starttime + <interval> else t1.endtime end) -
>> //           (case when t0.starttime > t1.starttime then t0.starttime else t1.starttime end))
>> //           / (t1.endtime - t1.starttime)) AS bytecount
>> //       FROM (SELECT DISTINCT substring(entityid from '<bucketregexp>') AS bucket, starttime FROM repohistory WHERE <criteria>) t0, repohistory t1
>> //       WHERE t0.bucket=substring(t1.entityid from '<bucket_regexp>')
>> //         AND t1.starttime < t0.starttime + <interval> AND t1.endtime > t0.starttime
>> //         AND <criteria on t1>
>> //       GROUP BY bucket,windowstart,windowend
>> //     UNION
>> //     SELECT t0a.bucket AS bucket, t0a.endtime - <interval> AS windowstart, t0a.endtime AS windowend,
>> //         SUM(t1a.datasize * ((case when t0a.endtime < t1a.endtime then t0a.endtime else t1a.endtime end) -
>> //           (case when t0a.endtime - <interval> > t1a.starttime then t0a.endtime - <interval> else t1a.starttime end))
>> //           / (t1a.endtime - t1a.starttime)) AS bytecount
>> //       FROM (SELECT DISTINCT substring(entityid from '<bucketregexp>') AS bucket, endtime FROM repohistory WHERE <criteria>) t0a, repohistory t1a
>> //       WHERE t0a.bucket=substring(t1a.entityid from '<bucket_regexp>')
>> //         AND t1a.starttime < t0a.endtime AND t1a.endtime > t0a.endtime - <interval>
>> //         AND <criteria on t1a>
>> //       GROUP BY bucket,windowstart,windowend) t2
>> //   ORDER BY bucket ASC, bytecount DESC) t3) t4
>> // ORDER BY xxx LIMIT yyy OFFSET zzz;
>>
>> I have low confidence that ANY planner would be able to locate the
>> common part of a 2x-larger query and not do it twice.
>>
>> Karl
>>
>> On Sun, Sep 19, 2010 at 12:05 PM, Alexey Serba <ase...@gmail.com> wrote:
>>>> The other thing is that we cannot afford to use the same "table"
>>>> twice, as it is actually an extremely expensive query in its own
>>>> right, with multiple joins, select distinct's, etc. under the covers.
>>> Even if you create indexes on the bucket and activitycount columns? It
>>> might be that the query plans for these two queries (with the "distinct
>>> on" hack and subquery max / subquery order limit / join) would be the
>>> same.
>>>
>>>> I'd be happy to post it but it may shock you. ;-)
>>> The way I indent SQL queries should say that I'm not afraid of
>>> multipage queries :)
>>>
>>>> Karl
>>>>
>>>> On Sun, Sep 19, 2010 at 11:32 AM, Alexey Serba <ase...@gmail.com> wrote:
>>>>>> SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount AS
>>>>>> activitycount, t3.windowstart AS starttime, t3.windowend AS endtime FROM
>>>>>> (...) t3
>>>>> Do you have a primary key in your t3 table?
>>>>>
>>>>>> In Postgresql, what this does is to return the FIRST entire row matching
>>>>>> each distinct idbucket result.
>>>>> FIRST based on which sort?
>>>>>
>>>>> Let's say you want to return the FIRST row based on the t3.windowstart
>>>>> column, and you have a primary key in the t3 table. Then I believe your
>>>>> query can be rewritten in the following ways:
>>>>>
>>>>> 1. Using subqueries:
>>>>>
>>>>> SELECT
>>>>>   bucket, primary_key, windowstart, etc
>>>>> FROM
>>>>>   table AS t1
>>>>> WHERE
>>>>>   windowstart = ( SELECT max(windowstart) FROM table AS t2 WHERE
>>>>>     bucket = t1.bucket )
>>>>>
>>>>> 2. Using joins instead of subqueries (in case Derby doesn't support
>>>>> subqueries - not sure about that):
>>>>>
>>>>> SELECT
>>>>>   t1.bucket, t1.primary_key, windowstart, etc
>>>>> FROM
>>>>>   table AS t1
>>>>>   LEFT OUTER JOIN table AS t2 ON ( t1.bucket = t2.bucket AND
>>>>>     t2.windowstart > t1.windowstart )
>>>>> WHERE
>>>>>   t2.primary_key IS NULL
>>>>>
>>>>> HTH,
>>>>> Alex
>>>>>
>>>>> On Sat, Sep 18, 2010 at 2:28 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>> Hi Folks,
>>>>>>
>>>>>> For two of the report queries, ACF uses the following Postgresql
>>>>>> construct, which sadly seems to have no Derby equivalent:
>>>>>>
>>>>>> SELECT DISTINCT ON (idbucket) t3.bucket AS idbucket, t3.activitycount
>>>>>> AS activitycount, t3.windowstart AS starttime, t3.windowend AS endtime
>>>>>> FROM (...) t3
>>>>>>
>>>>>> In Postgresql, what this does is to return the FIRST entire row
>>>>>> matching each distinct idbucket result. If Derby had a "FIRST()"
>>>>>> aggregate function, it would be the equivalent of:
>>>>>>
>>>>>> SELECT t3.bucket AS idbucket, FIRST(t3.activitycount) AS
>>>>>> activitycount, FIRST(t3.windowstart) AS starttime, FIRST(t3.windowend)
>>>>>> AS endtime FROM (...) t3 GROUP BY t3.bucket
>>>>>>
>>>>>> Unfortunately, Derby has no such aggregate function. Furthermore, it
>>>>>> would not be ideal if I were to do the work myself in ACF, because
>>>>>> this is a resultset that needs to be paged through with offset and
>>>>>> length, for presentation to the user and sorting, so it gets wrapped
>>>>>> in another SELECT ... FROM (...) ORDER BY ... OFFSET ... LIMIT ...
>>>>>> that does that part.
>>>>>>
>>>>>> Does anyone have any ideas and/or Derby contacts? I'd really like the
>>>>>> quick-start example to have a functional set of reports.
>>>>>>
>>>>>> Karl
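For readers following the thread later: the two rewrites Alexey proposes can be sanity-checked with Python's sqlite3 module, since SQLite, like Derby, has no DISTINCT ON. This is only a sketch, not the ACF code: the table t3 and its columns (bucket, activitycount, windowstart, windowend, plus a synthetic id primary key) are hypothetical stand-ins for the subquery result discussed above, and "FIRST" is taken to mean the row with the highest activitycount per bucket.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical stand-in for the t3 subquery result in the thread.
cur.execute("""CREATE TABLE t3 (
    id INTEGER PRIMARY KEY,
    bucket TEXT,
    activitycount INTEGER,
    windowstart INTEGER,
    windowend INTEGER)""")
cur.executemany(
    "INSERT INTO t3 (bucket, activitycount, windowstart, windowend) "
    "VALUES (?, ?, ?, ?)",
    [("a", 10, 100, 200),
     ("a", 30, 300, 400),
     ("b", 20, 150, 250),
     ("b",  5,  50, 150)])

# Rewrite 1: correlated subquery. Keep the row whose activitycount
# equals the per-bucket maximum.
subquery = """
SELECT bucket, activitycount, windowstart, windowend
FROM t3 AS t1
WHERE activitycount = (SELECT MAX(activitycount) FROM t3 AS t2
                       WHERE t2.bucket = t1.bucket)
ORDER BY bucket"""

# Rewrite 2: self-join anti-join. Keep rows for which no other row
# in the same bucket has a larger activitycount (the join finds a
# "better" row; WHERE ... IS NULL means none exists).
antijoin = """
SELECT t1.bucket, t1.activitycount, t1.windowstart, t1.windowend
FROM t3 AS t1
LEFT OUTER JOIN t3 AS t2
  ON t1.bucket = t2.bucket AND t2.activitycount > t1.activitycount
WHERE t2.id IS NULL
ORDER BY t1.bucket"""

r1 = cur.execute(subquery).fetchall()
r2 = cur.execute(antijoin).fetchall()
print(r1)  # one row per bucket, the max-activitycount row
print(r2)  # same result via the anti-join
```

One caveat: both rewrites return duplicate rows for a bucket when two rows tie on the sort column, whereas Postgresql's DISTINCT ON always returns exactly one row per group; full equivalence would need a tie-breaker on the primary key.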