Re: [jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-08-27 Thread Bryan Pendleton
Perhaps you could open the appropriate JIRA issues (to add documentation;
to add additional data types; etc.) as sub-tasks of this JIRA issue, to
help us keep track of it all?


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-08-26 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142950#comment-16142950
 ] 

Bryan Pendleton commented on DERBY-6942:


I think it would be fine to add additional data types separately, over time.

I also think it would be fine to add documentation separately, over time.

I agree that test cases for the basic functionality are necessary before we can 
commit the initial work.

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6942_2.diff, DERBY-6942.diff, z12.txt, z13.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-08-25 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141740#comment-16141740
 ] 

Harshvardhan Gupta commented on DERBY-6942:
---

Bryan,

I am summarizing the remaining work required on releasing the changes done as 
part of Derby-6940 and Derby-6942.

Test cases are required for addition of new statistics and the integration of 
integer data type with the additional new stats. Also the patch doesn't contain 
documentation changes that need to be done for each of the above new 
functionalities.

Also, what do you think of integrating additional data types such as varchar 
etc and releasing all the changes together or doing it iteratively over time? 

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6942_2.diff, DERBY-6942.diff, z12.txt, z13.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-07-15 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088698#comment-16088698
 ] 

Bryan Pendleton commented on DERBY-6942:


Can you attach any of the output of your testing? For example, can you capture 
a terminal session with "ij" showing the queries you ran, and the results of 
the changed behavior on the various queries with the new selectivity estimation?

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6942_2.diff, DERBY-6942.diff, z12.txt, z13.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-07-11 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082789#comment-16082789
 ] 

Harshvardhan Gupta commented on DERBY-6942:
---

During the local tests I did, I tested the selectivity outputs for relational 
operators of the following types : (>=, <=, > , <) and integer data type with 
the latest patch.

Regarding the examples Bryan mentioned, I am not quite sure about this but I 
believe 'between' operator is internally transformed into one of these 
relational ops, in that case the latest patch is good for 'between' op also.

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6942_2.diff, DERBY-6942.diff, z12.txt, z13.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-07-08 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079341#comment-16079341
 ] 

Bryan Pendleton commented on DERBY-6942:


This seems like a reasonable approach to me.

As I'm understanding your idea, it might be something like this (I don't have 
the syntax quite right, just trying to understand your idea):

{code:java}
float estimateProportion( Object queryStart, Object queryEnd, Object tableMin, 
Object tableMax )
{code}
or
{code:java}
float estimateProportion(  queryStart,  queryEnd,  tableMin,  
tableMax )
{code}
as a generic method.

where queryStart and queryEnd come from the WHERE clause of my SELECT, and 
tableMin and tableMax come from the statistics.

and the estimateProportion method would return a value between 0 and 1.


So, for example,

{code:java}
SELECT * FROM Flights WHERE flight_id BETWEEN 'AA' AND 'AA1112'
{code}

or

{code:java}
SELECT * FROM FlightAvailability WHERE business_seats_taken BETWEEN 0 AND 10
{code}

In the above examples, queryStart/queryEnd would be  'AA'/'AA1112' in the 
first case, and 0/10 in the second.

In the case of an integer, it might be implemented, more or less, as:
{code:java}
return (float)(queryEnd-queryStart)/(float)(tableMax-tableMin);
{code}
but as you say, data types like string or date would have more complex analysis 
to do.

It seems like this might fit pretty well into the DataValueDescriptor group of 
Java classes in Derby.

Have a look at the code in 
java/engine/org/apache/derby/iapi/types/*DataValue*.java, and see what you 
think?


> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6942.diff, z12.txt, z13.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-07-08 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079256#comment-16079256
 ] 

Harshvardhan Gupta commented on DERBY-6942:
---

I experimented with integrating max and min values similar to null values. 
Unlike a straightforward null count, the primary challenge associated with this 
problem is that we are going to need to define a distance metric with each of 
the sql types in derby.

For each example we need a procedure associate with varchar datatype which will 
tell derby that a string starting with 'L' is approximately equally far away 
from string startings with 'A' and 'Z'.(and thus give away a selectivity of 50% 
instead of relying on hardwired estimates)

For some data types this is going to be quite straightforward like for int, 
long etc. Nevertheless, extensive testing is going to be required for each of 
the data type which is integrated in the above process.

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6942.diff, z12.txt, z13.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-07-01 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071413#comment-16071413
 ] 

Harshvardhan Gupta commented on DERBY-6942:
---

Yes, the cardinality and cost estimates are both going to be more accurate, 
typically in case of loop joins cost is estimated as (number of rows in outer 
tables * cost for single row match in inner table).

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6942.diff, z12.txt, z13.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-07-01 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071408#comment-16071408
 ] 

Bryan Pendleton commented on DERBY-6942:


Let me see if I get that, too: the cost of the table scan is the same in either 
case, because we have to read every page of the table and evaluate the 
predicate against each row.

However, the number of rows in the result depends very heavily on the accuracy 
of the selectivity estimate. So if the results of the table scan go on to be 
used in a join with some other table, then the cost of the join will depend on 
the number of rows we read from the base table, and we can now have a more 
accurate estimate of that.

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6942.diff, z12.txt, z13.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-07-01 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071405#comment-16071405
 ] 

Harshvardhan Gupta commented on DERBY-6942:
---

Let me emphasize on this improvement in case of multiple joins: Although in the 
above example my changes didn't change the cost (it shouldn't since during 
execution by that conglomerate needs to do the same work), improvement in 
cardinality estimates will help us avoid cascading errors during cost 
estimation of joins.

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6942.diff, z12.txt, z13.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-07-01 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071404#comment-16071404
 ] 

Harshvardhan Gupta commented on DERBY-6942:
---

Bryan, You are on point!

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6942.diff, z12.txt, z13.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-07-01 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071403#comment-16071403
 ] 

Bryan Pendleton commented on DERBY-6942:


Let me see if I understand your conclusions so far: comparing z12.txt, and 
z13.txt, when the optimizer is considering the conglomerateNumber 1168, which 
is your base table ("abcd"), the optimizer has the qualifier "id is null" to 
consider. It reports that as an "extra qualifier", and applies a selectivity 
estimate to make a guess as to how many rows from the base table will match 
that predicate.

Without your new statistics, we have z12.txt, where the optimizer reports:

Number of extra qualifiers is : 1, extra qualifier selectivity is : 0.1

With your new statistics, we have z13.txt, where the optimizer reports:

Number of extra qualifiers is : 1, extra qualifier selectivity is : 0.5

Your point is that, in the first case, the 0.1 is a hard-coded selectivity 
estimate which may or may not be accurate at all, while in the second case, 
where we have 0.5, this comes from a real observation of the data which counted 
the number of rows which had a null value for "id" compared to the total number 
of rows in the table.

Am I understanding correctly? That's neat!


> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6942.diff, z12.txt, z13.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-07-01 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071337#comment-16071337
 ] 

Harshvardhan Gupta commented on DERBY-6942:
---

In code, I am trying to query the statistics of those column which are backed 
by indexes. (and thus their statistics exist). 

I'm struck on this problem: How do we take a predicate and find out the 
relevant conglomerate from the list of all conglomerates associated with that 
table. From a predicate we can find out the ColumnReference which holds the 
Column Name, so I am trying to use this info to find out the relevant info. I 
would appreciate help on this blocker.

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: z12.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-06-28 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067124#comment-16067124
 ] 

Harshvardhan Gupta commented on DERBY-6942:
---

In particular, when estimating cost for a particular conglomerate, it would be 
good if we start looking at other conglomerates' statistics for cardinality 
estimates. For example during table scan, we can start adjusting cardinality 
estimates using the statistics for index.

Note that cost and cardinality are two different concepts, cost will always be 
determined by the number and size of pages / rows that a particular access path 
has to read to execute its work. Table scan's cost will remain unaffected here 
in this case by changing thecardinality estimates. (However, index scan's cost 
and cardinality estimates will be co-related in our particular example)

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-06-28 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067117#comment-16067117
 ] 

Harshvardhan Gupta commented on DERBY-6942:
---

The example I discussed above was quite a simple example, when there are 
multiple joins, the cardinality estimates become important. Since the result of 
a particular table node act as outer table for the nested join, errors will 
propagate outwards and small errors will become significant. (cost of scans are 
determined by multiplication from the total number of rows of outer table if 
table locks are being used)

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.

2017-06-28 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067113#comment-16067113
 ] 

Harshvardhan Gupta commented on DERBY-6942:
---

The estimateCost function in FromBaseTable.java is the guts of the cost and 
cardinality estimation logic for base table predicates.

The function is called for each of conglomerate (i.e each of the access path 
possible on the table such as indexes, full table scan). For predicates that 
can be utilized as start or stop predicates for the given conglomerate (refer 
http://db.apache.org/derby/docs/10.13/tuning/ctunoptimz24840.html), row 
estimates are obtained through the store by giving the actual constant keys 
specified in the query, the estimate by row is quite accurate as it finds out 
the fraction of values between the start and stop keys.

For all other remaining predicates one of the two things happen:

1) If an equality op is specified and statistics exist for the conglomerate 
(and it is not a valid start/stop predicate), we query statistics for 
selectivity.(it takes into account number of unique columns)

2) Hard wired selectivities are used for all other cases.


As part of DERBY-6940, we started collecting extra statistics such as null 
count, min and max value for each of the columns of the index. We would like to 
utilise them to eliminate usage of hard wired selectivities whenever possible.

For example - 

Let us say we have a table and an associated index created using the following 
statement - 

create table abcd(id integer);
create index idx on abcd(id);

Now, suppose a query is issued - 

select * from abcd where id is NULL;

There are two possible access paths for the query and both are considered by 
the query optimizer - 

1) Index Scan - The store is able to give away fairly accurate row estimates as 
'NULL' is a valid start as well as stop key.

2) Table scan - Here, hard wired selectivity estimates are used for row 
estimates as discussed above.

> Utilise additional statistics for selectivity estimates.
> 
>
> Key: DERBY-6942
> URL: https://issues.apache.org/jira/browse/DERBY-6942
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)