Re: [jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
Perhaps you could open the appropriate JIRA issues (to add documentation; to add additional data types; etc.) as sub-tasks of this JIRA issue, to help us keep track of it all?
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142950#comment-16142950 ] Bryan Pendleton commented on DERBY-6942: I think it would be fine to add additional data types separately, over time. I also think it would be fine to add documentation separately, over time. I agree that test cases for the basic functionality are necessary before we can commit the initial work. > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: DERBY-6942_2.diff, DERBY-6942.diff, z12.txt, z13.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141740#comment-16141740 ] Harshvardhan Gupta commented on DERBY-6942: --- Bryan, I am summarizing the remaining work required on releasing the changes done as part of Derby-6940 and Derby-6942. Test cases are required for addition of new statistics and the integration of integer data type with the additional new stats. Also the patch doesn't contain documentation changes that need to be done for each of the above new functionalities. Also, what do you think of integrating additional data types such as varchar etc and releasing all the changes together or doing it iteratively over time? > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: DERBY-6942_2.diff, DERBY-6942.diff, z12.txt, z13.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088698#comment-16088698 ] Bryan Pendleton commented on DERBY-6942: Can you attach any of the output of your testing? For example, can you capture a terminal session with "ij" showing the queries you ran, and the results of the changed behavior on the various queries with the new selectivity estimation? > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: DERBY-6942_2.diff, DERBY-6942.diff, z12.txt, z13.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082789#comment-16082789 ] Harshvardhan Gupta commented on DERBY-6942: --- During the local tests I did, I tested the selectivity outputs for relational operators of the following types : (>=, <=, > , <) and integer data type with the latest patch. Regarding the examples Bryan mentioned, I am not quite sure about this but I believe 'between' operator is internally transformed into one of these relational ops, in that case the latest patch is good for 'between' op also. > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: DERBY-6942_2.diff, DERBY-6942.diff, z12.txt, z13.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079341#comment-16079341 ] Bryan Pendleton commented on DERBY-6942: This seems like a reasonable approach to me. As I'm understanding your idea, it might be something like this (I don't have the syntax quite right, just trying to understand your idea): {code:java} float estimateProportion( Object queryStart, Object queryEnd, Object tableMin, Object tableMax ) {code} or {code:java} float estimateProportion( queryStart, queryEnd, tableMin, tableMax ) {code} as a generic method. where queryStart and queryEnd come from the WHERE clause of my SELECT, and tableMin and tableMax come from the statistics. and the estimateProportion method would return a value between 0 and 1. So, for example, {code:java} SELECT * FROM Flights WHERE flight_id BETWEEN 'AA' AND 'AA1112' {code} or {code:java} SELECT * FROM FlightAvailability WHERE business_seats_taken BETWEEN 0 AND 10 {code} In the above examples, queryStart/queryEnd would be 'AA'/'AA1112' in the first case, and 0/10 in the second. In the case of an integer, it might be implemented, more or less, as: {code:java} return (float)(queryEnd-queryStart)/(float)(tableMax-tableMin); {code} but as you say, data types like string or date would have more complex analysis to do. It seems like this might fit pretty well into the DataValueDescriptor group of Java classes in Derby. Have a look at the code in java/engine/org/apache/derby/iapi/types/*DataValue*.java, and see what you think? > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: DERBY-6942.diff, z12.txt, z13.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079256#comment-16079256 ] Harshvardhan Gupta commented on DERBY-6942: --- I experimented with integrating max and min values similar to null values. Unlike a straightforward null count, the primary challenge associated with this problem is that we are going to need to define a distance metric with each of the sql types in derby. For each example we need a procedure associate with varchar datatype which will tell derby that a string starting with 'L' is approximately equally far away from string startings with 'A' and 'Z'.(and thus give away a selectivity of 50% instead of relying on hardwired estimates) For some data types this is going to be quite straightforward like for int, long etc. Nevertheless, extensive testing is going to be required for each of the data type which is integrated in the above process. > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: DERBY-6942.diff, z12.txt, z13.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071413#comment-16071413 ] Harshvardhan Gupta commented on DERBY-6942: --- Yes, the cardinality and cost estimates are both going to be more accurate, typically in case of loop joins cost is estimated as (number of rows in outer tables * cost for single row match in inner table). > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: DERBY-6942.diff, z12.txt, z13.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071408#comment-16071408 ] Bryan Pendleton commented on DERBY-6942: Let me see if I get that, too: the cost of the table scan is the same in either case, because we have to read every page of the table and evaluate the predicate against each row. However, the number of rows in the result depends very heavily on the accuracy of the selectivity estimate. So if the results of the table scan go on to be used in a join with some other table, then the cost of the join will depend on the number of rows we read from the base table, and we can now have a more accurate estimate of that. > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: DERBY-6942.diff, z12.txt, z13.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071405#comment-16071405 ] Harshvardhan Gupta commented on DERBY-6942: --- Let me emphasize on this improvement in case of multiple joins: Although in the above example my changes didn't change the cost (it shouldn't since during execution by that conglomerate needs to do the same work), improvement in cardinality estimates will help us avoid cascading errors during cost estimation of joins. > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: DERBY-6942.diff, z12.txt, z13.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071404#comment-16071404 ] Harshvardhan Gupta commented on DERBY-6942: --- Bryan, You are on point! > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: DERBY-6942.diff, z12.txt, z13.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071403#comment-16071403 ] Bryan Pendleton commented on DERBY-6942: Let me see if I understand your conclusions so far: comparing z12.txt, and z13.txt, when the optimizer is considering the conglomerateNumber 1168, which is your base table ("abcd"), the optimizer has the qualifier "id is null" to consider. It reports that as an "extra qualifier", and applies a selectivity estimate to make a guess as to how many rows from the base table will match that predicate. Without your new statistics, we have z12.txt, where the optimizer reports: Number of extra qualifiers is : 1, extra qualifier selectivity is : 0.1 With your new statistics, we have z13.txt, where the optimizer reports: Number of extra qualifiers is : 1, extra qualifier selectivity is : 0.5 Your point is that, in the first case, the 0.1 is a hard-coded selectivity estimate which may or may not be accurate at all, while in the second case, where we have 0.5, this comes from a real observation of the data which counted the number of rows which had a null value for "id" compared to the total number of rows in the table. Am I understanding correctly? That's neat! > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: DERBY-6942.diff, z12.txt, z13.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071337#comment-16071337 ] Harshvardhan Gupta commented on DERBY-6942: --- In code, I am trying to query the statistics of those column which are backed by indexes. (and thus their statistics exist). I'm struck on this problem: How do we take a predicate and find out the relevant conglomerate from the list of all conglomerates associated with that table. From a predicate we can find out the ColumnReference which holds the Column Name, so I am trying to use this info to find out the relevant info. I would appreciate help on this blocker. > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > Attachments: z12.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067124#comment-16067124 ] Harshvardhan Gupta commented on DERBY-6942: --- In particular, when estimating cost for a particular conglomerate, it would be good if we start looking at other conglomerates' statistics for cardinality estimates. For example during table scan, we can start adjusting cardinality estimates using the statistics for index. Note that cost and cardinality are two different concepts, cost will always be determined by the number and size of pages / rows that a particular access path has to read to execute its work. Table scan's cost will remain unaffected here in this case by changing thecardinality estimates. (However, index scan's cost and cardinality estimates will be co-related in our particular example) > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067117#comment-16067117 ] Harshvardhan Gupta commented on DERBY-6942: --- The example I discussed above was quite a simple example, when there are multiple joins, the cardinality estimates become important. Since the result of a particular table node act as outer table for the nested join, errors will propagate outwards and small errors will become significant. (cost of scans are determined by multiplication from the total number of rows of outer table if table locks are being used) > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6942) Utilise additional statistics for selectivity estimates.
[ https://issues.apache.org/jira/browse/DERBY-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067113#comment-16067113 ] Harshvardhan Gupta commented on DERBY-6942: --- The estimateCost function in FromBaseTable.java is the guts of the cost and cardinality estimation logic for base table predicates. The function is called for each of conglomerate (i.e each of the access path possible on the table such as indexes, full table scan). For predicates that can be utilized as start or stop predicates for the given conglomerate (refer http://db.apache.org/derby/docs/10.13/tuning/ctunoptimz24840.html), row estimates are obtained through the store by giving the actual constant keys specified in the query, the estimate by row is quite accurate as it finds out the fraction of values between the start and stop keys. For all other remaining predicates one of the two things happen: 1) If an equality op is specified and statistics exist for the conglomerate (and it is not a valid start/stop predicate), we query statistics for selectivity.(it takes into account number of unique columns) 2) Hard wired selectivities are used for all other cases. As part of DERBY-6940, we started collecting extra statistics such as null count, min and max value for each of the columns of the index. We would like to utilise them to eliminate usage of hard wired selectivities whenever possible. For example - Let us say we have a table and an associated index created using the following statement - create table abcd(id integer); create index idx on abcd(id); Now, suppose a query is issued - select * from abcd where id is NULL; There are two possible access paths for the query and both are considered by the query optimizer - 1) Index Scan - The store is able to give away fairly accurate row estimates as 'NULL' is a valid start as well as stop key. 2) Table scan - Here, hard wired selectivity estimates are used for row estimates as discussed above. > Utilise additional statistics for selectivity estimates. > > > Key: DERBY-6942 > URL: https://issues.apache.org/jira/browse/DERBY-6942 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029)