[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-26 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063622#comment-16063622
 ] 

Harshvardhan Gupta commented on DERBY-6940:
---

Since I am performing deep copy of the objects, I don't think (min==max) will 
work in our case. So I had to use the equality function that is implemented by 
each data type, the problem is that I need to catch the StandardException 
thrown by those functions and I can't delegate it to the writeExternal method 
since the interface definition won't let it compile.

I too believe that the code in present form is little clunky and open for 
potential improvements.

> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6940_2.diff, DERBY-6940_3.diff, derby-6940.diff, 
> EOFException_derby.log, EOFException.txt
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.
> We eventually want to utilize the new statistics to make better selectivity 
> estimates / cost estimates that will help find the best query plan. Currently 
> Derby keeps two type of stats - the total row count and the number of unique 
> values.
> We are initially extending the stats to include null count, the minimum value 
> and maximum value associated with each of the columns of an index. This would 
> be useful in selectivity estimates for operators such as [ IS NULL, <, <=, >, 
> >= ] , all of which currently rely on hardwired selectivity estimates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-23 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061702#comment-16061702
 ] 

Bryan Pendleton commented on DERBY-6940:


Your approach to addressing the DERBY-3219 problems seems fine to me. I think 
it is quite clever and should get us the behavior that we need.

Can you clarify why the exception handling is needed in the writeExternal 
method? That is, why isn't it more like:


{code:java}
writeBoolean( min == max);
if (min != max)
writeObject(maxVal);
writeObject(minVal);
{code}

Regarding additional statistics, I don't have any to add at this time. I'm a 
big fan of incremental improvement, so I'm happy to add only the statistics 
that we need now, and in the future if we should determine that additional 
statistics would be valuable, we can address those needs as follow-on projects.

However, yes, please, as you continue to study the code, and think about the 
approaches and possibilities, do let us know what additional statistics you 
think would be helpful, that is much appreciated!

> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6940_2.diff, DERBY-6940_3.diff, derby-6940.diff, 
> EOFException_derby.log, EOFException.txt
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.
> We eventually want to utilize the new statistics to make better selectivity 
> estimates / cost estimates that will help find the best query plan. Currently 
> Derby keeps two type of stats - the total row count and the number of unique 
> values.
> We are initially extending the stats to include null count, the minimum value 
> and maximum value associated with each of the columns of an index. This would 
> be useful in selectivity estimates for operators such as [ IS NULL, <, <=, >, 
> >= ] , all of which currently rely on hardwired selectivity estimates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-23 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061215#comment-16061215
 ] 

Harshvardhan Gupta commented on DERBY-6940:
---

Also, I am creating another issue which will track the progress of integration 
of the new statistics into selectivity estimates. Meanwhile, I would like to 
invite your thoughts on other statistics we should consider. Most common values 
and distribution buckets are on the top of my mind right now, I'd appreciate 
your thoughts on the same. The average size of row written to disk is important 
for cost estimation and a quick examination of derby optimizer's code reveals 
that derby is already using it in some form, nevertheless I would revisit it 
during my analysis of cost estimation over the course of this project. 

> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6940_2.diff, DERBY-6940_3.diff, derby-6940.diff, 
> EOFException_derby.log, EOFException.txt
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.
> We eventually want to utilize the new statistics to make better selectivity 
> estimates / cost estimates that will help find the best query plan. Currently 
> Derby keeps two type of stats - the total row count and the number of unique 
> values.
> We are initially extending the stats to include null count, the minimum value 
> and maximum value associated with each of the columns of an index. This would 
> be useful in selectivity estimates for operators such as [ IS NULL, <, <=, >, 
> >= ] , all of which currently rely on hardwired selectivity estimates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-23 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061206#comment-16061206
 ] 

Harshvardhan Gupta commented on DERBY-6940:
---

Hi Bryan,

I thought of a workaround and was successful. In particular, I am comparing the 
maxVal and minVal and if both are equal I first write an indicator boolean and 
then write only one DataValueDescriptor object. In all other cases, I first 
write maxVal and then minVal, In this way the problematic object will always be 
written last once.

public void writeExternal(ObjectOutput out)
 throws IOException
{
FormatableHashtable fh = new FormatableHashtable();
fh.putLong("numRows", numRows);
fh.putLong("numUnique", numUnique);
fh.putLong("nullCount", nullCount);
out.writeObject(fh);
try{
if (maxVal.equals(maxVal, minVal).getBoolean()) {
out.writeBoolean(true);
out.writeObject(minVal);
return;
}
}
catch(StandardException e){

}
finally {
out.writeBoolean(false);
out.writeObject(maxVal);
out.writeObject(minVal);
}
}



public void readExternal(ObjectInput in)
throws IOException, ClassNotFoundException
{
FormatableHashtable fh = (FormatableHashtable)in.readObject();
numRows = fh.getLong("numRows");
numUnique = fh.getLong("numUnique");
nullCount = fh.getLong("nullCount");
if(in.readBoolean()){
maxVal = (DataValueDescriptor)in.readObject();
minVal = maxVal.cloneValue(true);
}
else{
maxVal = (DataValueDescriptor) in.readObject();
minVal = (DataValueDescriptor) in.readObject();
}
}

> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6940_2.diff, DERBY-6940_3.diff, derby-6940.diff, 
> EOFException_derby.log, EOFException.txt
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.
> We eventually want to utilize the new statistics to make better selectivity 
> estimates / cost estimates that will help find the best query plan. Currently 
> Derby keeps two type of stats - the total row count and the number of unique 
> values.
> We are initially extending the stats to include null count, the minimum value 
> and maximum value associated with each of the columns of an index. This would 
> be useful in selectivity estimates for operators such as [ IS NULL, <, <=, >, 
> >= ] , all of which currently rely on hardwired selectivity estimates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-22 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060091#comment-16060091
 ] 

Harshvardhan Gupta commented on DERBY-6940:
---

While debugging the failed test cases I seem to have reproduced a previously 
know issue. Please take a look at my comment on DERBY-3219.

> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6940_2.diff, DERBY-6940_3.diff, derby-6940.diff
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.
> We eventually want to utilize the new statistics to make better selectivity 
> estimates / cost estimates that will help find the best query plan. Currently 
> Derby keeps two type of stats - the total row count and the number of unique 
> values.
> We are initially extending the stats to include null count, the minimum value 
> and maximum value associated with each of the columns of an index. This would 
> be useful in selectivity estimates for operators such as [ IS NULL, <, <=, >, 
> >= ] , all of which currently rely on hardwired selectivity estimates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-21 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16058566#comment-16058566
 ] 

Bryan Pendleton commented on DERBY-6940:


Regarding your observation about average row size and cost estimation, I 
thought it would be worthwhile to note that there is some evidence that cost 
estimation problems result in substantial impact to the Derby optimizer's query 
planning algorithms. For example, see: DERBY-1905, DERBY-1260, DERBY-1205, 
DERBY-1259, and DERBY-1007.

I'm not suggesting that we should take action on any of those issues right 
away, just noting that it's important to keep cost estimation behaviors in mind 
as we study the query optimizer's behaviors.


> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6940_2.diff, DERBY-6940_3.diff, derby-6940.diff
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.
> We eventually want to utilize the new statistics to make better selectivity 
> estimates / cost estimates that will help find the best query plan. Currently 
> Derby keeps two type of stats - the total row count and the number of unique 
> values.
> We are initially extending the stats to include null count, the minimum value 
> and maximum value associated with each of the columns of an index. This would 
> be useful in selectivity estimates for operators such as [ IS NULL, <, <=, >, 
> >= ] , all of which currently rely on hardwired selectivity estimates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-21 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16058547#comment-16058547
 ] 

Bryan Pendleton commented on DERBY-6940:


Thanks for the background information, and for the new patch. This all sounds 
quite promising to me.

I will download your updated patch and study it and see how it behaves in my 
environment.


> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6940_2.diff, DERBY-6940_3.diff, derby-6940.diff
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.
> We eventually want to utilize the new statistics to make better selectivity 
> estimates / cost estimates that will help find the best query plan. Currently 
> Derby keeps two type of stats - the total row count and the number of unique 
> values.
> We are initially extending the stats to include null count, the minimum value 
> and maximum value associated with each of the columns of an index. This would 
> be useful in selectivity estimates for operators such as [ IS NULL, <, <=, >, 
> >= ] , all of which currently rely on hardwired selectivity estimates.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-21 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16058041#comment-16058041
 ] 

Harshvardhan Gupta commented on DERBY-6940:
---

Another related link I missed earlier - 
https://www.postgresql.org/docs/9.1/static/planner-stats-details.html

> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6940_2.diff, DERBY-6940_3.diff, derby-6940.diff
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-20 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056415#comment-16056415
 ] 

Bryan Pendleton commented on DERBY-6940:


It seems like you are making very good progress on this! 

I wonder: if you follow the code paths for SYSCOLUMNS.COLUMNDEFAULT, will that 
give you some examples of how to store the min/max values in the statistics?

Also, I wonder if you could perhaps update the description of this job to 
provide some more background about the overall intent of the new statistics. It 
would be useful to have some (very high level) descriptions of how we would use 
these new statistics during query planning.

And, do you think this is a complete list of the additional statistics that it 
would be useful to have? For example, is it worth comparing the Derby-collected 
statistics with the statistics used by, say, Postgres, to see if there are any 
other interesting statistics we could be collecting at this time?


> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6940_2.diff, derby-6940.diff
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-20 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056086#comment-16056086
 ] 

Harshvardhan Gupta commented on DERBY-6940:
---

I am looking at object serialization as a possible solution but would like to 
be sure whether it makes sense to store the entire object or just its string 
representation.

> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6940_2.diff, derby-6940.diff
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-20 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056079#comment-16056079
 ] 

Harshvardhan Gupta commented on DERBY-6940:
---

Another thing, I am able to write and read the minVal and maxVal as strings 
successfully using their toString methods, but when trying to store the 
original DataValueDescriptor object, I get the following error while trying to 
read the statistics -

ERROR XSDA8: Exception during restore of a serializable or SQLData object of cla
ss
ERROR XJ001: Java exception: 'java.lang.ClassCastException: java.lang.String can
not be cast to org.apache.derby.iapi.types.DataValueDescriptor: java.io.StreamCo
rruptedException'.
ERROR XJ001: Java exception: 'java.lang.String cannot be cast to org.apache.derb
y.iapi.types.DataValueDescriptor: java.lang.ClassCastException'. 

Storing minVal and maxVal using String is not preferred for comparisons later 
on, I am trying to store the original object.

> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: DERBY-6940_2.diff, derby-6940.diff
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6940) Enhance derby statistics for more accurate selectivity estimates.

2017-06-18 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053201#comment-16053201
 ] 

Harshvardhan Gupta commented on DERBY-6940:
---

Additionally we need to work on upgrade procedures, while trying to access 
statistics info of an index created before this patch, Derby will throw a NPE 
due to absence of the new statistics. 

> Enhance derby statistics for more accurate selectivity estimates.
> -
>
> Key: DERBY-6940
> URL: https://issues.apache.org/jira/browse/DERBY-6940
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
>Priority: Minor
> Attachments: derby-6940.diff
>
>
> Derby should collect extra statistics during index build time, statistics 
> refresh time which will help optimizer make more precise selectivity 
> estimates and chose better execution paths.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)