[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.

2017-08-02 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110605#comment-16110605
 ] 

Harshvardhan Gupta commented on DERBY-6938:
---

Bryan,

I agree with your observations. For production of efficient query plans, 
Cardinality Estimates, Cost estimation and Plan space are all inter dependent 
on each other. I am filing another issue under DERBY-6949 to track the effort 
on modelling HASH join's spill behaviour. I will also update cost modelling 
details under DERBY-6949 shortly after I had to take a detour analyzing HASH 
Join's behaviour.

Thanks,
Vardhan

>  Obtain cardinality estimates and true estimates for base tables as well as 
> for intermediate results for queries involving multiple joins. 
> ---
>
> Key: DERBY-6938
> URL: https://issues.apache.org/jira/browse/DERBY-6938
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
> Attachments: explain.txt, traceout.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.

2017-07-22 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097372#comment-16097372
 ] 

Bryan Pendleton commented on DERBY-6938:


Thank you for exploring this behavior in more detail, and for the clear 
explanation. It is very helpful!

Perhaps the critical question here involves the Optimizer's prediction about 
whether the intermediate results will fit in memory or not. It seems like that 
is the
area where the quality and accuracy of the cardinality and selectivity
estimates is crucial, since if those estimates are poor, the Optimizer will
have an inaccurate prediction about whether the intermediate results
will fit into memory or not, and it might avoid a hash join in a case where it
would in fact be the best approach, or vice versa might choose a hash join
in a case where the intermediate results are in fact very large and thus the
hash join will perform very poorly.


>  Obtain cardinality estimates and true estimates for base tables as well as 
> for intermediate results for queries involving multiple joins. 
> ---
>
> Key: DERBY-6938
> URL: https://issues.apache.org/jira/browse/DERBY-6938
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
> Attachments: explain.txt, traceout.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.

2017-07-22 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097272#comment-16097272
 ] 

Harshvardhan Gupta commented on DERBY-6938:
---

Summarizing  my analysis of HASH based joins so far -

Derby has the capability to perform HASH Joins spilling to disk but has long 
ignored them due to the absence of a cost model around such a behaviour. 
DERBY-1259 also talks about the history of HASH Joins and the changes made as 
part of DERBY-106.

>  Obtain cardinality estimates and true estimates for base tables as well as 
> for intermediate results for queries involving multiple joins. 
> ---
>
> Key: DERBY-6938
> URL: https://issues.apache.org/jira/browse/DERBY-6938
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
> Attachments: explain.txt, traceout.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.

2017-07-22 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097260#comment-16097260
 ] 

Harshvardhan Gupta commented on DERBY-6938:
---

Hi Bryan,

The Optimizer produces memory and cost estimates for all the possible access 
paths that it can build. Derby will only consider those HASH Joins to go 
through which it _predicts _ will fit into memory and rejects all others 
(including those which will be very efficient for some queries as opposed to 
loop based joins).

However, the memory predictions of Derby may not match memory requirement at 
execution and HASH table may become larger than what Derby predicted. DERBY-106 
was an effort to mitigate the OOM exceptions arising out of this behaviour. So 
although HASH tables do spill to disk but only for those access path go forward 
to execution where Derby predicts that the HASH Tables will not spill to disk.

The above observations of mine match with the comment on DERBY-1259 which was 
written a decade ago.

Coming to the resolution of this problem, we need to investigate how to measure 
the cost of execution when Derby will eventually allow those HASH based access 
paths that it predicts will spill to disk instead of simply ignoring them.

Also, interesting is the observation that Derby currently allows hash based 
joins when users specify their join strategies via Query Hints. 


>  Obtain cardinality estimates and true estimates for base tables as well as 
> for intermediate results for queries involving multiple joins. 
> ---
>
> Key: DERBY-6938
> URL: https://issues.apache.org/jira/browse/DERBY-6938
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
> Attachments: explain.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.

2017-07-15 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088752#comment-16088752
 ] 

Bryan Pendleton commented on DERBY-6938:


Hi Vardhan,

I think your observation about the Optimizer being unwilling to choose Hash 
Join plans if it believes they will not fit in memory is very interesting. Our 
documentation 
(https://db.apache.org/derby/docs/10.13/tuning/ctunoptimz23173.html) definitely 
claims that Derby will choose a hash join, and will spill to disk if the join 
is larger than will fit in memory, and I believe that was definitely the intent 
of DERBY-106.

But it appears that it isn't working as we expect.

In researching this, I came across a very interesting discussion in DERBY-4620.

Perhaps you could look through the DERBY-4620 work, and see if you have any 
additional observations to share regarding the behavior that you see?

>  Obtain cardinality estimates and true estimates for base tables as well as 
> for intermediate results for queries involving multiple joins. 
> ---
>
> Key: DERBY-6938
> URL: https://issues.apache.org/jira/browse/DERBY-6938
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
> Attachments: explain.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.

2017-06-18 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053197#comment-16053197
 ] 

Harshvardhan Gupta commented on DERBY-6938:
---

Tracking work on statistics enhancement at 
https://issues.apache.org/jira/browse/DERBY-6940

>  Obtain cardinality estimates and true estimates for base tables as well as 
> for intermediate results for queries involving multiple joins. 
> ---
>
> Key: DERBY-6938
> URL: https://issues.apache.org/jira/browse/DERBY-6938
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
> Attachments: explain.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.

2017-06-14 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048798#comment-16048798
 ] 

Harshvardhan Gupta commented on DERBY-6938:
---

The specific approach I am thinking is to keep the minimum and maximum value of 
columns and number of NULL values in statistics, this could be utilised in 
operators such as (< , > , <=, >=, IS NOT NULL, NULL) etc.

For example, lets say we have a int column and the minimum and maximum value is 
20 and 100 respectively. Then for a query predicate on that column with the 
condition that >=80 should ideally return 25% of all columns. This approach 
obviously assumes an uniform distribution but should be good to get started 
with. We should be able to make it more efficient by taking into account 
distribution later on.

>  Obtain cardinality estimates and true estimates for base tables as well as 
> for intermediate results for queries involving multiple joins. 
> ---
>
> Key: DERBY-6938
> URL: https://issues.apache.org/jira/browse/DERBY-6938
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
> Attachments: explain.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.

2017-06-14 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048793#comment-16048793
 ] 

Harshvardhan Gupta commented on DERBY-6938:
---

Bryan,
Regarding my doubt earlier, one thing that was particularly useful to dive deep 
into the optimizer was to enable optimizer tracing.
https://wiki.apache.org/db-derby/OptimizerTracing

The trace output is quite verbose and helps to understand the various choices 
the optimizer is making.

Few observations and scope of improvements that I would like to point out - 

1) Derby falls back to nested loops more often that we would like to 
particularly in case of large tables, currently the hash table resides entirely 
in memory and derby rules out the HASHJOIN approach if it suspects that it is 
going to be too large (default is 1048576)
Nested loops do not seem to be a good option specially when joining relatively 
large tables (similar to imdb dataset we are using) across more than 4 joins.

It is also documented in the optimizer paper that creating hash tables that 
spill to disk is a potential improvement and my experiments confirm that.

2) Another potential improvement with regards to cardinality estimates. Derby 
currently uses hard wired numbers for every operator other than the equality op 
for selectivity.
https://db.apache.org/derby/docs/10.0/manuals/tuning/perf56.html

In case of equality operator with a known value at compile time, it utilises 
statistics and make selectivity assumptions using number of unique values. I 
think we can enhance the statistics to be able to make better cardinality 
estimates. 


>  Obtain cardinality estimates and true estimates for base tables as well as 
> for intermediate results for queries involving multiple joins. 
> ---
>
> Key: DERBY-6938
> URL: https://issues.apache.org/jira/browse/DERBY-6938
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
> Attachments: explain.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.

2017-06-14 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048780#comment-16048780
 ] 

Harshvardhan Gupta commented on DERBY-6938:
---

To view and compare the estimates row counts and true row counts for base 
tables and for intermediate results the following queries can be used on xplain 
tables, the value of OP_IDENTIFIER can be changed to get data for nodes for a 
particular operation such as HASHJOIN, NLJOIN etc  

select SEEN_ROWS, SEEN_ROWS_RIGHT, RETURNED_ROWS, EST_ROW_COUNT  from 
SYSXPLAIN_RESULTSETS,SYSXPLAIN_STATEMENTS where OP_IDENTIFIER = 'HASHJOIN' and 
SYSXPLAIN_STATEMENTS.STMT_ID = SYSXPLAIN_RESULTSETS.STMT_ID;

To couple the scan information for nodes involved in scans - 

select SEEN_ROWS, RETURNED_ROWS, EST_ROW_COUNT , OP_IDENTIFIER from 
SYSXPLAIN_STATEMENTS, SYSXPLAIN_RESULTSETS, SYSXPLAIN_SCAN_PROPS where 
SYSXPLAIN_STATEMENTS.STMT_ID = SYSXPLAIN_RESULTSETS.STMT_ID and 
SYSXPLAIN_RESULTSETS.SCAN_RS_ID = SYSXPLAIN_SCAN_PROPS.SCAN_RS_ID and 
OP_IDENTIFIER like '%SCAN';






>  Obtain cardinality estimates and true estimates for base tables as well as 
> for intermediate results for queries involving multiple joins. 
> ---
>
> Key: DERBY-6938
> URL: https://issues.apache.org/jira/browse/DERBY-6938
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
> Attachments: explain.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.

2017-06-11 Thread Harshvardhan Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046055#comment-16046055
 ] 

Harshvardhan Gupta commented on DERBY-6938:
---

Hi Bryan,

I am trying to understand the cardinality estimates Derby Optimizer makes in 
the presence of joins, I am currently doing more experiments starting from 
joins for 2 tables and then moving to more levels of joins similar to those 
present in job query dataset. Along with this I am examining the code and will 
share the information that I obtain here.

Regards,
Vardhan,

>  Obtain cardinality estimates and true estimates for base tables as well as 
> for intermediate results for queries involving multiple joins. 
> ---
>
> Key: DERBY-6938
> URL: https://issues.apache.org/jira/browse/DERBY-6938
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
> Attachments: explain.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.

2017-06-10 Thread Bryan Pendleton (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045796#comment-16045796
 ] 

Bryan Pendleton commented on DERBY-6938:


I'm not certain I understand the question. I suspect we may have to study the 
code, and perform some experiments, to determine the answer.

But we can start by looking at these resources, in addition to the 
documentation you mentioned:
*  https://wiki.apache.org/db-derby/LanguageOptimize
*  http://db.apache.org/derby/papers/optimizer.html

I looked at the explain.txt that you attached, but I'm not certain I am 
following your analysis.

I wonder if you might be able to annotate the explain.txt with your own 
comments, and point directly at the section of the explain output that is of 
particular interest to you, so that I can better understand the question you 
are asking.

Note that it is entirely possible that Derby's query optimizer is not as 
sophisticated as you think it may be. That is one of the reasons that this 
testing is so important: we need to understand where Derby's query optimizer is 
not considering all the information that is available to it when it makes its 
decisions.


>  Obtain cardinality estimates and true estimates for base tables as well as 
> for intermediate results for queries involving multiple joins. 
> ---
>
> Key: DERBY-6938
> URL: https://issues.apache.org/jira/browse/DERBY-6938
> Project: Derby
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Harshvardhan Gupta
>Assignee: Harshvardhan Gupta
> Attachments: explain.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)