[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.
[ https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110605#comment-16110605 ] Harshvardhan Gupta commented on DERBY-6938: --- Bryan, I agree with your observations. For production of efficient query plans, Cardinality Estimates, Cost estimation and Plan space are all inter dependent on each other. I am filing another issue under DERBY-6949 to track the effort on modelling HASH join's spill behaviour. I will also update cost modelling details under DERBY-6949 shortly after I had to take a detour analyzing HASH Join's behaviour. Thanks, Vardhan > Obtain cardinality estimates and true estimates for base tables as well as > for intermediate results for queries involving multiple joins. > --- > > Key: DERBY-6938 > URL: https://issues.apache.org/jira/browse/DERBY-6938 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta > Attachments: explain.txt, traceout.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.
[ https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097372#comment-16097372 ] Bryan Pendleton commented on DERBY-6938: Thank you for exploring this behavior in more detail, and for the clear explanation. It is very helpful! Perhaps the critical question here involves the Optimizer's prediction about whether the intermediate results will fit in memory or not. It seems like that is the area where the quality and accuracy of the cardinality and selectivity estimates is crucial, since if those estimates are poor, the Optimizer will have an inaccurate prediction about whether the intermediate results will fit into memory or not, and it might avoid a hash join in a case where it would in fact be the best approach, or vice versa might choose a hash join in a case where the intermediate results are in fact very large and thus the hash join will perform very poorly. > Obtain cardinality estimates and true estimates for base tables as well as > for intermediate results for queries involving multiple joins. > --- > > Key: DERBY-6938 > URL: https://issues.apache.org/jira/browse/DERBY-6938 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta > Attachments: explain.txt, traceout.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.
[ https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097272#comment-16097272 ] Harshvardhan Gupta commented on DERBY-6938: --- Summarizing my analysis of HASH based joins so far - Derby has the capability to perform HASH Joins spilling to disk but has long ignored them due to the absence of a cost model around such a behaviour. DERBY-1259 also talks about the history of HASH Joins and the changes made as part of DERBY-106. > Obtain cardinality estimates and true estimates for base tables as well as > for intermediate results for queries involving multiple joins. > --- > > Key: DERBY-6938 > URL: https://issues.apache.org/jira/browse/DERBY-6938 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta > Attachments: explain.txt, traceout.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.
[ https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097260#comment-16097260 ] Harshvardhan Gupta commented on DERBY-6938: --- Hi Bryan, The Optimizer produces memory and cost estimates for all the possible access paths that it can build. Derby will only consider those HASH Joins to go through which it _predicts _ will fit into memory and rejects all others (including those which will be very efficient for some queries as opposed to loop based joins). However, the memory predictions of Derby may not match memory requirement at execution and HASH table may become larger than what Derby predicted. DERBY-106 was an effort to mitigate the OOM exceptions arising out of this behaviour. So although HASH tables do spill to disk but only for those access path go forward to execution where Derby predicts that the HASH Tables will not spill to disk. The above observations of mine match with the comment on DERBY-1259 which was written a decade ago. Coming to the resolution of this problem, we need to investigate how to measure the cost of execution when Derby will eventually allow those HASH based access paths that it predicts will spill to disk instead of simply ignoring them. Also, interesting is the observation that Derby currently allows hash based joins when users specify their join strategies via Query Hints. > Obtain cardinality estimates and true estimates for base tables as well as > for intermediate results for queries involving multiple joins. > --- > > Key: DERBY-6938 > URL: https://issues.apache.org/jira/browse/DERBY-6938 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta > Attachments: explain.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.
[ https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088752#comment-16088752 ] Bryan Pendleton commented on DERBY-6938: Hi Vardhan, I think your observation about the Optimizer being unwilling to choose Hash Join plans if it believes they will not fit in memory is very interesting. Our documentation (https://db.apache.org/derby/docs/10.13/tuning/ctunoptimz23173.html) definitely claims that Derby will choose a hash join, and will spill to disk if the join is larger than will fit in memory, and I believe that was definitely the intent of DERBY-106. But it appears that it isn't working as we expect. In researching this, I came across a very interesting discussion in DERBY-4620. Perhaps you could look through the DERBY-4620 work, and see if you have any additional observations to share regarding the behavior that you see? > Obtain cardinality estimates and true estimates for base tables as well as > for intermediate results for queries involving multiple joins. > --- > > Key: DERBY-6938 > URL: https://issues.apache.org/jira/browse/DERBY-6938 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta > Attachments: explain.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.
[ https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053197#comment-16053197 ] Harshvardhan Gupta commented on DERBY-6938: --- Tracking work on statistics enhancement at https://issues.apache.org/jira/browse/DERBY-6940 > Obtain cardinality estimates and true estimates for base tables as well as > for intermediate results for queries involving multiple joins. > --- > > Key: DERBY-6938 > URL: https://issues.apache.org/jira/browse/DERBY-6938 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta > Attachments: explain.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.
[ https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048798#comment-16048798 ] Harshvardhan Gupta commented on DERBY-6938: --- The specific approach I am thinking is to keep the minimum and maximum value of columns and number of NULL values in statistics, this could be utilised in operators such as (< , > , <=, >=, IS NOT NULL, NULL) etc. For example, lets say we have a int column and the minimum and maximum value is 20 and 100 respectively. Then for a query predicate on that column with the condition that >=80 should ideally return 25% of all columns. This approach obviously assumes an uniform distribution but should be good to get started with. We should be able to make it more efficient by taking into account distribution later on. > Obtain cardinality estimates and true estimates for base tables as well as > for intermediate results for queries involving multiple joins. > --- > > Key: DERBY-6938 > URL: https://issues.apache.org/jira/browse/DERBY-6938 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta > Attachments: explain.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.
[ https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048793#comment-16048793 ] Harshvardhan Gupta commented on DERBY-6938: --- Bryan, Regarding my doubt earlier, one thing that was particularly useful to dive deep into the optimizer was to enable optimizer tracing. https://wiki.apache.org/db-derby/OptimizerTracing The trace output is quite verbose and helps to understand the various choices the optimizer is making. Few observations and scope of improvements that I would like to point out - 1) Derby falls back to nested loops more often that we would like to particularly in case of large tables, currently the hash table resides entirely in memory and derby rules out the HASHJOIN approach if it suspects that it is going to be too large (default is 1048576) Nested loops do not seem to be a good option specially when joining relatively large tables (similar to imdb dataset we are using) across more than 4 joins. It is also documented in the optimizer paper that creating hash tables that spill to disk is a potential improvement and my experiments confirm that. 2) Another potential improvement with regards to cardinality estimates. Derby currently uses hard wired numbers for every operator other than the equality op for selectivity. https://db.apache.org/derby/docs/10.0/manuals/tuning/perf56.html In case of equality operator with a known value at compile time, it utilises statistics and make selectivity assumptions using number of unique values. I think we can enhance the statistics to be able to make better cardinality estimates. > Obtain cardinality estimates and true estimates for base tables as well as > for intermediate results for queries involving multiple joins. > --- > > Key: DERBY-6938 > URL: https://issues.apache.org/jira/browse/DERBY-6938 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta > Attachments: explain.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.
[ https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048780#comment-16048780 ] Harshvardhan Gupta commented on DERBY-6938: --- To view and compare the estimates row counts and true row counts for base tables and for intermediate results the following queries can be used on xplain tables, the value of OP_IDENTIFIER can be changed to get data for nodes for a particular operation such as HASHJOIN, NLJOIN etc select SEEN_ROWS, SEEN_ROWS_RIGHT, RETURNED_ROWS, EST_ROW_COUNT from SYSXPLAIN_RESULTSETS,SYSXPLAIN_STATEMENTS where OP_IDENTIFIER = 'HASHJOIN' and SYSXPLAIN_STATEMENTS.STMT_ID = SYSXPLAIN_RESULTSETS.STMT_ID; To couple the scan information for nodes involved in scans - select SEEN_ROWS, RETURNED_ROWS, EST_ROW_COUNT , OP_IDENTIFIER from SYSXPLAIN_STATEMENTS, SYSXPLAIN_RESULTSETS, SYSXPLAIN_SCAN_PROPS where SYSXPLAIN_STATEMENTS.STMT_ID = SYSXPLAIN_RESULTSETS.STMT_ID and SYSXPLAIN_RESULTSETS.SCAN_RS_ID = SYSXPLAIN_SCAN_PROPS.SCAN_RS_ID and OP_IDENTIFIER like '%SCAN'; > Obtain cardinality estimates and true estimates for base tables as well as > for intermediate results for queries involving multiple joins. > --- > > Key: DERBY-6938 > URL: https://issues.apache.org/jira/browse/DERBY-6938 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta > Attachments: explain.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.
[ https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046055#comment-16046055 ] Harshvardhan Gupta commented on DERBY-6938: --- Hi Bryan, I am trying to understand the cardinality estimates Derby Optimizer makes in the presence of joins, I am currently doing more experiments starting from joins for 2 tables and then moving to more levels of joins similar to those present in job query dataset. Along with this I am examining the code and will share the information that I obtain here. Regards, Vardhan, > Obtain cardinality estimates and true estimates for base tables as well as > for intermediate results for queries involving multiple joins. > --- > > Key: DERBY-6938 > URL: https://issues.apache.org/jira/browse/DERBY-6938 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta > Attachments: explain.txt > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DERBY-6938) Obtain cardinality estimates and true estimates for base tables as well as for intermediate results for queries involving multiple joins.
[ https://issues.apache.org/jira/browse/DERBY-6938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045796#comment-16045796 ] Bryan Pendleton commented on DERBY-6938: I'm not certain I understand the question. I suspect we may have to study the code, and perform some experiments, to determine the answer. But we can start by looking at these resources, in addition to the documentation you mentioned: * https://wiki.apache.org/db-derby/LanguageOptimize * http://db.apache.org/derby/papers/optimizer.html I looked at the explain.txt that you attached, but I'm not certain I am following your analysis. I wonder if you might be able to annotate the explain.txt with your own comments, and point directly at the section of the explain output that is of particular interest to you, so that I can better understand the question you are asking. Note that it is entirely possible that Derby's query optimizer is not as sophisticated as you think it may be. That is one of the reasons that this testing is so important: we need to understand where Derby's query optimizer is not considering all the information that is available to it when it makes its decisions. > Obtain cardinality estimates and true estimates for base tables as well as > for intermediate results for queries involving multiple joins. > --- > > Key: DERBY-6938 > URL: https://issues.apache.org/jira/browse/DERBY-6938 > Project: Derby > Issue Type: Sub-task > Components: SQL >Reporter: Harshvardhan Gupta >Assignee: Harshvardhan Gupta > Attachments: explain.txt > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)