Re: branch for cbo work
Seems like folks are in agreement. I will go ahead and create a branch.

On Mon, Jun 23, 2014 at 11:40 AM, John Pullokkaran jpullokka...@hortonworks.com wrote:

I see that the design doc doesn't talk about the plug-and-play aspect of the cost model. It also doesn't make clear that the cost model described is for join algorithm selection, and it doesn't have a cost model for MR. I will update the doc appropriately.

Thanks,
John

--
CONFIDENTIALITY NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
Re: branch for cbo work
Created a branch at https://svn.apache.org/repos/asf/hive/branches/cbo

Folks interested in contributing, feel free to create JIRAs and provide patches for the branch. Let's advance Hive's optimizer by teaching it how to cost different plans!

Thanks,
Ashutosh

On Mon, Jun 23, 2014 at 11:13 PM, Ashutosh Chauhan hashut...@apache.org wrote:

Seems like folks are in agreement. I will go ahead and create a branch.
Re: branch for cbo work
Hi Ashutosh,

Thanks for the information. I do have some questions, but my concern is more with the design doc than with branching. Nevertheless, I think it would be very helpful to clarify the design before we put a lot of effort into it.

From the design doc, it seems that the cost estimation is based on Tez, while the optimization occurs at the logical layer. I'd think that CBO is valuable to either engine. If anything is specific to a particular engine, then that optimization should stay at the engine layer.

My original comments were posted on HIVE-5775. Please let me know your thoughts. I'd also like to hear from the community.

https://issues.apache.org/jira/browse/HIVE-5775?focusedCommentId=14039987&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14039987

Thanks,
Xuefu

On Thu, Jun 19, 2014 at 10:34 PM, Ashutosh Chauhan hashut...@apache.org wrote:

Hi all,

Some of you may have noticed that cost-based optimizer work is going on at HIVE-5775. John has put up an initial patch there as well. But there is a lot more work that needs to be done. Following our tradition of doing large feature work in a branch, I propose that we create a branch, commit this patch to it, and then continue to work in the branch to improve it. Hopefully, we can get it into shape so that we can merge it into trunk once it's ready. Unless I hear otherwise, I plan to create a branch and commit this initial patch by early next week.

Design doc is located here: https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive

Thanks,
Ashutosh
Re: branch for cbo work
The following may help reduce the confusion:

1. In the design doc, the cost formula is for choosing the join algorithm. The cost formula as described in the doc assumes Tez execution.
2. However, the current work on CBO doesn't include join algorithm selection. Instead, it reorders joins based on join cardinality and NDV (number of distinct values). In other words, join reordering does not depend on the physical execution layer (Tez or MR).
3. When we decide to do join algorithm selection, we can fit in cost formulas for both a) MR and b) Tez. This way, based on the physical execution layer, we can select the best join algorithm/order.
4. The cost formula for join algorithm selection is not that different between MR and Tez (except for intermediate HDFS writes). So we can assume that CBO can support both execution layers rather easily.
5. The CBO framework allows you to plug and play any cost model. There is no hard coupling.

Thanks,
John

On Mon, Jun 23, 2014 at 7:09 AM, Xuefu Zhang xzh...@cloudera.com wrote:

Hi Ashutosh,

Thanks for the information. I do have some questions, but my concern is more with the design doc than with branching. Nevertheless, I think it would be very helpful to clarify the design before we put a lot of effort into it.

From the design doc, it seems that the cost estimation is based on Tez, while the optimization occurs at the logical layer. I'd think that CBO is valuable to either engine. If anything is specific to a particular engine, then that optimization should stay at the engine layer.

My original comments were posted on HIVE-5775. Please let me know your thoughts. I'd also like to hear from the community.

https://issues.apache.org/jira/browse/HIVE-5775?focusedCommentId=14039987&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14039987

Thanks,
Xuefu
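John's numbered points 2 to 5 in the message above can be made concrete with a toy sketch. The following Python is not taken from the HIVE-5775 patch or the design doc. It assumes the textbook equi-join estimate |L| * |R| / max(NDV_L, NDV_R), a single NDV per table (every pair joins on the same key), and made-up cost weights. The names `TableStats`, `CostModel`, `TezCostModel`, `MRCostModel`, `pick_first_join` and `choose_algorithm` are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, Protocol, Sequence, Tuple


@dataclass(frozen=True)
class TableStats:
    """Hypothetical statistics: row count and NDV of the join key."""
    rows: float
    ndv: float


def join_cardinality(left: TableStats, right: TableStats) -> float:
    """Textbook equi-join estimate (an assumption, not Hive's formula)."""
    return left.rows * right.rows / max(left.ndv, right.ndv, 1.0)


def pick_first_join(tables: Dict[str, TableStats]) -> Tuple[str, str]:
    """Point 2: order joins by estimated cardinality only.

    No execution engine is consulted here, so the result is the same
    for MR and Tez.
    """
    return min(
        combinations(sorted(tables), 2),
        key=lambda pair: join_cardinality(tables[pair[0]], tables[pair[1]]),
    )


class CostModel(Protocol):
    """Point 5: any object with this method can be plugged in."""

    def join_cost(self, algorithm: str, left: TableStats,
                  right: TableStats) -> float: ...


class TezCostModel:
    def join_cost(self, algorithm: str, left: TableStats,
                  right: TableStats) -> float:
        small = min(left.rows, right.rows)
        large = max(left.rows, right.rows)
        if algorithm == "map_join":
            # Made-up weight: broadcasting the small side is expensive per row.
            return 10.0 * small + large
        # Shuffle join: made-up weight for shuffling both inputs.
        return 2.0 * (small + large)


class MRCostModel(TezCostModel):
    HDFS_WRITE_WEIGHT = 1.5  # made-up weight for point 4's extra term

    def join_cost(self, algorithm: str, left: TableStats,
                  right: TableStats) -> float:
        # Same formula as Tez plus a charge for writing the join output
        # to HDFS between jobs.
        base = super().join_cost(algorithm, left, right)
        return base + self.HDFS_WRITE_WEIGHT * join_cardinality(left, right)


def choose_algorithm(model: CostModel, left: TableStats, right: TableStats,
                     algorithms: Sequence[str] = ("map_join", "shuffle_join"),
                     ) -> str:
    """Point 3: pick the cheapest algorithm under the plugged-in model."""
    return min(algorithms, key=lambda a: model.join_cost(a, left, right))
```

In this sketch `pick_first_join` never sees a cost model, which mirrors the claim that join reordering is independent of the execution layer, while `choose_algorithm` takes the model as a parameter so that an MR, Tez, or future engine model can be swapped in without touching the chooser.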
Re: branch for cbo work
Thanks for the clarification. I'm happily on board with this as long as our approach takes into account the differences between execution engines. While MR and Tez might be similar, there could be new execution engines in the future that are not so similar. Ideally, all execution engines should benefit from this effort, while room is kept for optimizations specific to a particular engine. It's great if we all see it that way.

Thanks,
Xuefu

On Mon, Jun 23, 2014 at 11:05 AM, John Pullokkaran jpullokka...@hortonworks.com wrote:

The following may help reduce the confusion:

1. In the design doc, the cost formula is for choosing the join algorithm. The cost formula as described in the doc assumes Tez execution.
2. However, the current work on CBO doesn't include join algorithm selection. Instead, it reorders joins based on join cardinality and NDV (number of distinct values). In other words, join reordering does not depend on the physical execution layer (Tez or MR).
3. When we decide to do join algorithm selection, we can fit in cost formulas for both a) MR and b) Tez. This way, based on the physical execution layer, we can select the best join algorithm/order.
4. The cost formula for join algorithm selection is not that different between MR and Tez (except for intermediate HDFS writes). So we can assume that CBO can support both execution layers rather easily.
5. The CBO framework allows you to plug and play any cost model. There is no hard coupling.

Thanks,
John
Re: branch for cbo work
I see that the design doc doesn't talk about the plug-and-play aspect of the cost model. It also doesn't make clear that the cost model described is for join algorithm selection, and it doesn't have a cost model for MR. I will update the doc appropriately.

Thanks,
John
branch for cbo work
Hi all,

Some of you may have noticed that cost-based optimizer work is going on at HIVE-5775. John has put up an initial patch there as well. But there is a lot more work that needs to be done. Following our tradition of doing large feature work in a branch, I propose that we create a branch, commit this patch to it, and then continue to work in the branch to improve it. Hopefully, we can get it into shape so that we can merge it into trunk once it's ready. Unless I hear otherwise, I plan to create a branch and commit this initial patch by early next week.

Design doc is located here: https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive

Thanks,
Ashutosh