Re: branch for cbo work

2014-06-24 Thread Ashutosh Chauhan
Seems like folks are in agreement. I will go ahead and create a branch.


On Mon, Jun 23, 2014 at 11:40 AM, John Pullokkaran 
jpullokka...@hortonworks.com wrote:

 I see that the design doc doesn't talk about the plug-and-play aspect of the
 cost model. It also doesn't make it clear that the cost model described is
 for Join Algorithm selection, and it doesn't have a cost model for MR.

 I will update the doc appropriately.

 Thanks
 John

 --
 CONFIDENTIALITY NOTICE
 NOTICE: This message is intended for the use of the individual or entity to
 which it is addressed and may contain information that is confidential,
 privileged and exempt from disclosure under applicable law. If the reader
 of this message is not the intended recipient, you are hereby notified that
 any printing, copying, dissemination, distribution, disclosure or
 forwarding of this communication is strictly prohibited. If you have
 received this communication in error, please contact the sender immediately
 and delete it from your system. Thank You.



Re: branch for cbo work

2014-06-24 Thread Ashutosh Chauhan
Created a branch at https://svn.apache.org/repos/asf/hive/branches/cbo

Folks interested in contributing, feel free to create JIRAs and provide
patches for the branch.
Let's advance Hive's optimizer by teaching it how to cost different plans!

Thanks,
Ashutosh


On Mon, Jun 23, 2014 at 11:13 PM, Ashutosh Chauhan hashut...@apache.org
wrote:

 Seems like folks are in agreement. I will go ahead and create a branch.








Re: branch for cbo work

2014-06-23 Thread Xuefu Zhang
Hi Ashutosh,

Thanks for the information. I do have some questions, but my concern is
more with the design doc than with branching. Nevertheless, I think it would
be very helpful to clarify the design before we put in a lot of effort.

From the design doc, it seems that the cost estimation is based on Tez,
while the optimization occurs at the logical layer. I'd think that CBO is
valuable to either engine. If anything is specific to a particular engine,
then that optimization should stay at the engine layer.

My original comments were posted on HIVE-5775. Please let me know your
thoughts. I'd also like to hear from the community.

https://issues.apache.org/jira/browse/HIVE-5775?focusedCommentId=14039987&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14039987

Thanks,
Xuefu


On Thu, Jun 19, 2014 at 10:34 PM, Ashutosh Chauhan hashut...@apache.org
wrote:

 Hi all,

 Some of you may have noticed that cost-based optimizer work is going on at
 HIVE-5775. John has put up an initial patch there as well, but there is a
 lot more work that needs to be done. Following our tradition of doing large
 feature work in a branch, I propose that we create a branch, commit this
 patch to it, and then continue to improve it there. Hopefully, we can get
 it in shape so that we can merge it into trunk once it's ready. Unless I
 hear otherwise, I plan to create a branch and commit this initial patch by
 early next week.


 The design doc is located here:

 https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive


 Thanks,

 Ashutosh



Re: branch for cbo work

2014-06-23 Thread John Pullokkaran
The following may help reduce the confusion:

1. In the design doc, the cost formula is for choosing the Join Algorithm. The
formula as described in the doc assumes Tez execution.

2. However, the current work on CBO doesn't include Join Algorithm selection.
Instead, it reorders Joins based on Join cardinality and NDV (number of
distinct values). In other words, Join reordering does not depend on the
Physical Execution Layer (Tez or MR).

3. When we decide to do Join Algorithm selection, we can fit in cost formulas
for both a) MR and b) Tez. This way, based on the physical execution layer, we
can select the best Join Algorithm/Order.

4. The cost formula for Join Algorithm selection is not that different
between MR and Tez (except for intermediate HDFS writes), so I assume that CBO
can support both execution layers rather easily.

5. The CBO framework allows you to plug and play any cost model. There is no
hard coupling.

Thanks
John
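
Points 2, 4 and 5 above can be illustrated with a small Java sketch. It is
not taken from the HIVE-5775 patch: the JoinCostModel interface, both
implementations, the cost units and the hdfsWriteFactor parameter are
hypothetical names chosen for illustration. The cardinality estimate is the
common textbook formula for an equi-join,
rows(L) * rows(R) / max(NDV(L.key), NDV(R.key)), which depends only on
statistics and not on the execution engine. The two assumed cost models
differ only in an extra term for intermediate HDFS writes.

```java
/** Hypothetical pluggable cost model; not the interface used in the HIVE-5775 patch. */
interface JoinCostModel {
    /** Cost, in arbitrary units, of a join that reads two inputs and emits outputRows. */
    double joinCost(long leftRows, long rightRows, long outputRows);
}

/** Assumed Tez-like model: read both inputs, produce the output, no intermediate HDFS write. */
final class PipelinedCostModel implements JoinCostModel {
    @Override
    public double joinCost(long leftRows, long rightRows, long outputRows) {
        return leftRows + rightRows + outputRows;
    }
}

/** Assumed MR-like model: same work plus a charge for writing the join output to HDFS. */
final class MaterializingCostModel implements JoinCostModel {
    private final double hdfsWriteFactor; // hypothetical per-row write penalty

    MaterializingCostModel(double hdfsWriteFactor) {
        this.hdfsWriteFactor = hdfsWriteFactor;
    }

    @Override
    public double joinCost(long leftRows, long rightRows, long outputRows) {
        return leftRows + rightRows + outputRows + hdfsWriteFactor * outputRows;
    }
}

final class JoinCostSketch {
    /** Engine-independent textbook estimate of equi-join output rows from row counts and NDV. */
    static long estimateJoinRows(long leftRows, long leftNdv, long rightRows, long rightNdv) {
        long maxNdv = Math.max(1L, Math.max(leftNdv, rightNdv)); // guard against NDV = 0
        return Math.round((double) leftRows * rightRows / maxNdv);
    }

    public static void main(String[] args) {
        long out = estimateJoinRows(1_000_000L, 10_000L, 50_000L, 10_000L);
        JoinCostModel[] models = { new PipelinedCostModel(), new MaterializingCostModel(2.0) };
        for (JoinCostModel model : models) {
            // The cardinality estimate is shared; only the plugged-in cost model changes.
            System.out.println(model.getClass().getSimpleName() + ": "
                    + model.joinCost(1_000_000L, 50_000L, out));
        }
    }
}
```

Swapping one JoinCostModel for another leaves the cardinality estimate, and
therefore any reordering based on it, unchanged.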



On Mon, Jun 23, 2014 at 7:09 AM, Xuefu Zhang xzh...@cloudera.com wrote:

 Hi Ashutosh,

 Thanks for the information. I do have some questions, but my concern is
 more with the design doc than with branching. Nevertheless, I think it would
 be very helpful to clarify the design before we put in a lot of effort.

 From the design doc, it seems that the cost estimation is based on Tez,
 while the optimization occurs at the logical layer. I'd think that CBO is
 valuable to either engine. If anything is specific to a particular engine,
 then that optimization should stay at the engine layer.

 My original comments were posted on HIVE-5775. Please let me know your
 thoughts. I'd also like to hear from the community.

 https://issues.apache.org/jira/browse/HIVE-5775?focusedCommentId=14039987&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14039987

 Thanks,
 Xuefu






Re: branch for cbo work

2014-06-23 Thread Xuefu Zhang
Thanks for the clarification. I'm happily on board with this as long as our
approach takes into account the differences between execution engines. While
MR and Tez might be similar, there could be new execution engines in the
future that are not. Ideally, all execution engines should benefit from this
effort, with room left for optimizations specific to a particular engine.
It's great if we all see that.

Thanks,
Xuefu


On Mon, Jun 23, 2014 at 11:05 AM, John Pullokkaran 
jpullokka...@hortonworks.com wrote:

 The following may help reduce the confusion:

 1. In the design doc, the cost formula is for choosing the Join Algorithm. The
 formula as described in the doc assumes Tez execution.

 2. However, the current work on CBO doesn't include Join Algorithm selection.
 Instead, it reorders Joins based on Join cardinality and NDV. In other words,
 Join reordering does not depend on the Physical Execution Layer (Tez or MR).

 3. When we decide to do Join Algorithm selection, we can fit in cost formulas
 for both a) MR and b) Tez. This way, based on the physical execution layer, we
 can select the best Join Algorithm/Order.

 4. The cost formula for Join Algorithm selection is not that different
 between MR and Tez (except for intermediate HDFS writes), so I assume that CBO
 can support both execution layers rather easily.

 5. The CBO framework allows you to plug and play any cost model. There is no
 hard coupling.

 Thanks
 John







Re: branch for cbo work

2014-06-23 Thread John Pullokkaran
I see that the design doc doesn't talk about the plug-and-play aspect of the
cost model. It also doesn't make it clear that the cost model described is
for Join Algorithm selection, and it doesn't have a cost model for MR.

I will update the doc appropriately.

Thanks
John



branch for cbo work

2014-06-19 Thread Ashutosh Chauhan
Hi all,

Some of you may have noticed that cost-based optimizer work is going on at
HIVE-5775. John has put up an initial patch there as well, but there is a
lot more work that needs to be done. Following our tradition of doing large
feature work in a branch, I propose that we create a branch, commit this
patch to it, and then continue to improve it there. Hopefully, we can get
it in shape so that we can merge it into trunk once it's ready. Unless I
hear otherwise, I plan to create a branch and commit this initial patch by
early next week.


The design doc is located here:
https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive


Thanks,

Ashutosh