Re: Tez branch and tez based patches
I am still not sure we are doing this the ideal way. I am not a believer in a commit-then-review branch.

This issue is an example: https://issues.apache.org/jira/browse/HIVE-5108

I ask myself these questions: Does this currently work? Are there tests? If so, which ones are broken? How does the patch fix them without tests to validate it?

Having a commit-then-review branch just seems subversive to our normal process, and a quick shortcut to avoid being bothered with writing tests or involving anyone else.
Re: Tez branch and tez based patches
Commit-then-review, and self-commit, destroy the good things we get from our normal system. http://anna.gs/blog/2013/08/12/code-review-ftw/

I am most worried about silos of knowledge, lax testing policies, and code quality, which I have now seen on several occasions when something is happening in a branch (not calling out the Tez branch in particular).
Re: Tez branch and tez based patches
Which talk are you referencing here? AFAIK all the Hive code we've written is being pushed back into the Tez branch, so you should be able to see it there.

Alan.

On Jul 29, 2013, at 9:02 PM, Edward Capriolo wrote:

> At ~25:00
>
> There is a working prototype of hive which is using tez as the targeted runtime
>
> Can I get a look at that code? Is it on github?
>
> Edward
Re: Tez branch and tez based patches
On Jul 29, 2013, at 9:53 PM, Edward Capriolo wrote:

> Also watched http://www.ustream.tv/recorded/36323173
>
> I definitely see the win in being able to stream inter-stage output. I see some cases where small intermediate results can be kept in memory. But I was somewhat under the impression that the map reduce spill settings kept stuff in memory, isn't that what spill settings are?

No. MapReduce always writes shuffle data to local disk. And intermediate results between MR jobs are always persisted to HDFS, as there's no other option. When we talk of being able to keep intermediate results in memory we mean getting rid of both of these disk writes/reads when appropriate (meaning not always; there's a trade-off between speed and error handling to be made here, see below for more details).

> There are a few bullet points that came up repeatedly that I do not follow:
>
> Something was said to the effect of "Container reuse makes X faster". Hadoop has JVM reuse. Not following what the difference is here? Not everyone has a 10K node cluster.

Sharing JVMs across users is inherently insecure (we can't guarantee what code the first user left behind that may interfere with later users). As I understand container re-use in Tez, it constrains the re-use to one user for security reasons, but still avoids additional JVM start-up costs. But this is a question that the Tez guys could answer better on the Tez lists (d...@tez.incubator.apache.org)

> "Joins in map reduce are hard" Really? I mean some of them are I guess, but the typical join is very easy. Just shuffle by the join key. There were not really enough low-level details here saying why joins are better in tez.

Join is not a natural operation in MapReduce. MR gives you one input and one output. You end up having to bend the rules to have multiple inputs. The idea here is that Tez can provide operators that naturally work with joins and other operations that don't fit the one input/one output model (e.g. unions, etc.).
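To make the two positions above concrete, here is a minimal Python sketch of the "shuffle by the join key" approach. It is an illustration only, not Hive, Hadoop, or Tez code: the table names, the record layout, and the `reduce_side_join` helper are all hypothetical. Because a single MapReduce job sees one input stream, the records from both tables are tagged with their source before the shuffle, which is the kind of rule-bending the reply describes.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(users, orders):
    """Inner-join two lists of (key, value) pairs the way one MR job would.

    Hypothetical helper for illustration; not part of any Hadoop API.
    """
    # "Map" phase: both inputs are merged into a single stream, so every
    # record carries a tag saying which table it came from.
    tagged = [(key, ("users", value)) for key, value in users]
    tagged += [(key, ("orders", value)) for key, value in orders]

    # "Shuffle" phase: group all tagged records by the join key.
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)

    # "Reduce" phase: split each group back into its two sides and emit
    # the cross product of the matching rows.
    joined = []
    for key in sorted(groups):
        left = [value for tag, value in groups[key] if tag == "users"]
        right = [value for tag, value in groups[key] if tag == "orders"]
        joined.extend(
            (key, left_row, right_row)
            for left_row, right_row in product(left, right)
        )
    return joined


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob")]
    orders = [(1, "book"), (1, "pen"), (3, "lamp")]
    print(reduce_side_join(users, orders))
```

The join itself is short, as the question suggests; the tagging and untagging steps exist only because everything has to pass through one input and one output. An engine with operators that accept several inputs would presumably not need them.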
> "Choosing the number of maps and reduces is hard" Really? I do not find it that hard. I think there are times when it's not perfect, but I do not find it hard. The talk did not really offer anything technical here on how tez makes this better, other than that it could make it better.

Perhaps manual would be a better term here than hard. In our experience it takes quite a bit of engineer trial and error to determine the optimal numbers. This may be ok if you're going to invest the time once and then run the same query every day for 6 months. But obviously it doesn't work for the ad hoc case. Even in the batch case it's not optimal because every once in a while an engineer has to go back and re-optimize the query to deal with changing data sizes, data characteristics, etc. We want the optimizer to handle this without human intervention.

> The presentations mentioned streaming data. How do two nodes stream data between tasks, and how is it reliable? If the sender or receiver dies does the entire process have to start again?

If the sender or receiver dies then the query has to be restarted from some previous point where data was persisted to disk. The idea here is that speed vs error recovery trade-offs should be made by the optimizer. If the optimizer estimates that a query will complete in 5 seconds it can stream everything, and if a node fails it just re-runs the whole query. If it estimates that a particular phase of a query will run for an hour it can choose to persist the results to HDFS so that in the event of a failure downstream the long phase need not be re-run. Again we want this to be done automatically by the system so the user doesn't need to control this level of detail.

> Again one of the talks implied there is a prototype out there that launches hive jobs into tez. I would like to see that, it might answer more questions than a power point, and I could profile some common queries.
As mentioned in a previous email, afaik Gunther's pushed all these changes to the Tez branch in Hive.

Alan.

> Random late night thoughts over,
> Ed
Re: Tez branch and tez based patches
At ~25:00

There is a working prototype of hive which is using tez as the targeted runtime

Can I get a look at that code? Is it on github?

Edward

On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates ga...@hortonworks.com wrote:

> Answers to some of your questions inlined.
>
> Alan.
>
> On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
>
>> There are some points I want to bring up. First, I am on the PMC. Here is something I find relevant:
>>
>> http://www.apache.org/foundation/how-it-works.html
>>
>> -- The role of the PMC from a Foundation perspective is oversight. The main role of the PMC is not code and not coding - but to ensure that all legal issues are addressed, that procedure is followed, and that each and every release is the product of the community as a whole. That is key to our litigation protection mechanisms.
>>
>> Secondly the role of the PMC is to further the long term development and health of the community as a whole, and to ensure that balanced and wide scale peer review and collaboration does happen. Within the ASF we worry about any community which centers around a few individuals who are working virtually uncontested. We believe that this is detrimental to quality, stability, and robustness of both code and long term social structures.
>>
>> https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
>>
>> - All other decisions happen on the dev list, discussions on the private list are kept to a minimum. If it didn't happen on the dev list, it didn't happen - which leads to:
>> a) Elections of committers and PMC members are published on the dev list once finalized.
>> b) Out-of-band discussions (IRC etc.) are summarized on the dev list as soon as they have impact on the project, code or community.
>>
>> - https://issues.apache.org/jira/browse/HIVE-4660 ironically titled "Let there be Tez" has not been +1'ed by any committer. It was never discussed on the dev or the user list (as far as I can tell).
>
> As all JIRA creations and updates are sent to dev@hive, creating a JIRA is de facto posting to the list.
>
>> As a PMC member I feel we need more discussion on Tez on the dev list along with a wiki-fied design document. Topics of discussion should include:
>
> I talked with Gunther and he's working on posting a design doc on the wiki. He has a PDF on the JIRA but he doesn't have write permissions yet on the wiki.
>
>> 1) What is tez?
>
> In Hadoop 2.0, YARN opens up the ability to have multiple execution frameworks in Hadoop. Hadoop apps are no longer tied to MapReduce as the only execution option. Tez is an effort to build an execution engine that is optimized for relational data processing, such as Hive and Pig. The biggest change here is to move away from only Map and Reduce as processing options and to allow alternate combinations of processing, such as map - reduce - reduce, or tasks that take multiple inputs, or shuffles that avoid sorting when it isn't needed. For a good intro to Tez, see Arun's presentation on it at the recent Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)
>
>> 2) How is tez different from oozie, http://code.google.com/p/hop/, http://cs.brown.edu/~backman/cmr.html, and other DAG and/or streaming map reduce tools/frameworks? Why should we use this and not those?
>
> Oozie is a completely different thing. Oozie is a workflow engine and a scheduler. Its core competencies are the ability to coordinate workflows of disparate job types (MR, Pig, Hive, etc.) and to schedule them. It is not intended as an execution engine for apps such as Pig and Hive.
>
> I am not familiar with these other engines, but the short answer is that Tez is built to work on YARN, which works well for Hive since it is tied to Hadoop.
>
>> 3) When can we expect the first tez release?
>
> I don't know, but I hope sometime this fall.
>
>> 4) How much effort is involved in integrating hive and tez?
>
> Covered in the design doc.
>
>> 5) Who is ready to commit to this effort?
>
> I'll let people speak for themselves on that one.
>
>> 6) Can we expect this work to be done in one hive release?
>
> Unlikely. Initial integration will be done in one release, but as Tez is a new project I expect it will be adding features in the future that Hive will want to take advantage of.
>
>> In my opinion we should not start any work on this tez-hive until these questions are answered to the satisfaction of the hive developers.
>
> Can we change this to not commit patches? We can't tell willing people not to work on it.
>
>> On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
>>
>>> The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out what votes are needed for
Re: Tez branch and tez based patches
Also watched http://www.ustream.tv/recorded/36323173

I definitely see the win in being able to stream inter-stage output. I see some cases where small intermediate results can be kept in memory. But I was somewhat under the impression that the map reduce spill settings kept stuff in memory, isn't that what spill settings are?

There are a few bullet points that came up repeatedly that I do not follow:

Something was said to the effect of "Container reuse makes X faster". Hadoop has JVM reuse. Not following what the difference is here? Not everyone has a 10K node cluster.

"Joins in map reduce are hard" Really? I mean some of them are I guess, but the typical join is very easy. Just shuffle by the join key. There were not really enough low-level details here saying why joins are better in tez.

"Choosing the number of maps and reduces is hard" Really? I do not find it that hard. I think there are times when it's not perfect, but I do not find it hard. The talk did not really offer anything technical here on how tez makes this better, other than that it could make it better.

The presentations mentioned streaming data. How do two nodes stream data between tasks, and how is it reliable? If the sender or receiver dies does the entire process have to start again?

Again one of the talks implied there is a prototype out there that launches hive jobs into tez. I would like to see that, it might answer more questions than a power point, and I could profile some common queries.

Random late night thoughts over,
Ed
Re: Tez branch and tez based patches
I have finally gotten access to wiki and added the design doc: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez I've also added links to it from the jira and in general overhauled the design. Please let me know if you feel there's still stuff missing from the document. Possibly we should be thinking on how to build hive in such a way that many different frameworks could plug in. I believe that the proposed design and refactoring puts you on that path. I'm not introducing layer upon layer of abstraction without a specific use case in mind, but high level you would go through similar steps: Exec layer: - Define your own Task classes - If you can reuse the operator pipeline define your own replacement for ExecMapper/ExecReducer (glue code to drive records through the pipeline) - Operators: You might have to add specific operators for your framework Planning: - Define your own work classes (or reuse existing ones). These abstractly encapsulate all input/meta info necessary to execute. - Define your own *Compiler to translate either the logical plan or physical plan to a graph of Tasks. This might include specific additional optimizations. Devil's in the details no doubt. Thanks, Gunther. On Sat, Jul 20, 2013 at 8:10 AM, Edward Capriolo edlinuxg...@gmail.comwrote: I agree we are getting into grey area with the term disruptive. For reference ( I have not been doing this all the time bad on me) we are supposed to +1 and wait a day. I am not familiar with these other engines, but the short answer is that Tez is built to work on YARN, which works well for Hive since it is tied to Hadoop I understand what you are saying here yarn support is a plus. However the rest of the answer is something relevant to the discussion. There are already frameworks like spark that are semi popular. http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data . There are also other framworks like s4 http://incubator.apache.org/s4/, or storm. 
A big part of making a design decision is doing a competitive analysis. Usually asking yourself What else for this is already out there? or Can this be done other ways? I do want to be convinced we do not lock into tez too early with tunnel vision. Possibly we should be thinking on how to build hive in such a way that many different frameworks could plug in. In other words convincing that tez is the best choice, since many people are claiming an mrr type solution. I will watch the video you posted and study the material myself as well. On Wed, Jul 17, 2013 at 8:43 PM, Ashutosh Chauhan hashut...@apache.org wrote: On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo edlinuxg...@gmail.com wrote: In my opinion we should limit the amount of tez related optimizations to and trunk Refactoring that cleans up code is good, but as you have pointed out there wont be a tez release until sometime this fall, and this branch will be open for an extended period of time. Thus code cleanups and other tez related refactoring does not need to be disruptive to trunk. I agree Tez specific changes need not to go in trunk. But general refactoring and code cleanup needs to happen on trunk as and when someone is willing to work on those. We have to continually improve our code quality. Code maintainability and readability is a priority. Without that code quality suffers and discourages new contributors to contribute because code is unnecessarily complicated. SemanticAnalyzer is 11K line class. We need to simplify it. Patch like HIVE-4811 is a welcome change which tackled it. Exec package is all convoluted which mixes up runtime operators and drivers for runtime. Thats a welcome patch because it makes it much more easy to read and reason about that piece of code. HIVE-4825 is another example which improves modularity of code. For contributors who are exposed to Hive first time it will be easier for them to follow the code. 
Rather than disruptive to trunk, they are constructive for trunk and I am glad people are choosing to work on that. Tez or no Tez Hive is better off with these patches. Thanks, Ashutosh On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates ga...@hortonworks.com wrote: Answers to some of your questions inlined. Alan. On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote: There are some points I want to bring up. First, I am on the PMC. Here is something I find relevant: http://www.apache.org/foundation/how-it-works.html -- The role of the PMC from a Foundation perspective is oversight. The main role of the PMC is not code and not coding - but to ensure that all legal issues are addressed, that procedure is followed, and that each and every release is the product of the community as a whole. That is key to our
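The plug-in steps in Gunther's message above (work classes, Task classes, and a *Compiler that turns a plan into Tasks) can be sketched roughly as below. This is only a minimal illustration under stated assumptions: every type and method name here (`Work`, `Task`, `TaskCompiler`, `compilerFor`, `run`) is hypothetical and is not Hive's actual API, and a plain list stands in for the graph of Tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: these types are hypothetical and are NOT Hive's real classes.
class PluggableEngineSketch {

    /** "Work": abstractly encapsulates the input/meta info needed to execute. */
    static final class Work {
        final String name;

        Work(String name) {
            this.name = name;
        }
    }

    /** "Task": a framework-specific unit that executes one Work. */
    interface Task {
        String execute();
    }

    /** "*Compiler": translates a plan into Tasks (a list stands in for the graph). */
    interface TaskCompiler {
        List<Task> compile(List<Work> plan);
    }

    /** One compiler per framework; the tag records which framework built each task. */
    static TaskCompiler compilerFor(final String framework) {
        return plan -> {
            List<Task> tasks = new ArrayList<>();
            for (final Work work : plan) {
                tasks.add(() -> framework + ":" + work.name);
            }
            return tasks;
        };
    }

    /** The driver depends only on the interfaces, so frameworks can be swapped. */
    static List<String> run(TaskCompiler compiler, List<Work> plan) {
        List<String> results = new ArrayList<>();
        for (Task task : compiler.compile(plan)) {
            results.add(task.execute());
        }
        return results;
    }

    public static void main(String[] args) {
        List<Work> plan = Arrays.asList(new Work("scan"), new Work("join"));
        System.out.println(run(compilerFor("mr"), plan));
        System.out.println(run(compilerFor("other"), plan));
    }
}
```

The point of the sketch is only that the driver (`run`) never names a concrete framework; whether Hive's real planner can be split this cleanly is exactly the "devil's in the details" question raised in the message.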
Re: Tez branch and tez based patches
I agree we are getting into a grey area with the term disruptive. For reference (I have not been doing this all the time, bad on me) we are supposed to +1 and wait a day.

I am not familiar with these other engines, but the short answer is that Tez is built to work on YARN, which works well for Hive since it is tied to Hadoop

I understand what you are saying here; YARN support is a plus. However the rest of the answer is something relevant to the discussion. There are already frameworks like Spark that are semi popular (http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data). There are also other frameworks like S4 http://incubator.apache.org/s4/, or Storm. A big part of making a design decision is doing a competitive analysis, usually asking yourself "What else for this is already out there?" or "Can this be done other ways?"

I do want to be convinced we do not lock into tez too early with tunnel vision. Possibly we should be thinking on how to build hive in such a way that many different frameworks could plug in. In other words, I want convincing that tez is the best choice, since many people are claiming an MRR-type solution. I will watch the video you posted and study the material myself as well.

On Wed, Jul 17, 2013 at 8:43 PM, Ashutosh Chauhan hashut...@apache.org wrote:

On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

In my opinion we should limit the amount of tez related optimizations to and trunk. Refactoring that cleans up code is good, but as you have pointed out there won't be a tez release until sometime this fall, and this branch will be open for an extended period of time. Thus code cleanups and other tez related refactoring do not need to be disruptive to trunk.

I agree Tez specific changes need not go in trunk. But general refactoring and code cleanup needs to happen on trunk as and when someone is willing to work on those. We have to continually improve our code quality.
Code maintainability and readability is a priority. Without that code quality suffers and discourages new contributors to contribute because code is unnecessarily complicated. SemanticAnalyzer is 11K line class. We need to simplify it. Patch like HIVE-4811 is a welcome change which tackled it. Exec package is all convoluted which mixes up runtime operators and drivers for runtime. Thats a welcome patch because it makes it much more easy to read and reason about that piece of code. HIVE-4825 is another example which improves modularity of code. For contributors who are exposed to Hive first time it will be easier for them to follow the code. Rather than disruptive to trunk, they are constructive for trunk and I am glad people are choosing to work on that. Tez or no Tez Hive is better off with these patches. Thanks, Ashutosh On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates ga...@hortonworks.com wrote: Answers to some of your questions inlined. Alan. On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote: There are some points I want to bring up. First, I am on the PMC. Here is something I find relevant: http://www.apache.org/foundation/how-it-works.html -- The role of the PMC from a Foundation perspective is oversight. The main role of the PMC is not code and not coding - but to ensure that all legal issues are addressed, that procedure is followed, and that each and every release is the product of the community as a whole. That is key to our litigation protection mechanisms. Secondly the role of the PMC is to further the long term development and health of the community as a whole, and to ensure that balanced and wide scale peer review and collaboration does happen. Within the ASF we worry about any community which centers around a few individuals who are working virtually uncontested. We believe that this is detrimental to quality, stability, and robustness of both code and long term social structures. 
https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different - All other decisions happen on the dev list, discussions on the private list are kept to a minimum. If it didn't happen on the dev list, it didn't happen - which leads to: a) Elections of committers and PMC members are published on the dev list once finalized. b) Out-of-band discussions (IRC etc.) are summarized on the dev list as soon as they have impact on the project, code or community. - https://issues.apache.org/jira/browse/HIVE-4660 ironically titled Let their be Tez has not be +1 ed by any committer. It was never discussed on the dev or the user list (as far as I can tell). As all
Re: Tez branch and tez based patches
Answers to some of your questions inlined. Alan. On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote: There are some points I want to bring up. First, I am on the PMC. Here is something I find relevant: http://www.apache.org/foundation/how-it-works.html -- The role of the PMC from a Foundation perspective is oversight. The main role of the PMC is not code and not coding - but to ensure that all legal issues are addressed, that procedure is followed, and that each and every release is the product of the community as a whole. That is key to our litigation protection mechanisms. Secondly the role of the PMC is to further the long term development and health of the community as a whole, and to ensure that balanced and wide scale peer review and collaboration does happen. Within the ASF we worry about any community which centers around a few individuals who are working virtually uncontested. We believe that this is detrimental to quality, stability, and robustness of both code and long term social structures. https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different - All other decisions happen on the dev list, discussions on the private list are kept to a minimum. If it didn't happen on the dev list, it didn't happen - which leads to: a) Elections of committers and PMC members are published on the dev list once finalized. b) Out-of-band discussions (IRC etc.) are summarized on the dev list as soon as they have impact on the project, code or community. - https://issues.apache.org/jira/browse/HIVE-4660 ironically titled Let their be Tez has not be +1 ed by any committer. It was never discussed on the dev or the user list (as far as I can tell). As all JIRA creations and updates are sent to dev@hive, creating a JIRA is de facto posting to the list. As a PMC member I feel we need more discussion on Tez on the dev list along with a wiki-fied design document. 
Topics of discussion should include:

I talked with Gunther and he's working on posting a design doc on the wiki. He has a PDF on the JIRA but he doesn't have write permissions yet on the wiki.

1) What is tez?

In Hadoop 2.0, YARN opens up the ability to have multiple execution frameworks in Hadoop. Hadoop apps are no longer tied to MapReduce as the only execution option. Tez is an effort to build an execution engine that is optimized for relational data processing, such as Hive and Pig. The biggest change here is to move away from only Map and Reduce as processing options and to allow alternate combinations of processing, such as map - reduce - reduce, tasks that take multiple inputs, or shuffles that avoid sorting when it isn't needed. For a good intro to Tez, see Arun's presentation on it at the recent Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)

2) How is tez different from oozie, http://code.google.com/p/hop/, http://cs.brown.edu/~backman/cmr.html , and other DAG and or streaming map reduce tools/frameworks? Why should we use this and not those?

Oozie is a completely different thing. Oozie is a workflow engine and a scheduler. Its core competencies are the ability to coordinate workflows of disparate job types (MR, Pig, Hive, etc.) and to schedule them. It is not intended as an execution engine for apps such as Pig and Hive. I am not familiar with these other engines, but the short answer is that Tez is built to work on YARN, which works well for Hive since it is tied to Hadoop.

3) When can we expect the first tez release?

I don't know, but I hope sometime this fall.

4) How much effort is involved in integrating hive and tez?

Covered in the design doc.

5) Who is ready to commit to this effort?

I'll let people speak for themselves on that one.

6) can we expect this work to be done in one hive release?

Unlikely.
Initial integration will be done in one release, but as Tez is a new project I expect it will be adding features in the future that Hive will want to take advantage of. In my opinion we should not start any work on this tez-hive until these questions are answered to the satisfaction of the hive developers. Can we change this to not commit patches? We can't tell willing people not to work on it. On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo edlinuxg...@gmail.comwrote: The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out what votes are needed for what. I don't see anything there about needing 3 +1s for a branch. Branching would seem to fall under code change, which requires one vote and a minimum length of 1 day. You could argue that all you need is one +1 to create a branch, but this is more then a branch. If you are
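To make the "map - reduce - reduce" point in the answer to question 1 concrete, here is a minimal sketch that counts how many classic MapReduce jobs a chain of stages needs, assuming each MapReduce job can contain at most one reduce stage and that every handoff between jobs is a write to HDFS. The class and method names (`StageChainSketch`, `mapReduceJobs`, `hdfsHandoffs`) are invented for illustration, and the sketch says nothing about how Tez itself is implemented.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not code from Hive, Hadoop, or Tez.
class StageChainSketch {

    enum Stage { MAP, REDUCE }

    /**
     * Counts the classic MapReduce jobs needed to run the chain, assuming
     * a job ends as soon as it has run its single reduce stage.
     */
    static int mapReduceJobs(List<Stage> chain) {
        int jobs = 0;
        boolean jobOpen = false; // true while the current job has not reduced yet
        for (Stage stage : chain) {
            if (!jobOpen) {
                jobs++;          // a new job starts (with an implicit map if needed)
                jobOpen = true;
            }
            if (stage == Stage.REDUCE) {
                jobOpen = false; // the reduce closes the current job
            }
        }
        return jobs;
    }

    /** Each boundary between two consecutive jobs is one intermediate HDFS handoff. */
    static int hdfsHandoffs(List<Stage> chain) {
        return Math.max(0, mapReduceJobs(chain) - 1);
    }

    public static void main(String[] args) {
        List<Stage> mrr = Arrays.asList(Stage.MAP, Stage.REDUCE, Stage.REDUCE);
        System.out.println("MapReduce jobs: " + mapReduceJobs(mrr)); // prints "MapReduce jobs: 2"
        System.out.println("HDFS handoffs: " + hdfsHandoffs(mrr));   // prints "HDFS handoffs: 1"
    }
}
```

Under these assumptions a map - reduce - reduce chain costs two MapReduce jobs and one intermediate write, whereas an engine that accepts the whole chain as a single unit of work would not need that job boundary.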
Re: Tez branch and tez based patches
As all JIRA creations and updates are sent to dev@hive, creating a JIRA is de facto posting to the list.

Agreed (although several ticket names are non-descriptive).

Possibly more out-of-band discussions need to be summarized on list.

Yes. I will restate this:

In my opinion we should not start any work on this tez-hive until these questions are answered to the satisfaction of the hive developers.

In my opinion we should limit the amount of tez related optimizations to and trunk. Refactoring that cleans up code is good, but as you have pointed out there won't be a tez release until sometime this fall, and this branch will be open for an extended period of time. Thus code cleanups and other tez related refactoring do not need to be disruptive to trunk.

I have another relevant question, which I probably already know the answer to, but I will ask it anyway. Because tez is a YARN application, does this mean that Tez will be the first hive feature that will require YARN? (It seems like the answer is yes.)

On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates ga...@hortonworks.com wrote:

Answers to some of your questions inlined.

Alan.

On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:

There are some points I want to bring up. First, I am on the PMC. Here is something I find relevant: http://www.apache.org/foundation/how-it-works.html -- The role of the PMC from a Foundation perspective is oversight. The main role of the PMC is not code and not coding - but to ensure that all legal issues are addressed, that procedure is followed, and that each and every release is the product of the community as a whole. That is key to our litigation protection mechanisms. Secondly the role of the PMC is to further the long term development and health of the community as a whole, and to ensure that balanced and wide scale peer review and collaboration does happen. Within the ASF we worry about any community which centers around a few individuals who are working virtually uncontested.
We believe that this is detrimental to quality, stability, and robustness of both code and long term social structures. https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different - All other decisions happen on the dev list, discussions on the private list are kept to a minimum. If it didn't happen on the dev list, it didn't happen - which leads to: a) Elections of committers and PMC members are published on the dev list once finalized. b) Out-of-band discussions (IRC etc.) are summarized on the dev list as soon as they have impact on the project, code or community. - https://issues.apache.org/jira/browse/HIVE-4660 ironically titled Let their be Tez has not be +1 ed by any committer. It was never discussed on the dev or the user list (as far as I can tell). As all JIRA creations and updates are sent to dev@hive, creating a JIRA is de facto posting to the list. As a PMC member I feel we need more discussion on Tez on the dev list along with a wiki-fied design document. Topics of discussion should include: I talked with Gunther and he's working on posting a design doc on the wiki. He has a PDF on the JIRA but he doesn't have write permissions yet on the wiki. 1) What is tez? In Hadoop 2.0, YARN opens up the ability to have multiple execution frameworks in Hadoop. Hadoop apps are no longer tied to MapReduce as the only execution option. Tez is an effort to build an execution engine that is optimized for relational data processing, such as Hive and Pig. The biggest change here is to move away from only Map and Reduce as processing options and to allow alternate combinations of processing, such as map - reduce - reduce or tasks that take multiple inputs or shuffles that avoid sorting when it isn't needed. 
For a good intro to Tez, see Arun's presentation on it at the recent Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212) 2) How is tez different from oozie, http://code.google.com/p/hop/, http://cs.brown.edu/~backman/cmr.html , and other DAG and or streaming map reduce tools/frameworks? Why should we use this and not those? Oozie is a completely different thing. Oozie is a workflow engine and a scheduler. It's core competencies are the ability to coordinate workflows of disparate job types (MR, Pig, Hive, etc.) and to schedule them. It is not intended as an execution engine for apps such as Pig and Hive. I am not familiar with these other engines, but the short answer is that Tez is built to work on YARN, which works well for Hive since it is tied to Hadoop. 3) When can we expect the first tez release? I don't know, but I hope sometime this fall. 4) How much
Re: Tez branch and tez based patches
On Jul 17, 2013, at 1:41 PM, Edward Capriolo wrote:

In my opinion we should limit the amount of tez related optimizations to and trunk. Refactoring that cleans up code is good, but as you have pointed out there won't be a tez release until sometime this fall, and this branch will be open for an extended period of time. Thus code cleanups and other tez related refactoring do not need to be disruptive to trunk.

I agree with this, though I suspect people will end up arguing about the meaning of code cleanup and disruptive. In my discussions with Gunther he said he was doing code cleanup and it was not disruptive. You obviously disagreed. I've already suggested that any future patches that break lots of others should have their checkin preceded by a few hours' notice that the patch will break things, so others can say something if they are about to check in too. I'd also be interested to hear from Gunther how much more general cleanup he feels is necessary on trunk.

I have another relevant question, which I probably already know the answer to, but I will ask it anyway. Because tez is a YARN application, does this mean that Tez will be the first hive feature that will require YARN? (It seems like the answer is yes.)

Yes, it will only work in the Hadoop 2.x world. So obviously all this work needs to be done in a way that still allows Hive to use the MR execution engine in the Hadoop 1.x world.

Alan.
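The constraint in Alan's last answer (Tez needs YARN, while MR must keep working where YARN is absent) amounts to a small selection rule. The sketch below is a hedged illustration only: `EngineChoiceSketch`, `Engine`, and `choose` are made-up names, the `yarnAvailable` flag is an assumed input rather than a real Hive or Hadoop check, and falling back silently to MR is just one possible policy.

```java
// Hypothetical illustration; not Hive's actual engine-selection code.
class EngineChoiceSketch {

    enum Engine { MR, TEZ }

    /**
     * Returns the engine to use. Tez is only usable when YARN is available
     * (the Hadoop 2.x world); otherwise the MR engine is used instead.
     */
    static Engine choose(Engine requested, boolean yarnAvailable) {
        if (requested == Engine.TEZ && !yarnAvailable) {
            return Engine.MR; // assumed policy: fall back rather than fail
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(choose(Engine.TEZ, false)); // prints "MR"
        System.out.println(choose(Engine.TEZ, true));  // prints "TEZ"
        System.out.println(choose(Engine.MR, true));   // prints "MR"
    }
}
```

A real implementation might prefer to reject the request with an error instead of falling back, so that users notice when a query does not run on the engine they asked for.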
Re: Tez branch and tez based patches
On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo edlinuxg...@gmail.comwrote: In my opinion we should limit the amount of tez related optimizations to and trunk Refactoring that cleans up code is good, but as you have pointed out there wont be a tez release until sometime this fall, and this branch will be open for an extended period of time. Thus code cleanups and other tez related refactoring does not need to be disruptive to trunk. I agree Tez specific changes need not to go in trunk. But general refactoring and code cleanup needs to happen on trunk as and when someone is willing to work on those. We have to continually improve our code quality. Code maintainability and readability is a priority. Without that code quality suffers and discourages new contributors to contribute because code is unnecessarily complicated. SemanticAnalyzer is 11K line class. We need to simplify it. Patch like HIVE-4811 is a welcome change which tackled it. Exec package is all convoluted which mixes up runtime operators and drivers for runtime. Thats a welcome patch because it makes it much more easy to read and reason about that piece of code. HIVE-4825 is another example which improves modularity of code. For contributors who are exposed to Hive first time it will be easier for them to follow the code. Rather than disruptive to trunk, they are constructive for trunk and I am glad people are choosing to work on that. Tez or no Tez Hive is better off with these patches. Thanks, Ashutosh On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates ga...@hortonworks.com wrote: Answers to some of your questions inlined. Alan. On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote: There are some points I want to bring up. First, I am on the PMC. Here is something I find relevant: http://www.apache.org/foundation/how-it-works.html -- The role of the PMC from a Foundation perspective is oversight. 
The main role of the PMC is not code and not coding - but to ensure that all legal issues are addressed, that procedure is followed, and that each and every release is the product of the community as a whole. That is key to our litigation protection mechanisms. Secondly the role of the PMC is to further the long term development and health of the community as a whole, and to ensure that balanced and wide scale peer review and collaboration does happen. Within the ASF we worry about any community which centers around a few individuals who are working virtually uncontested. We believe that this is detrimental to quality, stability, and robustness of both code and long term social structures. https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different - All other decisions happen on the dev list, discussions on the private list are kept to a minimum. If it didn't happen on the dev list, it didn't happen - which leads to: a) Elections of committers and PMC members are published on the dev list once finalized. b) Out-of-band discussions (IRC etc.) are summarized on the dev list as soon as they have impact on the project, code or community. - https://issues.apache.org/jira/browse/HIVE-4660 ironically titled Let their be Tez has not be +1 ed by any committer. It was never discussed on the dev or the user list (as far as I can tell). As all JIRA creations and updates are sent to dev@hive, creating a JIRA is de facto posting to the list. As a PMC member I feel we need more discussion on Tez on the dev list along with a wiki-fied design document. Topics of discussion should include: I talked with Gunther and he's working on posting a design doc on the wiki. He has a PDF on the JIRA but he doesn't have write permissions yet on the wiki. 1) What is tez? In Hadoop 2.0, YARN opens up the ability to have multiple execution frameworks in Hadoop. Hadoop apps are no longer tied to MapReduce as the only execution option. 
Tez is an effort to build an execution engine that is optimized for relational data processing, such as Hive and Pig. The biggest change here is to move away from only Map and Reduce as processing options and to allow alternate combinations of processing, such as map - reduce - reduce or tasks that take multiple inputs or shuffles that avoid sorting when it isn't needed. For a good intro to Tez, see Arun's presentation on it at the recent Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212) 2) How is tez different from oozie, http://code.google.com/p/hop/, http://cs.brown.edu/~backman/cmr.html , and other DAG and or streaming map reduce tools/frameworks? Why should we use this and
Re: Tez branch and tez based patches
Ed,

I'm not sure I understand your argument, so I'm going to try to restate it. Please tell me if I understand it correctly. I think you're saying we should not embark on big projects in Hive because:

1) There were big projects in the past that were abandoned or are not currently making progress (such as Oracle integration, Hive StorageHandler)
2) There are other big projects going on (ORC, Vectorization)
3) There are lots of outstanding patches that need to be dealt with.

I would respond with two points to this.

First, I agree that the large outstanding patch count is very bad. It keeps people from getting involved in Hive. It deprives Hive of fixes and improvements it would otherwise have. Several of the committers are working to address this by checking in people's patches, but they are unable to keep up. The best solution is to encourage other committers to check in patches as well and to find willing and able contributors and mentor them to committership as quickly as possible.

Second, the way Apache works is that contributors scratch the itch that bothers them. So to argue "We shouldn't do X because we never finished Y" or "We shouldn't do X because we're doing Y" (where X and Y are independent) is not valid in Apache projects. It's fine to argue that Tez hasn't been adequately explained (I think you hinted at this in previous emails) and ask for clarifications on what it is and what the planned changes are. If after a full explanation you think it's a bad idea it's fine to argue Tez is the wrong direction for Hive and try to convince the rest of the community. But assuming the community accepts that Tez is a reasonable direction and there are volunteers who want to do the work, then you can't argue they should work on something else instead.

Alan.

On Jul 15, 2013, at 6:51 PM, Edward Capriolo wrote:

The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what votes are needed for what.
I don't see anything there about needing 3 +1s for a branch. Branching would seem to fall under code change, which requires one vote and a minimum length of 1 day. You could argue that all you need is one +1 to create a branch, but this is more then a branch. If you are talking about something that is: 1) going to cause major re-factoring of critical pieces of hive like ExecDriver and MapRedTask 2) going to be very disruptive to the efforts of other committers 3) something that may be a major architectural change Getting the project on board with the idea is a good idea. Now I want to point something out. Here are some recent initiatives in hive: 1) At one point there was a big initiative to support oracle after the initial work, there are patches in Jira no one seems to care about oracle support. 2) Another such decisions was this support windows one, there are probably 4 windows patches waiting reviews. 3) I still have no clue what the official hadoop1 hadoop2, hadoop 0.23 support prospective is, but every couple weeks we get another jira about something not working/testing on one of those versions, seems like several builds are broken. 4) Hive-storage handler, after the initial implementation no one cares to review any other storage handler implementation, 3 patches there or more, could not even find anyone willing to review the cassandra storage handler I spent months on. 5) OCR, Vectorization 6) Windowing: committed, numerous check-style violations. We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We are spread very thin, and embarking on another side project not involved with core hive seems like the wrong direction at the moment. On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates ga...@hortonworks.com wrote: On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote: I have started to see several re factoring patches around tez. https://issues.apache.org/jira/browse/HIVE-4843 This is the only mention on the hive list I can find with tez: Makes sense. 
I will create the branch soon. Thanks, Ashutosh On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner ghagleit...@hortonworks.com wrote: Hi, I am starting to work on integrating Tez into Hive (see HIVE-4660, design doc has already been uploaded - any feedback will be much appreciated). This will be a fair amount of work that will take time to stabilize/test. I'd like to propose creating a branch in order to be able to do this incrementally and collaboratively. In order to progress rapidly with this, I would also like to go commit-then-review. Thanks, Gunther. These refactor-ings are largely destructive to a number of bugs and language improvements in hive.The language improvements and bug fixes that have been sitting in Jira for quite some time now marked patch-available and are waiting for review. There are a few things I want to point out: 1) Normally we create design
Re: Tez branch and tez based patches
Alan,

I agree with all your statements, with the exception of one.

Second, the way Apache works is that contributors scratch the itch that bothers them. So to argue "We shouldn't do X because we never finished Y" or "We shouldn't do X because we're doing Y" (where X and Y are independent) is not valid in Apache projects.

I disagree, look at this: https://issues.apache.org/jira/browse/HIVE-3585 A contribution was immediately met with a -1. I personally have had issues closed as WONT FIX or LATER across a variety of apache projects because said committers decided the feature was out of scope, or whatever.

Arguing that if one contributor wants to scratch an itch we should allow it in the project is not practical, because we have to be able to maintain hive after the itch scratcher finds a new itch and moves on. Hive is not project hosting for every cool idea.

This was why I mentioned things like windows support. I do not think there was ever a point where the committers/PMC agreed that windows support was something we all wanted to work towards. I can not pin down how the initiative started and why. Now whoever started that ball rolling has moved on. I do not own a windows computer, and we have no apache infrastructure to test hive on windows. Jira issues stay open, and those of us in it for the long haul end up holding the ball, supporting things we never explicitly wanted.

As this relates to Tez: tez is in the incubator. Hive is release quality software. I am not convinced Tez is the direction we should go in. I am scared of it going the path of windows support or oracle support, because someone is scratching an itch and we (the committers) do not have enough information about the changes involved, the timeline, or what types of use cases will benefit from this feature.

Tez refactorings are getting filed as 'MAJOR' 'BUGS' and getting committed to trunk, when they are 'IMPROVEMENTS' that are 'LOW' priority.
I do not understand why there is such a priority to merge code into trunk, when we can all see this branch is going to be opened for a long time and be rather involved. Even then I would not mind if it was not largely unfair to everyone else that now needs to rebase. On Tue, Jul 16, 2013 at 2:24 PM, Alan Gates ga...@hortonworks.com wrote: Ed, I'm not sure I understand your argument, so I'm going to try to restate it. Please tell me if I understand it correctly. I think you're saying we should not embark on big projects in Hive because: 1) There were big projects in the past that were abandoned or are not currently making progress (such as Oracle integration, Hive StorageHandler) 2) There are other big projects going on (ORC, Vectorization) 3) There are lots of out standing patches that need to be dealt with. I would respond with two points to this. First, I agree that the large out standing patch count is very bad. It keeps people from getting involved in Hive. It deprives Hive of fixes and improvements it would otherwise have. Several of the committers are working to address this by checking in peoples' patches, but they are unable to keep up. The best solution is to encourage other committers to check in patches as well and to find willing and able contributors and mentor them to committership as quickly as possible. Second, the way Apache works is that contributors scratch the itch that bothers them. So to argue We shouldn't do X because we never finished Y or We shouldn't do X because we're doing Y (where X and Y are independent) is not valid in Apache projects. It's fine to argue that Tez hasn't been adequately explained (I think you hinted at this in previous emails) and ask for clarifications on what it is and what the planned changes are. If after a full explanation you think it's a bad idea it's fine to argue Tez is the wrong direction for Hive and try to convince the rest of the community. 
But assuming the community accepts that Tez is a reasonable direction and there are volunteers who want to do the work, then you can't argue they should work on something else instead. Alan. On Jul 15, 2013, at 6:51 PM, Edward Capriolo wrote: The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what votes are needed for what. I don't see anything there about needing 3 +1s for a branch. Branching would seem to fall under code change, which requires one vote and a minimum length of 1 day. You could argue that all you need is one +1 to create a branch, but this is more then a branch. If you are talking about something that is: 1) going to cause major re-factoring of critical pieces of hive like ExecDriver and MapRedTask 2) going to be very disruptive to the efforts of other committers 3) something that may be a major architectural change Getting the project on board with the idea is a good idea. Now I want to point something out. Here are some recent initiatives in hive: 1) At one point there
Re: Tez branch and tez based patches
There are some points I want to bring up. First, I am on the PMC. Here is something I find relevant: http://www.apache.org/foundation/how-it-works.html -- The role of the PMC from a Foundation perspective is oversight. The main role of the PMC is not code and not coding - but to ensure that all legal issues are addressed, that procedure is followed, and that each and every release is the product of the community as a whole. That is key to our litigation protection mechanisms. Secondly the role of the PMC is to further the long term development and health of the community as a whole, and to ensure that balanced and wide scale peer review and collaboration does happen. Within the ASF we worry about any community which centers around a few individuals who are working virtually uncontested. We believe that this is detrimental to quality, stability, and robustness of both code and long term social structures. https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different - All other decisions happen on the dev list, discussions on the private list are kept to a minimum. If it didn't happen on the dev list, it didn't happen - which leads to: a) Elections of committers and PMC members are published on the dev list once finalized. b) Out-of-band discussions (IRC etc.) are summarized on the dev list as soon as they have impact on the project, code or community. - https://issues.apache.org/jira/browse/HIVE-4660 ironically titled Let their be Tez has not be +1 ed by any committer. It was never discussed on the dev or the user list (as far as I can tell). As a PMC member I feel we need more discussion on Tez on the dev list along with a wiki-fied design document. Topics of discussion should include: 1) What is tez? 2) How is tez different from oozie, http://code.google.com/p/hop/, http://cs.brown.edu/~backman/cmr.html , and other DAG and or streaming map reduce tools/frameworks? Why should we use this and not those? 3) When can we expect the first tez release? 
4) How much effort is involved in integrating hive and tez?
5) Who is ready to commit to this effort?
6) Can we expect this work to be done in one hive release?
In my opinion we should not start any work on this tez-hive integration until these questions are answered to the satisfaction of the hive developers.
On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what votes are needed for what. I don't see anything there about needing 3 +1s for a branch. Branching would seem to fall under code change, which requires one vote and a minimum length of 1 day.
You could argue that all you need is one +1 to create a branch, but this is more than a branch. If you are talking about something that is:
1) going to cause major re-factoring of critical pieces of hive like ExecDriver and MapRedTask
2) going to be very disruptive to the efforts of other committers
3) something that may be a major architectural change
then getting the project on board with the idea is a good idea.
Now I want to point something out. Here are some recent initiatives in hive:
1) At one point there was a big initiative to support oracle. After the initial work, there are patches in Jira and no one seems to care about oracle support.
2) Another such decision was the windows support one; there are probably 4 windows patches waiting for review.
3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23 support perspective is, but every couple of weeks we get another jira about something not working/testing on one of those versions. It seems like several builds are broken.
4) Hive storage handlers: after the initial implementation no one cares to review any other storage handler implementation. There are 3 patches or more there, and I could not even find anyone willing to review the cassandra storage handler I spent months on.
5) ORC, Vectorization
6) Windowing: committed, numerous check-style violations.
We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We are spread very thin, and embarking on another side project not involved with core hive seems like the wrong direction at the moment.
On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates ga...@hortonworks.com wrote:
On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
I have started to see several refactoring patches around tez. https://issues.apache.org/jira/browse/HIVE-4843
This is the only mention on the hive list I can find with tez:
Makes sense. I will create the branch soon. Thanks, Ashutosh
On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner ghagleit...@hortonworks.com wrote:
Hi, I am starting to work on integrating Tez into Hive (see HIVE-4660, design doc has already been uploaded - any feedback
Re: Tez branch and tez based patches
On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
I have started to see several refactoring patches around tez. https://issues.apache.org/jira/browse/HIVE-4843
This is the only mention on the hive list I can find with tez:
Makes sense. I will create the branch soon. Thanks, Ashutosh
On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner ghagleit...@hortonworks.com wrote:
Hi, I am starting to work on integrating Tez into Hive (see HIVE-4660, design doc has already been uploaded - any feedback will be much appreciated). This will be a fair amount of work that will take time to stabilize/test. I'd like to propose creating a branch in order to be able to do this incrementally and collaboratively. In order to progress rapidly with this, I would also like to go commit-then-review. Thanks, Gunther.
These refactorings are largely destructive to a number of bug fixes and language improvements in hive. These language improvements and bug fixes have been sitting in Jira for quite some time now, marked patch-available, and are waiting for review.
There are a few things I want to point out:
1) Normally we create design docs in our wiki (which it is not)
2) Normally when the change is significantly complex we get multiple committers to comment on it (which we did not)
On point 2, no one -1'd the branch, but this is really something that should have required a +1 from 3 committers.
The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what votes are needed for what. I don't see anything there about needing 3 +1s for a branch. Branching would seem to fall under code change, which requires one vote and a minimum length of 1 day.
I for one am not completely sold on Tez. http://incubator.apache.org/projects/tez.html: "directed-acyclic-graph of tasks for processing data". This description sounds like many things which have never become popular. One to think of is oozie: "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions."
I am sure I can find a number of libraries/frameworks that make this same claim. In general I do not feel like we have done our homework and pre-requisites to justify all this work. If we have done the homework, I am sure that it has not been communicated to and accepted by hive developers at large.
A request for better documentation on Tez and a project road map seems totally reasonable.
If we have a branch, why are we also committing on trunk? Scanning through the tez doc, I keep finding language like "minimal changes to the planner", yet there are ALREADY lots of large changes going on! Really none of the above would bother me except for the fact that these "minimal changes" are causing many patch-available, ready-for-review bug fixes and core hive features to need to be rebased. I am sure I have mentioned this before, but I have to spend 12+ hours to test a single patch on my laptop. A few days ago I was testing a new core hive feature. After all the tests passed and before I was able to commit, someone unleashed a tez patch on trunk which caused the thing I was testing for 12 hours to need to be rebased. I'm not cool with this. Next time that happens to me I will seriously consider reverting the patch. Bug fixes and new hive features are more important to me than integrating with incubator projects.
(With my Apache member hat on) Reverting patches that aren't breaking the build is considered very bad form in Apache. It does make sense to request that when people are going to commit a patch that will break many other patches they first give a few hours of notice, so people can say something if they're about to commit another patch and avoid your fate of needing to rerun the tests. The other thing is we need to get the automated build of patches working on Hive so committers are not forced to run all of the tests themselves. We are working on it, but we're not there yet.
Alan.
Re: Tez branch and tez based patches
The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what votes are needed for what. I don't see anything there about needing 3 +1s for a branch. Branching would seem to fall under code change, which requires one vote and a minimum length of 1 day.
You could argue that all you need is one +1 to create a branch, but this is more than a branch. If you are talking about something that is:
1) going to cause major re-factoring of critical pieces of hive like ExecDriver and MapRedTask
2) going to be very disruptive to the efforts of other committers
3) something that may be a major architectural change
then getting the project on board with the idea is a good idea.
Now I want to point something out. Here are some recent initiatives in hive:
1) At one point there was a big initiative to support oracle. After the initial work, there are patches in Jira and no one seems to care about oracle support.
2) Another such decision was the windows support one; there are probably 4 windows patches waiting for review.
3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23 support perspective is, but every couple of weeks we get another jira about something not working/testing on one of those versions. It seems like several builds are broken.
4) Hive storage handlers: after the initial implementation no one cares to review any other storage handler implementation. There are 3 patches or more there, and I could not even find anyone willing to review the cassandra storage handler I spent months on.
5) ORC, Vectorization
6) Windowing: committed, numerous check-style violations.
We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We are spread very thin, and embarking on another side project not involved with core hive seems like the wrong direction at the moment.
On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates ga...@hortonworks.com wrote:
On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
I have started to see several refactoring patches around tez.
https://issues.apache.org/jira/browse/HIVE-4843
This is the only mention on the hive list I can find with tez:
Makes sense. I will create the branch soon. Thanks, Ashutosh
On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner ghagleit...@hortonworks.com wrote:
Hi, I am starting to work on integrating Tez into Hive (see HIVE-4660, design doc has already been uploaded - any feedback will be much appreciated). This will be a fair amount of work that will take time to stabilize/test. I'd like to propose creating a branch in order to be able to do this incrementally and collaboratively. In order to progress rapidly with this, I would also like to go commit-then-review. Thanks, Gunther.
These refactorings are largely destructive to a number of bug fixes and language improvements in hive. These language improvements and bug fixes have been sitting in Jira for quite some time now, marked patch-available, and are waiting for review.
There are a few things I want to point out:
1) Normally we create design docs in our wiki (which it is not)
2) Normally when the change is significantly complex we get multiple committers to comment on it (which we did not)
On point 2, no one -1'd the branch, but this is really something that should have required a +1 from 3 committers.
The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what votes are needed for what. I don't see anything there about needing 3 +1s for a branch. Branching would seem to fall under code change, which requires one vote and a minimum length of 1 day.
I for one am not completely sold on Tez. http://incubator.apache.org/projects/tez.html: "directed-acyclic-graph of tasks for processing data". This description sounds like many things which have never become popular. One to think of is oozie: "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions." I am sure I can find a number of libraries/frameworks that make this same claim.
In general I do not feel like we have done our homework and pre-requisites to justify all this work. If we have done the homework, I am sure that it has not been communicated to and accepted by hive developers at large.
A request for better documentation on Tez and a project road map seems totally reasonable.
If we have a branch, why are we also committing on trunk? Scanning through the tez doc, I keep finding language like "minimal changes to the planner", yet there are ALREADY lots of large changes going on! Really none of the above would bother me except for the fact that these "minimal changes" are causing many patch-available, ready-for-review bug fixes and core hive features to need to be rebased. I am sure I have mentioned this before, but I have to spend 12+ hours to test a single patch on my laptop. A few days ago I was testing a new core hive feature. After all the