Re: Tez branch and tez based patches

2013-08-16 Thread Edward Capriolo
I still am not sure we are doing this the ideal way. I am not a believer in
a commit-then-review branch.

This issue is an example.

https://issues.apache.org/jira/browse/HIVE-5108

I ask myself these questions:
Does this currently work? Are there tests? If so, which ones are broken? How
does the patch fix them without tests to validate?

Having a commit-then-review branch just seems subversive to our normal
process, and a quick short cut to not have to be bothered by writing tests
or involving anyone else.




Re: Tez branch and tez based patches

2013-08-16 Thread Edward Capriolo
Commit then review, and self commit, destroys the good things we get from
our normal system.

http://anna.gs/blog/2013/08/12/code-review-ftw/

I am most worried about silos of knowledge, lax testing policies, and
code quality. I have now seen these problems on several occasions when
something is happening in a branch (not calling out the tez branch in
particular).




Re: Tez branch and tez based patches

2013-08-05 Thread Alan Gates
Which talk are you referencing here?  AFAIK all the Hive code we've written is 
being pushed back into the Tez branch, so you should be able to see it there.

Alan.

On Jul 29, 2013, at 9:02 PM, Edward Capriolo wrote:

 At ~25:00
 
 There is a working prototype of hive which is using tez as the targeted
 runtime
 
 Can I get a look at that code? Is it on github?
 
 Edward
 
 

Re: Tez branch and tez based patches

2013-08-05 Thread Alan Gates

On Jul 29, 2013, at 9:53 PM, Edward Capriolo wrote:

 Also watched http://www.ustream.tv/recorded/36323173
 
 I definitely see the win in being able to stream inter-stage output.
 
 I see some cases where small intermediate results can be kept in memory.
 But I was somewhat under the impression that the map reduce spill settings
 kept stuff in memory; isn't that what spill settings are?

No.  MapReduce always writes shuffle data to local disk, and intermediate
results between MR jobs are always persisted to HDFS, as there's no other
option.  When we talk of being able to keep intermediate results in memory, we
mean getting rid of both of these disk writes/reads when appropriate (meaning
not always; there's a trade-off between speed and error handling to be made
here, see below for more details).

 
 There are a few bullet points that came up repeatedly that I do not follow:
 
 Something was said to the effect of "Container reuse makes X faster."
 Hadoop has JVM reuse. I am not following what the difference is here. Not
 everyone has a 10K node cluster.

Sharing JVMs across users is inherently insecure (we can't guarantee what code
the first user left behind that may interfere with later users).  As I
understand it, container re-use in Tez constrains the re-use to one user for
security reasons, but still avoids additional JVM start-up costs.  But this is
a question that the Tez guys could answer better on the Tez lists
(d...@tez.incubator.apache.org).

 
 "Joins in map reduce are hard." Really? I mean some of them are, I guess, but
 the typical join is very easy. Just shuffle by the join key. There was not
 really enough low-level detail here saying why joins are better in tez.

Join is not a natural operation in MapReduce.  MR gives you one input and one
output.  You end up having to bend the rules to have multiple inputs.  The
idea here is that Tez can provide operators that naturally work with joins and
other operations that don't fit the one input/one output model (e.g. unions,
etc.).
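
To make the "bend the rules" point concrete, here is a minimal sketch of the
usual workaround, a reduce-side join. It is plain Python rather than Hive or
Hadoop code, and every function name in it is hypothetical. Each input is
tagged in the map phase so that a single shuffle can carry both, and the
reducer has to separate them again before joining:

```python
from collections import defaultdict


def map_tagged(source_tag, records, key_field):
    """Map phase: emit (join_key, (tag, record)) so one shuffle can carry both inputs."""
    for record in records:
        yield record[key_field], (source_tag, record)


def reduce_join(tagged_values):
    """Reduce phase: split the values back out by tag, then emit the matching pairs."""
    left = [rec for tag, rec in tagged_values if tag == "L"]
    right = [rec for tag, rec in tagged_values if tag == "R"]
    for l_rec in left:
        for r_rec in right:
            yield {**l_rec, **r_rec}


def mapreduce_join(left_rows, right_rows, key_field):
    """Inner join of two inputs pushed through a single-input, single-output pipeline."""
    # "Shuffle": group every tagged record by its join key.
    groups = defaultdict(list)
    for source_tag, rows in (("L", left_rows), ("R", right_rows)):
        for key, value in map_tagged(source_tag, rows, key_field):
            groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(groups[key]))
    return joined
```

The tagging and untagging is the part that does not fit the one input/one
output model; an engine with native multi-input tasks would not need it.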

 
 "Choosing the number of maps and reduces is hard." Really? I do not find it
 that hard. I think there are times when it's not perfect, but I do not find
 it hard. The talk did not really offer anything technical here on how tez
 makes this better, other than that it could make it better.

Perhaps "manual" would be a better term here than "hard".  In our experience it
takes quite a bit of engineer trial and error to determine the optimal numbers.
This may be ok if you're going to invest the time once and then run the same
query every day for 6 months.  But obviously it doesn't work for the ad hoc
case.  Even in the batch case it's not optimal because every once in a while
an engineer has to go back and re-optimize the query to deal with changing data
sizes, data characteristics, etc.  We want the optimizer to handle this without
human intervention.
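
As a rough illustration of what "without human intervention" could look like,
the sketch below derives a reducer count from the estimated input size
instead of a hand-tuned constant. It is an assumption-laden example, not the
Hive or Tez implementation: the function name and both default values are
hypothetical.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Pick a reducer count from the estimated input size.

    Both defaults are illustrative assumptions, not values taken from this thread.
    """
    if input_bytes <= 0:
        return 1
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # Clamp to at least one reducer and at most the configured ceiling.
    return max(1, min(max_reducers, wanted))
```

Because the count is recomputed from the data size on every run, a query that
grows from 2.5 GB to 10 TB of input gets a different answer without anyone
re-tuning it by hand.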

 
 The presentations mentioned streaming data. How do two nodes stream data
 between tasks, and how is it reliable? If the sender or receiver dies, does
 the entire process have to start again?

If the sender or receiver dies then the query has to be restarted from some
previous point where data was persisted to disk.  The idea here is that speed
vs error recovery trade-offs should be made by the optimizer.  If the optimizer
estimates that a query will complete in 5 seconds it can stream everything, and
if a node fails it just re-runs the whole query.  If it estimates that a
particular phase of a query will run for an hour it can choose to persist the
results to HDFS so that in the event of a failure downstream the long phase
need not be re-run.  Again, we want this to be done automatically by the system
so the user doesn't need to control this level of detail.
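
The trade-off described above can be sketched in a few lines. This is a toy
model under stated assumptions, not Tez or Hive code: the function names, the
phase names, and the 600-second threshold are all hypothetical.

```python
def plan_outputs(phases, persist_threshold_secs=600.0):
    """Decide, per phase, whether to stream its output or persist it.

    `phases` is a list of (name, estimated_seconds) in execution order.
    The threshold is an arbitrary illustrative value.
    """
    return [
        (name, "persist" if estimate >= persist_threshold_secs else "stream")
        for name, estimate in phases
    ]


def restart_point(plan, failed_index):
    """Index of the first phase to re-run when phase `failed_index` fails.

    Walk upstream to the most recent persisted output and resume just after
    it; if nothing upstream was persisted, the whole query is re-run.
    """
    for i in range(failed_index - 1, -1, -1):
        if plan[i][1] == "persist":
            return i + 1
    return 0
```

With phases estimated at 5 s, 3600 s, 20 s and 10 s, only the hour-long phase
is persisted, so a failure in the last phase restarts from the third phase
rather than from the beginning.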

 
 Again, one of the talks implied there is a prototype out there that launches
 hive jobs into tez. I would like to see that; it might answer more
 questions than a PowerPoint, and I could profile some common queries.

As mentioned in a previous email afaik Gunther's pushed all these changes to 
the Tez branch in Hive.

Alan.

 
 Random late night thoughts over,
 Ed
 
 
 
 
 
 

Re: Tez branch and tez based patches

2013-07-29 Thread Edward Capriolo
At ~25:00

There is a working prototype of hive which is using tez as the targeted
runtime

Can I get a look at that code? Is it on github?

Edward


On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates ga...@hortonworks.com wrote:

 Answers to some of your questions inlined.

 Alan.

 On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:

  There are some points I want to bring up. First, I am on the PMC. Here is
  something I find relevant:
 
  http://www.apache.org/foundation/how-it-works.html
 
  --
 
  The role of the PMC from a Foundation perspective is oversight. The main
  role of the PMC is not code and not coding - but to ensure that all legal
  issues are addressed, that procedure is followed, and that each and every
  release is the product of the community as a whole. That is key to our
  litigation protection mechanisms.
 
  Secondly the role of the PMC is to further the long term development and
  health of the community as a whole, and to ensure that balanced and wide
  scale peer review and collaboration does happen. Within the ASF we worry
  about any community which centers around a few individuals who are
 working
  virtually uncontested. We believe that this is detrimental to quality,
  stability, and robustness of both code and long term social structures.
 
  
 
 
 https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
 
  -
 
  All other decisions happen on the dev list, discussions on the private
 list
  are kept to a minimum.
 
  If it didn't happen on the dev list, it didn't happen - which leads to:
 
  a) Elections of committers and PMC members are published on the dev list
  once finalized.
 
  b) Out-of-band discussions (IRC etc.) are summarized on the dev list as
  soon as they have impact on the project, code or community.
  -
 
  https://issues.apache.org/jira/browse/HIVE-4660, ironically titled "Let
  their be Tez", has not been +1'ed by any committer. It was never discussed
  on the dev or the user list (as far as I can tell).

 As all JIRA creations and updates are sent to dev@hive, creating a JIRA
 is de facto posting to the list.

 
  As a PMC member I feel we need more discussion on Tez on the dev list
 along
  with a wiki-fied design document. Topics of discussion should include:

 I talked with Gunther and he's working on posting a design doc on the
 wiki.  He has a PDF on the JIRA but he doesn't have write permissions yet
 on the wiki.

 
  1) What is tez?
 In Hadoop 2.0, YARN opens up the ability to have multiple execution
 frameworks in Hadoop.  Hadoop apps are no longer tied to MapReduce as the
 only execution option.  Tez is an effort to build an execution engine that
 is optimized for relational data processing, such as Hive and Pig.

 The biggest change here is to move away from only Map and Reduce as
 processing options and to allow alternate combinations of processing, such
 as map-reduce-reduce, tasks that take multiple inputs, or shuffles that
 avoid sorting when it isn't needed.
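
One way to picture these "alternate combinations" is as a directed acyclic
graph of stages rather than a fixed map-then-reduce pair. The sketch below is
only an illustration in plain Python (it is not the Tez API; the function and
stage names are hypothetical) and uses the standard library's
`graphlib.TopologicalSorter` (Python 3.9+) to find a valid execution order:

```python
from graphlib import TopologicalSorter


def execution_order(edges):
    """Return one valid run order for a DAG given as (upstream, downstream) pairs."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # A downstream stage depends on its upstream stage.
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


# A map-reduce-reduce chain: a single job with no second map stage in between.
mrr = [("map", "reduce1"), ("reduce1", "reduce2")]

# A task with multiple inputs: two scans feeding one join stage.
two_input_join = [("scan_a", "join"), ("scan_b", "join")]
```

Calling `execution_order(mrr)` yields the three stages in chain order, and for
`two_input_join` both scans are ordered before the join.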

 For a good intro to Tez, see Arun's presentation on it at the recent
 Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
 http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)
 
  2) How is tez different from oozie, http://code.google.com/p/hop/,
  http://cs.brown.edu/~backman/cmr.html, and other DAG and/or streaming
  map reduce tools/frameworks? Why should we use this and not those?

 Oozie is a completely different thing.  Oozie is a workflow engine and a
 scheduler.  Its core competencies are the ability to coordinate workflows
 of disparate job types (MR, Pig, Hive, etc.) and to schedule them.  It is
 not intended as an execution engine for apps such as Pig and Hive.

 I am not familiar with these other engines, but the short answer is that
 Tez is built to work on YARN, which works well for Hive since it is tied to
 Hadoop.
 
  3) When can we expect the first tez release?
 I don't know, but I hope sometime this fall.

 
  4) How much effort is involved in integrating hive and tez?
 Covered in the design doc.

 
  5) Who is ready to commit to this effort?
 I'll let people speak for themselves on that one.

 
  6) can we expect this work to be done in one hive release?
 Unlikely.  Initial integration will be done in one release, but as Tez is
 a new project I expect it will be adding features in the future that Hive
 will want to take advantage of.

 
  In my opinion we should not start any work on this tez-hive until these
  questions are answered to the satisfaction of the hive developers.

 Can we change this to "not commit patches"?  We can't tell willing people
 not to work on it.
 
 
 
 
 
 
 
 
  On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
 
 
  The Hive bylaws,
  https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out what
  votes are needed for 

Re: Tez branch and tez based patches

2013-07-29 Thread Edward Capriolo
Also watched http://www.ustream.tv/recorded/36323173

I definitely see the win in being able to stream inter-stage output.

I see some cases where small intermediate results can be kept in memory.
But I was somewhat under the impression that the map reduce spill settings
kept stuff in memory; isn't that what spill settings are?

There are a few bullet points that came up repeatedly that I do not follow:

Something was said to the effect of "Container reuse makes X faster."
Hadoop has JVM reuse. I am not following what the difference is here. Not
everyone has a 10K node cluster.

"Joins in map reduce are hard." Really? I mean some of them are, I guess, but
the typical join is very easy. Just shuffle by the join key. There was not
really enough low-level detail here saying why joins are better in tez.

"Choosing the number of maps and reduces is hard." Really? I do not find it
that hard. I think there are times when it's not perfect, but I do not find
it hard. The talk did not really offer anything technical here on how tez
makes this better, other than that it could make it better.

The presentations mentioned streaming data. How do two nodes stream data
between tasks, and how is it reliable? If the sender or receiver dies, does
the entire process have to start again?

Again, one of the talks implied there is a prototype out there that launches
hive jobs into tez. I would like to see that; it might answer more
questions than a PowerPoint, and I could profile some common queries.

Random late night thoughts over,
Ed







Re: Tez branch and tez based patches

2013-07-22 Thread Gunther Hagleitner
I have finally gotten access to wiki and added the design doc:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez

I've also added links to it from the jira and in general overhauled the
design. Please let me know if you feel there's still stuff missing from the
document.

 Possibly we should be thinking on how to build hive in such a way
 that many different frameworks could plug in.

I believe that the proposed design and refactoring puts you on that path.
I'm not introducing layer upon layer of abstraction without a specific use
case in mind, but at a high level you would go through similar steps:

Exec layer:
- Define your own Task classes
- If you can reuse the operator pipeline define your own replacement for
ExecMapper/ExecReducer (glue code to drive records through the pipeline)
- Operators: You might have to add specific operators for your framework

Planning:
- Define your own work classes (or reuse existing ones). These abstractly
encapsulate all input/meta info necessary to execute.
- Define your own *Compiler to translate either the logical plan or
physical plan to a graph of Tasks. This might include specific additional
optimizations.
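
The steps above can be pictured as a small plug-in contract. The sketch below
is a hedged illustration in plain Python, not Hive's actual class hierarchy:
`Task`, `TaskCompiler`, the two example compilers, and the `COMPILERS`
registry are all hypothetical names, and the "logical plan" is reduced to a
list of operator names.

```python
from abc import ABC, abstractmethod


class Task(ABC):
    """Exec layer: a framework-specific unit of execution."""

    @abstractmethod
    def execute(self, work):
        ...


class TaskCompiler(ABC):
    """Planning layer: translate a plan into an ordered list of (task, work) pairs."""

    @abstractmethod
    def compile(self, logical_plan):
        ...


class OperatorPipelineTask(Task):
    """Glue code standing in for driving records through an operator pipeline."""

    def execute(self, work):
        return "+".join(work["operators"])


class StagedCompiler(TaskCompiler):
    """One task per operator, as a stand-in for a stage-per-job engine."""

    def compile(self, logical_plan):
        return [(OperatorPipelineTask(), {"operators": [op]}) for op in logical_plan]


class FusedCompiler(TaskCompiler):
    """All operators in a single task, as a stand-in for a DAG-style engine."""

    def compile(self, logical_plan):
        return [(OperatorPipelineTask(), {"operators": list(logical_plan)})]


# Hypothetical registry: adding a framework means registering another compiler.
COMPILERS = {"staged": StagedCompiler, "fused": FusedCompiler}


def run(engine, logical_plan):
    compiler = COMPILERS[engine]()
    return [task.execute(work) for task, work in compiler.compile(logical_plan)]
```

The same plan run through `run("staged", ...)` and `run("fused", ...)` produces
different task graphs while the operator code stays shared, which is the
property a pluggable design would be after.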

Devil's in the details no doubt.

Thanks,
Gunther.






On Sat, Jul 20, 2013 at 8:10 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 I agree we are getting into a grey area with the term "disruptive". For
 reference (I have not been doing this all the time, bad on me), we are
 supposed to +1 and wait a day.

  I am not familiar with these other engines, but the short answer is that
  Tez is built to work on YARN, which works well for Hive since it is tied
  to Hadoop

 I understand what you are saying here; YARN support is a plus. However, the
 rest of the answer is relevant to the discussion.

 There are already frameworks like Spark that are semi-popular:

 http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data

 There are also other frameworks like S4 (http://incubator.apache.org/s4/) or
 Storm.

 A big part of making a design decision is doing a competitive analysis.
 Usually asking yourself What else for this is already out there? or Can
 this be done other ways?
 I do want to be convinced we do not lock into tez too early with tunnel
 vision. Possibly we should be thinking on how to build hive in such a way
 that many different frameworks could plug in. In other words convincing
 that tez is the best choice, since many people are claiming an mrr type
 solution.

 I will watch the video you posted and study the material myself as well.


 On Wed, Jul 17, 2013 at 8:43 PM, Ashutosh Chauhan hashut...@apache.org
 wrote:

  On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo edlinuxg...@gmail.com
  wrote:
 
  
   In my opinion we should limit the amount of tez related optimizations
 to
   and trunk Refactoring that cleans up code is good, but as you have
  pointed
   out there wont be a tez release until sometime this fall, and this
 branch
   will be open for an extended period of time. Thus code cleanups and
 other
   tez related refactoring does not need to be disruptive to trunk.
 
 
  I agree Tez specific changes need not to go in trunk. But general
  refactoring and code cleanup needs to happen on trunk as and when someone
  is willing to work on those. We have to continually improve our code
  quality. Code maintainability and readability is a priority. Without that
  code quality suffers and discourages new contributors to contribute
 because
  code is unnecessarily complicated. SemanticAnalyzer is 11K line class. We
  need to simplify it. Patch like HIVE-4811 is a welcome change which
 tackled
  it. Exec package is all convoluted which mixes up runtime operators and
  drivers for runtime. Thats a welcome patch because it makes it much more
  easy to read and reason about that piece of code. HIVE-4825 is another
  example which improves modularity of code. For contributors who are
 exposed
  to Hive first time it will be easier for them to follow the code.
 
  Rather than disruptive to trunk, they are constructive for trunk and I am
  glad people are choosing to work on that. Tez or no Tez Hive is better
 off
  with these patches.
 
  Thanks,
  Ashutosh
 
 
 
On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates ga...@hortonworks.com
   wrote:
  
Answers to some of your questions inlined.
   
Alan.
   
On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
   
 There are some points I want to bring up. First, I am on the PMC.
  Here
   is
 something I find relevant:

 http://www.apache.org/foundation/how-it-works.html

 --

 The role of the PMC from a Foundation perspective is oversight. The
   main
 role of the PMC is not code and not coding - but to ensure that all
   legal
 issues are addressed, that procedure is followed, and that each and
   every
 release is the product of the community as a whole. That is key to
  our
 

Re: Tez branch and tez based patches

2013-07-20 Thread Edward Capriolo
I agree we are getting into a grey area with the term "disruptive". For
reference (I have not been doing this all the time, bad on me), we are
supposed to +1 and wait a day.

 I am not familiar with these other engines, but the short answer is that
 Tez is built to work on YARN, which works well for Hive since it is tied
 to Hadoop

I understand what you are saying here; YARN support is a plus. However, the
rest of the answer is relevant to the discussion.

There are already frameworks like Spark that are semi-popular:
http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data
There are also other frameworks like S4 (http://incubator.apache.org/s4/) or
Storm.

A big part of making a design decision is doing a competitive analysis,
usually by asking yourself "What else for this is already out there?" or "Can
this be done other ways?"
I do want to be convinced we do not lock into Tez too early with tunnel
vision. Possibly we should be thinking on how to build hive in such a way
that many different frameworks could plug in. In other words, I want to be
convinced that Tez is the best choice, since many people are claiming an
MRR-type solution.

I will watch the video you posted and study the material myself as well.


On Wed, Jul 17, 2013 at 8:43 PM, Ashutosh Chauhan hashut...@apache.org wrote:

 On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 
  In my opinion we should limit the amount of tez related optimizations to
  and trunk Refactoring that cleans up code is good, but as you have
 pointed
  out there wont be a tez release until sometime this fall, and this branch
  will be open for an extended period of time. Thus code cleanups and other
  tez related refactoring does not need to be disruptive to trunk.


 I agree Tez specific changes need not to go in trunk. But general
 refactoring and code cleanup needs to happen on trunk as and when someone
 is willing to work on those. We have to continually improve our code
 quality. Code maintainability and readability is a priority. Without that
 code quality suffers and discourages new contributors to contribute because
 code is unnecessarily complicated. SemanticAnalyzer is 11K line class. We
 need to simplify it. Patch like HIVE-4811 is a welcome change which tackled
 it. Exec package is all convoluted which mixes up runtime operators and
 drivers for runtime. Thats a welcome patch because it makes it much more
 easy to read and reason about that piece of code. HIVE-4825 is another
 example which improves modularity of code. For contributors who are exposed
 to Hive first time it will be easier for them to follow the code.

 Rather than disruptive to trunk, they are constructive for trunk and I am
 glad people are choosing to work on that. Tez or no Tez Hive is better off
 with these patches.

 Thanks,
 Ashutosh



   On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates ga...@hortonworks.com
  wrote:
 
   Answers to some of your questions inlined.
  
   Alan.
  
   On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
  
There are some points I want to bring up. First, I am on the PMC.
 Here
  is
something I find relevant:
   
http://www.apache.org/foundation/how-it-works.html
   
--
   
The role of the PMC from a Foundation perspective is oversight. The
  main
role of the PMC is not code and not coding - but to ensure that all
  legal
issues are addressed, that procedure is followed, and that each and
  every
release is the product of the community as a whole. That is key to
 our
litigation protection mechanisms.
   
Secondly the role of the PMC is to further the long term development
  and
health of the community as a whole, and to ensure that balanced and
  wide
scale peer review and collaboration does happen. Within the ASF we
  worry
about any community which centers around a few individuals who are
   working
virtually uncontested. We believe that this is detrimental to
 quality,
stability, and robustness of both code and long term social
 structures.
   

   
   
  
 
 https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
   
-
   
All other decisions happen on the dev list, discussions on the
 private
   list
are kept to a minimum.
   
If it didn't happen on the dev list, it didn't happen - which leads
  to:
   
a) Elections of committers and PMC members are published on the dev
  list
once finalized.
   
b) Out-of-band discussions (IRC etc.) are summarized on the dev list
 as
soon as they have impact on the project, code or community.
-
   
https://issues.apache.org/jira/browse/HIVE-4660 ironically titled
 Let
their be Tez has not be +1 ed by any committer. It was never
 discussed
   on
the dev or the user list (as far as I can tell).
  
   As all 

Re: Tez branch and tez based patches

2013-07-17 Thread Alan Gates
Answers to some of your questions inlined.

Alan.

On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:

 There are some points I want to bring up. First, I am on the PMC. Here is
 something I find relevant:
 
 http://www.apache.org/foundation/how-it-works.html
 
 --
 
 The role of the PMC from a Foundation perspective is oversight. The main
 role of the PMC is not code and not coding - but to ensure that all legal
 issues are addressed, that procedure is followed, and that each and every
 release is the product of the community as a whole. That is key to our
 litigation protection mechanisms.
 
 Secondly the role of the PMC is to further the long term development and
 health of the community as a whole, and to ensure that balanced and wide
 scale peer review and collaboration does happen. Within the ASF we worry
 about any community which centers around a few individuals who are working
 virtually uncontested. We believe that this is detrimental to quality,
 stability, and robustness of both code and long term social structures.
 
 
 
 https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
 
 -
 
 All other decisions happen on the dev list, discussions on the private list
 are kept to a minimum.
 
 If it didn't happen on the dev list, it didn't happen - which leads to:
 
 a) Elections of committers and PMC members are published on the dev list
 once finalized.
 
 b) Out-of-band discussions (IRC etc.) are summarized on the dev list as
 soon as they have impact on the project, code or community.
 -
 
 https://issues.apache.org/jira/browse/HIVE-4660 ironically titled Let
 their be Tez has not be +1 ed by any committer. It was never discussed on
 the dev or the user list (as far as I can tell).

As all JIRA creations and updates are sent to dev@hive, creating a JIRA is de 
facto posting to the list.  

 
 As a PMC member I feel we need more discussion on Tez on the dev list along
 with a wiki-fied design document. Topics of discussion should include:

I talked with Gunther and he's working on posting a design doc on the wiki.  He 
has a PDF on the JIRA but he doesn't have write permissions yet on the wiki.

 
 1) What is tez?
In Hadoop 2.0, YARN opens up the ability to have multiple execution frameworks 
in Hadoop.  Hadoop apps are no longer tied to MapReduce as the only execution 
option.  Tez is an effort to build an execution engine that is optimized for 
relational data processing, such as Hive and Pig.

The biggest change here is to move away from only Map and Reduce as processing
options and to allow alternate combinations of processing, such as map ->
reduce -> reduce, or tasks that take multiple inputs, or shuffles that avoid
sorting when it isn't needed.
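
As an illustration of what "alternate combinations of processing" means, the
sketch below models a query plan as a directed acyclic graph of stages and
asks for a valid execution order. It is a toy in Python (3.9+ for graphlib),
not the Tez API; the vertex names are made up, and the join is only an
assumed example of a task with multiple inputs.

```python
from graphlib import TopologicalSorter

# Each vertex maps to the set of vertices whose output it consumes.
# Two map stages feed one reduce, which feeds a second reduce directly
# (map -> reduce -> reduce), with no extra map stage in between.
dag = {
    "map_orders": set(),
    "map_customers": set(),
    "reduce_join": {"map_orders", "map_customers"},  # task with multiple inputs
    "reduce_aggregate": {"reduce_join"},             # second reduce in a row
}


def execution_order(graph):
    """Return one valid order in which the vertices can run."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dag)
print(order)
```

With plain MapReduce the same plan would have to be cut into separate
map/reduce jobs chained one after another; here it is a single graph.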

For a good intro to Tez, see Arun's presentation on it at the recent Hadoop
summit (video: http://www.youtube.com/watch?v=9ZLLzlsz7h8, slides:
http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)
 
 2) How is tez different from oozie, http://code.google.com/p/hop/,
 http://cs.brown.edu/~backman/cmr.html , and other DAG and or streaming map
 reduce tools/frameworks? Why should we use this and not those?

Oozie is a completely different thing.  Oozie is a workflow engine and a
scheduler.  Its core competencies are the ability to coordinate workflows of
disparate job types (MR, Pig, Hive, etc.) and to schedule them.  It is not
intended as an execution engine for apps such as Pig and Hive.

I am not familiar with these other engines, but the short answer is that Tez is 
built to work on YARN, which works well for Hive since it is tied to Hadoop.
 
 3) When can we expect the first tez release?
I don't know, but I hope sometime this fall.

 
 4) How much effort is involved in integrating hive and tez?
Covered in the design doc.

 
 5) Who is ready to commit to this effort?
I'll let people speak for themselves on that one.

 
 6) can we expect this work to be done in one hive release?
Unlikely.  Initial integration will be done in one release, but as Tez is a new 
project I expect it will be adding features in the future that Hive will want 
to take advantage of.

 
 In my opinion we should not start any work on this tez-hive until these
 questions are answered to the satisfaction of the hive developers.

Can we change this to "not commit patches"?  We can't tell willing people not
to work on it.
 
 
 
 
 
 
 
 
 On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 
 
 The Hive bylaws,
 https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out what
 votes are needed for what.  I don't see anything there about needing 3 +1s
 for a branch.  Branching would seem to fall under code change, which
 requires one vote and a minimum length of 1 day.
 
 You could argue that all you need is one +1 to create a branch, but this
 is more than a branch. If you are 

Re: Tez branch and tez based patches

2013-07-17 Thread Edward Capriolo
 As all JIRA creations and updates are sent to dev@hive, creating a JIRA
is de facto posting to the list.

Agreed (although several ticket names are non-descriptive). Possibly more
out-of-band discussions need to be summarized on list.

Yes. I will restate this:

In my opinion we should not start any work on this tez-hive until these
questions are answered to the satisfaction of the hive developers.

In my opinion we should limit the amount of Tez-related optimizations going
into trunk. Refactoring that cleans up code is good, but as you have pointed
out, there won't be a Tez release until sometime this fall, and this branch
will be open for an extended period of time. Thus code cleanups and other
Tez-related refactoring do not need to be disruptive to trunk.

I have another relevant question, which I already probably know the answer
to, but I will ask it anyway.

Because tez is a YARN application, does this mean that Tez will be the
first hive feature that will require YARN? (It seems like the answer is yes)



On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates ga...@hortonworks.com wrote:

 Answers to some of your questions inlined.

 Alan.

 On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:

  There are some points I want to bring up. First, I am on the PMC. Here is
  something I find relevant:
 
  http://www.apache.org/foundation/how-it-works.html
 
  --
 
  The role of the PMC from a Foundation perspective is oversight. The main
  role of the PMC is not code and not coding - but to ensure that all legal
  issues are addressed, that procedure is followed, and that each and every
  release is the product of the community as a whole. That is key to our
  litigation protection mechanisms.
 
  Secondly the role of the PMC is to further the long term development and
  health of the community as a whole, and to ensure that balanced and wide
  scale peer review and collaboration does happen. Within the ASF we worry
  about any community which centers around a few individuals who are
 working
  virtually uncontested. We believe that this is detrimental to quality,
  stability, and robustness of both code and long term social structures.
 
  
 
 
 https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
 
  -
 
  All other decisions happen on the dev list, discussions on the private
 list
  are kept to a minimum.
 
  If it didn't happen on the dev list, it didn't happen - which leads to:
 
  a) Elections of committers and PMC members are published on the dev list
  once finalized.
 
  b) Out-of-band discussions (IRC etc.) are summarized on the dev list as
  soon as they have impact on the project, code or community.
  -
 
  https://issues.apache.org/jira/browse/HIVE-4660 ironically titled Let
  their be Tez has not be +1 ed by any committer. It was never discussed
 on
  the dev or the user list (as far as I can tell).

 As all JIRA creations and updates are sent to dev@hive, creating a JIRA
 is de facto posting to the list.

 
  As a PMC member I feel we need more discussion on Tez on the dev list
 along
  with a wiki-fied design document. Topics of discussion should include:

 I talked with Gunther and he's working on posting a design doc on the
 wiki.  He has a PDF on the JIRA but he doesn't have write permissions yet
 on the wiki.

 
  1) What is tez?
 In Hadoop 2.0, YARN opens up the ability to have multiple execution
 frameworks in Hadoop.  Hadoop apps are no longer tied to MapReduce as the
 only execution option.  Tez is an effort to build an execution engine that
 is optimized for relational data processing, such as Hive and Pig.

 The biggest change here is to move away from only Map and Reduce as
 processing options and to allow alternate combinations of processing, such
 as map - reduce - reduce or tasks that take multiple inputs or shuffles
 that avoid sorting when it isn't needed.

 For a good intro to Tez, see Arun's presentation on it at the recent
 Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
 http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)
 
  2) How is tez different from oozie, http://code.google.com/p/hop/,
  http://cs.brown.edu/~backman/cmr.html , and other DAG and or streaming
 map
  reduce tools/frameworks? Why should we use this and not those?

 Oozie is a completely different thing.  Oozie is a workflow engine and a
 scheduler.  It's core competencies are the ability to coordinate workflows
 of disparate job types (MR, Pig, Hive, etc.) and to schedule them.  It is
 not intended as an execution engine for apps such as Pig and Hive.

 I am not familiar with these other engines, but the short answer is that
 Tez is built to work on YARN, which works well for Hive since it is tied to
 Hadoop.
 
  3) When can we expect the first tez release?
 I don't know, but I hope sometime this fall.

 
  4) How much 

Re: Tez branch and tez based patches

2013-07-17 Thread Alan Gates

On Jul 17, 2013, at 1:41 PM, Edward Capriolo wrote:

 
 In my opinion we should limit the amount of tez related optimizations to
 and trunk Refactoring that cleans up code is good, but as you have pointed
 out there wont be a tez release until sometime this fall, and this branch
 will be open for an extended period of time. Thus code cleanups and other
 tez related refactoring does not need to be disruptive to trunk.

I agree with this, though I suspect people will end up arguing about the
meaning of "code cleanup" and "disruptive".  In my discussions with Gunther he
said he was doing code cleanup and it was not disruptive.  You obviously
disagreed.  I've already suggested that any future patches that break lots of
others should have their checkin preceded by a few hours' notice that the patch
will break things, so others can say something if they are about to check in
too.  I'd also be interested to hear from Gunther how much more general cleanup
he feels is necessary on trunk.

 
 I have another relevant question, which I already probably know the answer
 to, but I will ask it anyway.
 
 Because tez is a YARN application, does this mean that Tez will be the
 first hive feature that will require YARN? (It seems like the answer is yes)

Yes, it will only work in the Hadoop 2.x world.  So obviously all this work 
needs to be done in a way that still allows Hive to use the MR execution engine 
in the Hadoop 1.x world.
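
The constraint above amounts to an engine choice with a fallback. A minimal
sketch follows, assuming a hypothetical pick_engine helper and the engine
names "tez" and "mr"; neither is an existing Hive setting, and versions such
as Hadoop 0.23 are deliberately left out of the toy check.

```python
def pick_engine(hadoop_version, requested="tez"):
    """Return the engine to run with; fall back to 'mr' without Hadoop 2.x."""
    major = int(hadoop_version.split(".")[0])
    if requested == "tez" and major >= 2:
        return "tez"
    return "mr"  # MapReduce stays available on every Hadoop version


print(pick_engine("1.2.1"))
print(pick_engine("2.0.5-alpha"))
```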

Alan.



Re: Tez branch and tez based patches

2013-07-17 Thread Ashutosh Chauhan
On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo edlinuxg...@gmail.com wrote:


 In my opinion we should limit the amount of tez related optimizations to
 and trunk Refactoring that cleans up code is good, but as you have pointed
 out there wont be a tez release until sometime this fall, and this branch
 will be open for an extended period of time. Thus code cleanups and other
 tez related refactoring does not need to be disruptive to trunk.


I agree Tez-specific changes need not go into trunk. But general
refactoring and code cleanup need to happen on trunk as and when someone
is willing to work on them. We have to continually improve our code
quality. Code maintainability and readability are a priority. Without that,
code quality suffers and discourages new contributors because the code is
unnecessarily complicated. SemanticAnalyzer is an 11K-line class. We
need to simplify it. A patch like HIVE-4811, which tackled it, is a welcome
change. The exec package is convoluted and mixes up runtime operators and
drivers for runtime. That's a welcome patch because it makes that piece of
code much easier to read and reason about. HIVE-4825 is another
example which improves modularity of code. For contributors who are exposed
to Hive for the first time it will be easier to follow the code.

Rather than disruptive to trunk, they are constructive for trunk and I am
glad people are choosing to work on that. Tez or no Tez, Hive is better off
with these patches.

Thanks,
Ashutosh



  On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates ga...@hortonworks.com
 wrote:

  Answers to some of your questions inlined.
 
  Alan.
 
  On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
 
   There are some points I want to bring up. First, I am on the PMC. Here
 is
   something I find relevant:
  
   http://www.apache.org/foundation/how-it-works.html
  
   --
  
   The role of the PMC from a Foundation perspective is oversight. The
 main
   role of the PMC is not code and not coding - but to ensure that all
 legal
   issues are addressed, that procedure is followed, and that each and
 every
   release is the product of the community as a whole. That is key to our
   litigation protection mechanisms.
  
   Secondly the role of the PMC is to further the long term development
 and
   health of the community as a whole, and to ensure that balanced and
 wide
   scale peer review and collaboration does happen. Within the ASF we
 worry
   about any community which centers around a few individuals who are
  working
   virtually uncontested. We believe that this is detrimental to quality,
   stability, and robustness of both code and long term social structures.
  
   
  
  
 
 https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
  
   -
  
   All other decisions happen on the dev list, discussions on the private
  list
   are kept to a minimum.
  
   If it didn't happen on the dev list, it didn't happen - which leads
 to:
  
   a) Elections of committers and PMC members are published on the dev
 list
   once finalized.
  
   b) Out-of-band discussions (IRC etc.) are summarized on the dev list as
   soon as they have impact on the project, code or community.
   -
  
   https://issues.apache.org/jira/browse/HIVE-4660 ironically titled Let
   their be Tez has not be +1 ed by any committer. It was never discussed
  on
   the dev or the user list (as far as I can tell).
 
  As all JIRA creations and updates are sent to dev@hive, creating a JIRA
  is de facto posting to the list.
 
  
   As a PMC member I feel we need more discussion on Tez on the dev list
  along
   with a wiki-fied design document. Topics of discussion should include:
 
  I talked with Gunther and he's working on posting a design doc on the
  wiki.  He has a PDF on the JIRA but he doesn't have write permissions yet
  on the wiki.
 
  
   1) What is tez?
  In Hadoop 2.0, YARN opens up the ability to have multiple execution
  frameworks in Hadoop.  Hadoop apps are no longer tied to MapReduce as the
  only execution option.  Tez is an effort to build an execution engine
 that
  is optimized for relational data processing, such as Hive and Pig.
 
  The biggest change here is to move away from only Map and Reduce as
  processing options and to allow alternate combinations of processing,
 such
  as map - reduce - reduce or tasks that take multiple inputs or shuffles
  that avoid sorting when it isn't needed.
 
  For a good intro to Tez, see Arun's presentation on it at the recent
  Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
  http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)
  
   2) How is tez different from oozie, http://code.google.com/p/hop/,
   http://cs.brown.edu/~backman/cmr.html , and other DAG and or streaming
  map
   reduce tools/frameworks? Why should we use this and 

Re: Tez branch and tez based patches

2013-07-16 Thread Alan Gates
Ed,

I'm not sure I understand your argument, so I'm going to try to restate it.  
Please tell me if I understand it correctly.

I think you're saying we should not embark on big projects in Hive because:
1) There were big projects in the past that were abandoned or are not currently
making progress (such as Oracle integration, Hive StorageHandler)
2) There are other big projects going on (ORC, Vectorization)
3) There are lots of outstanding patches that need to be dealt with.

I would respond with two points to this.

First, I agree that the large outstanding patch count is very bad.  It keeps
people from getting involved in Hive.  It deprives Hive of fixes and
improvements it would otherwise have.  Several of the committers are working to
address this by checking in people's patches, but they are unable to keep up.
The best solution is to encourage other committers to check in patches as well
and to find willing and able contributors and mentor them to committership as
quickly as possible.

Second, the way Apache works is that contributors scratch the itch that bothers
them. So to argue "We shouldn't do X because we never finished Y" or "We
shouldn't do X because we're doing Y" (where X and Y are independent) is not
valid in Apache projects.  It's fine to argue that Tez hasn't been adequately
explained (I think you hinted at this in previous emails) and ask for
clarifications on what it is and what the planned changes are.  If after a full
explanation you think it's a bad idea, it's fine to argue "Tez is the wrong
direction for Hive" and try to convince the rest of the community.  But assuming
the community accepts that Tez is a reasonable direction and there are
volunteers who want to do the work, then you can't argue they should work on
something else instead.

Alan.

On Jul 15, 2013, at 6:51 PM, Edward Capriolo wrote:

 The Hive bylaws,  https://cwiki.apache.org/confluence/display/Hive/Bylaws, 
 lay out what votes are needed for what.  I don't see anything there about
 needing 3 +1s for a branch.  Branching would seem to fall under code
 change, which requires one vote and a minimum length of 1 day.
 
 You could argue that all you need is one +1 to create a branch, but this is
 more than a branch. If you are talking about something that is:
 1) going to cause major refactoring of critical pieces of hive like
 ExecDriver and MapRedTask
 2) going to be very disruptive to the efforts of other committers
 3) something that may be a major architectural change
 
 Getting the project on board with the idea is a good idea.
 
 Now I want to point something out. Here are some recent initiatives in hive:
 
 1) At one point there was a big initiative to support Oracle. After the
 initial work, there are patches in Jira, and no one seems to care about
 Oracle support.
 2) Another such decision was the "support Windows" one; there are
 probably 4 Windows patches waiting for review.
 3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23
 support perspective is, but every couple of weeks we get another Jira about
 something not working/testing on one of those versions; it seems like several
 builds are broken.
 4) Hive storage handler: after the initial implementation no one cares to
 review any other storage handler implementation. There are 3 or more patches
 there, and I could not even find anyone willing to review the Cassandra
 storage handler I spent months on.
 5) ORC, Vectorization
 6) Windowing: committed, numerous check-style violations.
 
 We have !!!160+!!! PATCH_AVAILABLE Jira issues and few active committers. We
 are spread very thin, and embarking on another side project not involved
 with core hive seems like the wrong direction at the moment.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates ga...@hortonworks.com wrote:
 
 
 On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
 
 I have started to see several refactoring patches around tez.
 https://issues.apache.org/jira/browse/HIVE-4843
 
 This is the only mention on the hive list I can find with tez:
 Makes sense. I will create the branch soon.
 
 Thanks,
 Ashutosh
 
 
 On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner 
 ghagleit...@hortonworks.com wrote:
 
 Hi,
 
 I am starting to work on integrating Tez into Hive (see HIVE-4660,
 design
 doc has already been uploaded - any feedback will be much appreciated).
 This will be a fair amount of work that will take time to
 stabilize/test.
 I'd like to propose creating a branch in order to be able to do this
 incrementally and collaboratively. In order to progress rapidly with
 this,
 I would also like to go commit-then-review.
 
 Thanks,
 Gunther.
 
 
 These refactorings are largely destructive to a number of bug fixes and
 language improvements in hive: the language improvements and bug fixes
 that have been sitting in Jira for quite some time now, marked
 patch-available and waiting for review.
 
 There are a few things I want to point out:
 1) Normally we create design 

Re: Tez branch and tez based patches

2013-07-16 Thread Edward Capriolo
Alan,

I agree with all your statements, with the exception of one.

"Second, the way Apache works is that contributors scratch the itch that
bothers them. So to argue 'We shouldn't do X because we never finished
Y' or 'We shouldn't do X because we're doing Y' (where X and Y are
independent) is not valid in Apache projects."

I disagree, look at this:

https://issues.apache.org/jira/browse/HIVE-3585

A contribution was immediately met with a -1.

I personally have had issues closed as "WONT FIX" or "LATER" across a variety
of Apache projects because said committers decided the feature was out of
scope, or whatever.

Arguing that if one contributor wants to scratch an itch we should allow
it in the project is not practical, because we have to be able to maintain
hive after the itch scratcher finds a new itch and moves on. Hive is not
project hosting for every cool idea.

This was why I mentioned things like Windows support. I do not think
there was ever a point where the committers/PMC agreed that Windows
support was something we all wanted to work towards. I cannot pin down
how the initiative started and why. Now whoever started that ball rolling
has moved on. I do not own a Windows computer, and we have no Apache
infrastructure to test hive on Windows. Jira issues stay open, and those of us
in it for the long haul end up holding the ball and supporting things we
never explicitly wanted.

As this relates to Tez, Tez is in the incubator. Hive is release-quality
software. I am not convinced Tez is the direction we should go in. I am
scared of it going the path of Windows support or Oracle support,
because someone is scratching an itch and we (the committers) do not have
enough information about the changes involved, the timeline, or what types of
use cases will benefit from this feature.

Tez refactorings are getting filed as 'MAJOR' 'BUGS' and getting committed
to trunk, when they are 'IMPROVEMENTS' that are 'LOW' priority. I do not
understand why there is such a priority to merge code into trunk, when we
can all see this branch is going to be open for a long time and be rather
involved. Even then I would not mind if it was not largely unfair to
everyone else who now needs to rebase.








On Tue, Jul 16, 2013 at 2:24 PM, Alan Gates ga...@hortonworks.com wrote:

 Ed,

 I'm not sure I understand your argument, so I'm going to try to restate
 it.  Please tell me if I understand it correctly.

 I think you're saying we should not embark on big projects in Hive because:
 1) There were big projects in the past that were abandoned or are not
 currently making progress (such as Oracle integration, Hive StorageHandler)
 2) There are other big projects going on (ORC, Vectorization)
 3) There are lots of out standing patches that need to be dealt with.

 I would respond with two points to this.

 First, I agree that the large outstanding patch count is very bad.  It
 keeps people from getting involved in Hive.  It deprives Hive of fixes and
 improvements it would otherwise have.  Several of the committers are
 working to address this by checking in people's patches, but they are
 unable to keep up.  The best solution is to encourage other committers to
 check in patches as well and to find willing and able contributors and
 mentor them to committership as quickly as possible.

 Second, the way Apache works is that contributors scratch the itch that
 bothers them. So to argue "We shouldn't do X because we never finished Y"
 or "We shouldn't do X because we're doing Y" (where X and Y are
 independent) is not valid in Apache projects.  It's fine to argue that Tez
 hasn't been adequately explained (I think you hinted at this in previous
 emails) and ask for clarifications on what it is and what the planned
 changes are.  If after a full explanation you think it's a bad idea it's
 fine to argue Tez is the wrong direction for Hive and try to convince the
 rest of the community.  But assuming the community accepts that Tez is a
 reasonable direction and there are volunteers who want to do the work, then
 you can't argue they should work on something else instead.

 Alan.

 On Jul 15, 2013, at 6:51 PM, Edward Capriolo wrote:

  The Hive bylaws,
 https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what
 votes are needed for what.  I don't see anything there about
  needing 3 +1s for a branch.  Branching would seem to fall under code
  change, which requires one vote and a minimum length of 1 day.
 
  You could argue that all you need is one +1 to create a branch, but this
  is more than a branch. If you are talking about something that is:
  1) going to cause major re-factoring of critical pieces of hive like
  ExecDriver and MapRedTask
  2) going to be very disruptive to the efforts of other committers
  3) something that may be a major architectural change
 
  Getting the project on board with the idea is a good idea.
 
  Now I want to point something out. Here are some recent initiatives in
  hive:
 
  1) At one point there 

Re: Tez branch and tez based patches

2013-07-16 Thread Edward Capriolo
There are some points I want to bring up. First, I am on the PMC. Here is
something I find relevant:

http://www.apache.org/foundation/how-it-works.html

--

The role of the PMC from a Foundation perspective is oversight. The main
role of the PMC is not code and not coding - but to ensure that all legal
issues are addressed, that procedure is followed, and that each and every
release is the product of the community as a whole. That is key to our
litigation protection mechanisms.

Secondly the role of the PMC is to further the long term development and
health of the community as a whole, and to ensure that balanced and wide
scale peer review and collaboration does happen. Within the ASF we worry
about any community which centers around a few individuals who are working
virtually uncontested. We believe that this is detrimental to quality,
stability, and robustness of both code and long term social structures.



https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different

-

All other decisions happen on the dev list, discussions on the private list
are kept to a minimum.

If it didn't happen on the dev list, it didn't happen - which leads to:

a) Elections of committers and PMC members are published on the dev list
once finalized.

b) Out-of-band discussions (IRC etc.) are summarized on the dev list as
soon as they have impact on the project, code or community.
-

https://issues.apache.org/jira/browse/HIVE-4660, ironically titled "Let
there be Tez", has not been +1'ed by any committer. It was never discussed on
the dev or the user list (as far as I can tell).

As a PMC member I feel we need more discussion on Tez on the dev list along
with a wiki-fied design document. Topics of discussion should include:

1) What is tez?

2) How is tez different from oozie, http://code.google.com/p/hop/,
http://cs.brown.edu/~backman/cmr.html , and other DAG and/or streaming map
reduce tools/frameworks? Why should we use this and not those?

3) When can we expect the first tez release?

4) How much effort is involved in integrating hive and tez?

5) Who is ready to commit to this effort?

6) Can we expect this work to be done in one hive release?

In my opinion we should not start any work on this tez-hive until these
questions are answered to the satisfaction of the hive developers.

On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo edlinuxg...@gmail.com wrote:


 The Hive bylaws,
 https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out what
 votes are needed for what.  I don't see anything there about needing 3 +1s
 for a branch.  Branching would seem to fall under code change, which
 requires one vote and a minimum length of 1 day.

 You could argue that all you need is one +1 to create a branch, but this
 is more than a branch. If you are talking about something that is:
 1) going to cause major re-factoring of critical pieces of hive like
 ExecDriver and MapRedTask
 2) going to be very disruptive to the efforts of other committers
 3) something that may be a major architectural change

 Getting the project on board with the idea is a good idea.

 Now I want to point something out. Here are some recent initiatives in
 hive:

 1) At one point there was a big initiative to support Oracle. After the
 initial work, there are patches in Jira, and no one seems to care about
 Oracle support.
 2) Another such decision was the Windows support one; there are
 probably 4 Windows patches waiting for review.
 3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23
 support perspective is, but every couple of weeks we get another Jira about
 something not working/testing on one of those versions; it seems like
 several builds are broken.
 4) Hive storage handler: after the initial implementation no one cares to
 review any other storage handler implementation. There are 3 patches there
 or more, and I could not even find anyone willing to review the Cassandra
 storage handler I spent months on.
 5) ORC, Vectorization
 6) Windowing: committed, numerous check-style violations.

 We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We
 are spread very thin, and embarking on another side project not involved
 with core hive seems like the wrong direction at the moment.

 On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates ga...@hortonworks.com wrote:


 On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:

  I have started to see several refactoring patches around tez.
  https://issues.apache.org/jira/browse/HIVE-4843
 
  This is the only mention on the hive list I can find with tez:
  Makes sense. I will create the branch soon.
 
  Thanks,
  Ashutosh
 
 
  On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner 
  ghagleit...@hortonworks.com wrote:
 
  Hi,
 
  I am starting to work on integrating Tez into Hive (see HIVE-4660,
 design
  doc has already been uploaded - any feedback 

Re: Tez branch and tez based patches

2013-07-15 Thread Alan Gates

On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:

 I have started to see several refactoring patches around tez.
 https://issues.apache.org/jira/browse/HIVE-4843
 
 This is the only mention on the hive list I can find with tez:
 Makes sense. I will create the branch soon.
 
 Thanks,
 Ashutosh
 
 
 On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner 
 ghagleit...@hortonworks.com wrote:
 
 Hi,
 
 I am starting to work on integrating Tez into Hive (see HIVE-4660, design
 doc has already been uploaded - any feedback will be much appreciated).
 This will be a fair amount of work that will take time to stabilize/test.
 I'd like to propose creating a branch in order to be able to do this
 incrementally and collaboratively. In order to progress rapidly with this,
 I would also like to go commit-then-review.
 
 Thanks,
 Gunther.
 
 
 These refactorings are largely destructive to a number of bug fixes and
 language improvements in hive: language improvements and bug fixes that
 have been sitting in Jira for quite some time now, marked patch-available
 and waiting for review.
 
 There are a few things I want to point out:
 1) Normally we create design docs in our wiki (which this is not)
 2) Normally when the change is significantly complex we get multiple
 committers to comment on it (which we did not)
 On point 2, no one -1'd the branch, but this is really something that should
 have required a +1 from 3 committers.

The Hive bylaws,  https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay 
out what votes are needed for what.  I don't see anything there about needing 3 
+1s for a branch.  Branching would seem to fall under code change, which 
requires one vote and a minimum length of 1 day.

 
 I for one am not completely sold on Tez.
 http://incubator.apache.org/projects/tez.html.
 "directed-acyclic-graph of tasks for processing data": this description
 sounds like many things which have never become popular. One to think of is
 oozie: "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of
 actions." I am sure I can find a number of libraries/frameworks that make
 this same claim. In general I do not feel like we have done our homework
 and pre-requisites to justify all this work. If we have done the homework,
 I am sure that it has not been communicated and accepted by hive developers
 at large.

A request for better documentation on Tez and a project road map seems totally 
reasonable.

 
 If we have a branch, why are we also committing on trunk? Scanning through
 the tez doc, I keep finding language like "minimal changes to the
 planner", yet there are ALREADY lots of large changes going on!
 
 Really none of the above would bother me except for the fact that these
 "minimal changes" are causing many patch-available, ready-for-review bugs
 and core hive features to need to be rebased.
 
 I am sure I have mentioned this before, but I have to spend 12+ hours to
 test a single patch on my laptop. A few days ago I was testing a new core
 hive feature. After all the tests passed and before I was able to commit,
 someone unleashed a tez patch on trunk which caused the thing I was testing
 for 12 hours to need to be rebased.
 
 
 I'm not cool with this. Next time that happens to me I will seriously
 consider reverting the patch. Bug fixes and new hive features are more
 important to me than integrating with incubator projects.

(With my Apache member hat on)  Reverting patches that aren't breaking the 
build is considered very bad form in Apache.  It does make sense to request 
that when people are going to commit a patch that will break many other patches 
they first give a few hours of notice so people can say something if they're 
about to commit another patch and avoid your fate of needing to rerun the 
tests.  The other thing is we need to get the automated build of patches 
working on Hive so committers are not forced to run all of the tests themselves.  
We are working on it, but we're not there yet.

Alan.



Re: Tez branch and tez based patches

2013-07-15 Thread Edward Capriolo
The Hive bylaws,  https://cwiki.apache.org/confluence/display/Hive/Bylaws, 
lay out what votes are needed for what.  I don't see anything there about
needing 3 +1s for a branch.  Branching would seem to fall under code
change, which requires one vote and a minimum length of 1 day.

You could argue that all you need is one +1 to create a branch, but this is
more than a branch. If you are talking about something that is:
1) going to cause major re-factoring of critical pieces of hive like
ExecDriver and MapRedTask
2) going to be very disruptive to the efforts of other committers
3) something that may be a major architectural change

Getting the project on board with the idea is a good idea.

Now I want to point something out. Here are some recent initiatives in hive:

1) At one point there was a big initiative to support Oracle. After the
initial work, there are patches in Jira, and no one seems to care about
Oracle support.
2) Another such decision was the Windows support one; there are
probably 4 Windows patches waiting for review.
3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23
support perspective is, but every couple of weeks we get another Jira about
something not working/testing on one of those versions; it seems like
several builds are broken.
4) Hive storage handler: after the initial implementation no one cares to
review any other storage handler implementation. There are 3 patches there
or more, and I could not even find anyone willing to review the Cassandra
storage handler I spent months on.
5) ORC, Vectorization
6) Windowing: committed, numerous check-style violations.

We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We
are spread very thin, and embarking on another side project not involved
with core hive seems like the wrong direction at the moment.

On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates ga...@hortonworks.com wrote:


 On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:

  I have started to see several refactoring patches around tez.
  https://issues.apache.org/jira/browse/HIVE-4843
 
  This is the only mention on the hive list I can find with tez:
  Makes sense. I will create the branch soon.
 
  Thanks,
  Ashutosh
 
 
  On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner 
  ghagleit...@hortonworks.com wrote:
 
  Hi,
 
  I am starting to work on integrating Tez into Hive (see HIVE-4660, design
  doc has already been uploaded - any feedback will be much appreciated).
  This will be a fair amount of work that will take time to stabilize/test.
  I'd like to propose creating a branch in order to be able to do this
  incrementally and collaboratively. In order to progress rapidly with this,
  I would also like to go commit-then-review.
 
  Thanks,
  Gunther.
  
 
  These refactorings are largely destructive to a number of bug fixes and
  language improvements in hive: language improvements and bug fixes that
  have been sitting in Jira for quite some time now, marked patch-available
  and waiting for review.
 
  There are a few things I want to point out:
  1) Normally we create design docs in our wiki (which this is not)
  2) Normally when the change is significantly complex we get multiple
  committers to comment on it (which we did not)
  On point 2, no one -1'd the branch, but this is really something that
  should have required a +1 from 3 committers.

 The Hive bylaws,  https://cwiki.apache.org/confluence/display/Hive/Bylaws, 
 lay out what votes are needed for what.  I don't see anything there about
 needing 3 +1s for a branch.  Branching would seem to fall under code
 change, which requires one vote and a minimum length of 1 day.

 
  I for one am not completely sold on Tez.
  http://incubator.apache.org/projects/tez.html.
  "directed-acyclic-graph of tasks for processing data": this description
  sounds like many things which have never become popular. One to think of
  is oozie: "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of
  actions." I am sure I can find a number of libraries/frameworks that make
  this same claim. In general I do not feel like we have done our homework
  and pre-requisites to justify all this work. If we have done the homework,
  I am sure that it has not been communicated and accepted by hive
  developers at large.

 A request for better documentation on Tez and a project road map seems
 totally reasonable.

 
  If we have a branch, why are we also committing on trunk? Scanning through
  the tez doc, I keep finding language like "minimal changes to the
  planner", yet there are ALREADY lots of large changes going on!
 
  Really none of the above would bother me except for the fact that these
  "minimal changes" are causing many patch-available, ready-for-review bugs
  and core hive features to need to be rebased.
 
  I am sure I have mentioned this before, but I have to spend 12+ hours to
  test a single patch on my laptop. A few days ago I was testing a new core
  hive feature. After all the