[jira] [Created] (HIVE-10798) Remove dependence on VectorizedBatchUtil from VectorizedOrcAcidRowReader

2015-05-22 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-10798:


 Summary: Remove dependence on VectorizedBatchUtil from 
VectorizedOrcAcidRowReader
 Key: HIVE-10798
 URL: https://issues.apache.org/jira/browse/HIVE-10798
 Project: Hive
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Owen O'Malley


VectorizedBatchUtil has a lot of dependencies that Orc should avoid, and the code 
should be refactored.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 34473: HIVE-10749 Implement Insert statement for parquet

2015-05-22 Thread cheng xu


 On May 21, 2015, 7:18 p.m., Sergio Pena wrote:
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java,
   line 59
  https://reviews.apache.org/r/34473/diff/2/?file=966160#file966160line59
 
  Could you separate words with _? Like ENABLE_ACID_SCHEMA_INFO. It helps 
  to read the constant more easily.
  
  Do we have to enable transactions exclusively for parquet? Isn't there 
  another variable that enables transactions on Hive that we can use?

This variable is used for setting the schema for parquet. It is only related to 
whether you need to write data to the base file or not, so we have to use this 
approach to append the ACID info to the original data.
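
To make the intent concrete, here is a purely illustrative Java sketch of what 
appending ACID info to the original columns could look like at the schema level. 
This is not the patch's code: the ACID column names are borrowed from Hive's 
existing ACID row layout, and the class name and "columns" property key are 
placeholders.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class AcidColumnsSketch {
  static List<String> columnsWithAcidInfo(Properties tableProperties) {
    // ACID metadata columns come first, then the table's declared columns.
    List<String> columns = new ArrayList<>(Arrays.asList(
        "operation", "originalTransaction", "bucket", "rowId", "currentTransaction"));
    String declared = tableProperties.getProperty("columns", "");
    if (!declared.isEmpty()) {
      columns.addAll(Arrays.asList(declared.split(",")));
    }
    return columns;
  }
}
{code}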


 On May 21, 2015, 7:18 p.m., Sergio Pena wrote:
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java,
   lines 98-103
  https://reviews.apache.org/r/34473/diff/2/?file=966160#file966160line98
 
  You can use this one line to return the column list:
  
  return (List<String>) 
  StringUtils.getStringCollection(tableProperties.getProperty(IOConstants.COLUMNS));
  
  It will return an empty list if COLUMNS is empty.

Great suggestion!
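
For reference, a minimal self-contained sketch of the suggested one-liner (it 
assumes Hadoop's org.apache.hadoop.util.StringUtils and Hive's 
IOConstants.COLUMNS; the surrounding class and method name are stand-ins, not 
the actual patch):

{code}
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.hive.ql.io.IOConstants;
import org.apache.hadoop.util.StringUtils;

public class ColumnListSketch {
  // getStringCollection splits the comma-separated "columns" property and
  // returns an empty ArrayList when the property is missing or empty, so no
  // explicit null/empty handling is needed here.
  static List<String> getColumnNames(Properties tableProperties) {
    return (List<String>) StringUtils.getStringCollection(
        tableProperties.getProperty(IOConstants.COLUMNS));
  }
}
{code}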


- cheng


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34473/#review84758
---


On May 22, 2015, 6:26 a.m., cheng xu wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34473/
 ---
 
 (Updated May 22, 2015, 6:26 a.m.)
 
 
 Review request for hive, Alan Gates, Owen O'Malley, and Sergio Pena.
 
 
 Bugs: HIVE-10749
 https://issues.apache.org/jira/browse/HIVE-10749
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 Implement the insert statement for parquet format.
 
 
 Diffs
 -
 
   
 ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java
  c6fb26c 
   
 ql/src/java/org/apache/hadoop/hive/ql/io/parquet/acid/ParquetRecordUpdater.java
  PRE-CREATION 
   
 ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java
  f513572 
   
 ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ObjectArrayWritableObjectInspector.java
  571f993 
   
 ql/src/test/org/apache/hadoop/hive/ql/io/parquet/acid/TestParquetRecordUpdater.java
  PRE-CREATION 
   ql/src/test/queries/clientpositive/acid_parquet_insert.q PRE-CREATION 
   ql/src/test/results/clientpositive/acid_parquet_insert.q.out PRE-CREATION 
 
 Diff: https://reviews.apache.org/r/34473/diff/
 
 
 Testing
 ---
 
 Newly added qtest and UT passed locally
 
 
 Thanks,
 
 cheng xu
 




[jira] [Created] (HIVE-10796) Remove dependencies on NumericHistogram and NumDistinctValueEstimator from JavaDataModel

2015-05-22 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-10796:


 Summary: Remove dependencies on NumericHistogram and 
NumDistinctValueEstimator from JavaDataModel
 Key: HIVE-10796
 URL: https://issues.apache.org/jira/browse/HIVE-10796
 Project: Hive
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Owen O'Malley


The JavaDataModel class is used in a lot of places and the non-general 
calculations are better done in the other classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 34473: HIVE-10749 Implement Insert statement for parquet

2015-05-22 Thread cheng xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34473/
---

(Updated May 22, 2015, 6:26 a.m.)


Review request for hive, Alan Gates, Owen O'Malley, and Sergio Pena.


Changes
---

Summary:
1. use some utility to reduce LOC
2. remove *ParquetRecordReaderWrapper.java* and use 
*ObjectArrayWritableObjectInspector* instead


Bugs: HIVE-10749
https://issues.apache.org/jira/browse/HIVE-10749


Repository: hive-git


Description
---

Implement the insert statement for parquet format.


Diffs (updated)
-

  
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java 
c6fb26c 
  
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/acid/ParquetRecordUpdater.java 
PRE-CREATION 
  
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java
 f513572 
  
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ObjectArrayWritableObjectInspector.java
 571f993 
  
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/acid/TestParquetRecordUpdater.java
 PRE-CREATION 
  ql/src/test/queries/clientpositive/acid_parquet_insert.q PRE-CREATION 
  ql/src/test/results/clientpositive/acid_parquet_insert.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/34473/diff/


Testing
---

Newly added qtest and UT passed locally


Thanks,

cheng xu



Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]

2015-05-22 Thread Alexander Pivovarov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review84876
---



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java
https://reviews.apache.org/r/34455/#comment136299

use 2 spaces for indent



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java
https://reviews.apache.org/r/34455/#comment136300

use 2 spaces for indent


- Alexander Pivovarov


On May 22, 2015, 6:18 a.m., chengxiang li wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34455/
 ---
 
 (Updated May 22, 2015, 6:18 a.m.)
 
 
 Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
 
 
 Bugs: HIVE-10550
 https://issues.apache.org/jira/browse/HIVE-10550
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 see jira description
 
 
 Diffs
 -
 
   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc 
   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java 
 PRE-CREATION 
   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243 
   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac 
   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a 
   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 
 3f240f5 
   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 
 e6c845c 
   
 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java
  PRE-CREATION 
   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 
 19aae70 
   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79 
 
 Diff: https://reviews.apache.org/r/34455/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 chengxiang li
 




[jira] [Created] (HIVE-10797) Simplify the test for vectorized input

2015-05-22 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-10797:


 Summary: Simplify the test for vectorized input
 Key: HIVE-10797
 URL: https://issues.apache.org/jira/browse/HIVE-10797
 Project: Hive
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Owen O'Malley


The call to Utilities.isVectorMode should be simplified for the readers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10795) Remove use of PerfLogger from Orc

2015-05-22 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-10795:


 Summary: Remove use of PerfLogger from Orc
 Key: HIVE-10795
 URL: https://issues.apache.org/jira/browse/HIVE-10795
 Project: Hive
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Owen O'Malley


PerfLogger is yet another class with a huge dependency set that Orc doesn't 
need.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]

2015-05-22 Thread chengxiang li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/
---

(Updated May 22, 2015, 6:18 a.m.)


Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.


Changes
---

Keep all the previous multi-insert cache code.


Bugs: HIVE-10550
https://issues.apache.org/jira/browse/HIVE-10550


Repository: hive-git


Description
---

see jira description


Diffs (updated)
-

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 
3f240f5 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java
 PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 
  ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79 

Diff: https://reviews.apache.org/r/34455/diff/


Testing
---


Thanks,

chengxiang li



[jira] [Created] (HIVE-10799) Refactor the SearchArgumentFactory to remove the dependence on ExprNodeGenericFuncDesc

2015-05-22 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-10799:


 Summary: Refactor the SearchArgumentFactory to remove the 
dependence on ExprNodeGenericFuncDesc
 Key: HIVE-10799
 URL: https://issues.apache.org/jira/browse/HIVE-10799
 Project: Hive
  Issue Type: Sub-task
Reporter: Owen O'Malley
Assignee: Owen O'Malley


SearchArgumentFactory and SearchArgumentImpl are high level and shouldn't 
depend on the internals of Hive's AST model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-22 Thread Chris Drome
I understand the motivation and benefits of creating a branch-2 where more 
disruptive work can go on without affecting branch-1. While not necessarily 
against this approach, from Yahoo's standpoint, I do have some questions 
(concerns).
Upgrading to a new version of Hive requires a significant commitment of time 
and resources to stabilize and certify a build for deployment to our clusters. 
Given the size of our clusters and scale of datasets, we have to be 
particularly careful about adopting new functionality. However, at the same 
time we are interested in testing and making available new features and 
functionality. That said, we would have to rely on branch-1 for the immediate 
future.
One concern is that branch-1 would be left to stagnate, at which point there 
would be no option but for users to move to branch-2 as branch-1 would be 
effectively end-of-lifed. I'm not sure how long this would take, but it would 
eventually happen as a direct result of the very reason for creating branch-2.
A related concern is how disruptive the code changes will be in branch-2. I 
imagine that changes early in branch-2 will be easy to backport to branch-1, 
while this effort will become more difficult, if not impractical, as time goes on. 
If the code bases diverge too much then this could lead to more pressure for 
users of branch-1 to add features just to branch-1, which has been mentioned as 
undesirable. By the same token, backporting any code in branch-2 will require 
an increasing amount of effort, which contributors to branch-2 may not be 
interested in committing to.
These questions affect us directly because, while we require a certain amount 
of stability, we also like to pull in new functionality that will be of value 
to our users. For example, our current 0.13 release is probably closer to 0.14 
at this point. Given the lifespan of a release, it is often more palatable to 
backport features and bugfixes than to jump to a new version.

The good thing about this proposal is the opportunity to evaluate and clean up 
a lot of the old code.
Thanks,
chris
 


 On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin 
ser...@hortonworks.com wrote:
   

 Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some
people are set in their ways or have practical considerations and don’t
care for new shiny stuff.

On 15/5/18, 11:46, Sergey Shelukhin ser...@hortonworks.com wrote:

I think we need some path for deprecating old Hadoop versions, the same
way we deprecate old Java version support or old RDBMS version support.
At some point the cost of supporting Hadoop 1 exceeds the benefit. Same
goes for stuff like MR; supporting it, esp. for perf work, becomes a
burden, and it’s outdated with 2 alternatives, one of which has been
around for 2 releases.
The branches are a graceful way to get rid of the legacy burden.

Alternatively, when sweeping changes are made, we can do what Hbase did
(which is not pretty imho), where 0.94 version had ~30 dot releases
because people cannot upgrade to 0.96 “singularity” release.


I posit that people who run Hadoop 1 and MR at this day and age (and more
so as time passes) are people who either don’t care about perf and new
features, only stability; so, stability-focused branch would be perfect to
support them.


On 15/5/18, 10:04, Edward Capriolo edlinuxg...@gmail.com wrote:

Up until recently Hive supported numerous versions of Hadoop code base
with
a simple shim layer. I would rather we stick to the shim layer. I think
this was easily the best part about hive was that a single release worked
well regardless of your hadoop version. It was also a key element to
hive's
success. I do not want to see us have multiple branches.

On Sat, May 16, 2015 at 1:29 AM, Xuefu Zhang xzh...@cloudera.com wrote:

 Thanks for the explanation, Alan!

 While I have understood more on the proposal, I actually see more
problems
 than the confusion of two lines of releases. Essentially, this proposal
 forces a user to make a hard choice between a stabler, legacy-aware
release
 line and an adventurous, pioneering release line. And once the choice
is
 made, there is no easy way back or forward.

 Here is my interpretation. Let's say we have two main branches as
 proposed. I develop a new feature which I think useful for both
branches.
 So, I commit it to both branches. My feature requires additional schema
 support, so I provide upgrade scripts for both branches. The scripts
are
 different because the two branches have already diverged in schema.

 Now the two branches evolve in a diverging fashion like this. This is
all
 good as long as a user stays in his line. The moment the user considers
a
 switch, mostly likely, from branch-1 to branch-2, he is stuck. Why?
Because
 there is no upgrade path from a release in branch-1 to a release in
 branch-2!

 If we want to provide an upgrade path, then there will be MxN paths,
where
 M and N are the number of releases in the two branches, 

Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-22 Thread Nick Dimiduk
On Fri, May 22, 2015 at 1:19 PM, Alan Gates alanfga...@gmail.com wrote:

 I see your point on saying the contributor may not understand where best
 to put the patch, and thus the committer decides.  However, it would be
 very disappointing for a contributor who uses branch-1 to build a new
 feature only to have the committer put it only in master.  So I would
 modify your modification to say at the discretion of the contributor and
 Hive committers.


For what its worth, this is more or less how HBase works. All features land
first in master and then percolate backwards to open, active branches
where it's acceptable to do so. Since our 1.0 release, we're trying to
make 1.0+ follow more closely to semantic versioning. This means that new
features never land in a released minor branch. Bug fixes are applied to
all applicable branches, sometimes this means older release branches and
not master. Sometimes that means contributors are forced to upgrade in
order to take advantage of their contribution in an Apache release (they're
fine to run their own patched builds as they like; it's open source). Right
now we have:

master - (unreleased, development branch for eventual 2.0)
branch-1 - (unreleased, development branch for 1.x series, soon to be
branch basis for 1.2)
branch-1.1 - (released branch, accepting only bug fixes for 1.1.x line)
branch-1.0 - (released branch, accepting only bug fixes for 1.0.x line)

When we're ready, branch-1.2 will fork from branch-1 and branch-1 will
become development branch for 1.3. Eventually we'll decide it's time for
2.0 and master will be branched, creating branch-2. branch-2 will follow
the same process.

We also maintain active branches for 0.98.x and 0.94.x. These branches are
different, following our old model of receiving backward-compatible new
features in .x versions. 0.94 is basically retired now, only getting bug
fixes. 0.94 is only hadoop-1, 0.98 supports both hadoop-1 and hadoop-2
(maybe we've retired hadoop-2 support here in the .12 release?), 1.x
supports hadoop-2 only. 2.0 is undecided, but presumably will be hadoop-2
and hadoop-3 if we can extend our shim layer for it.

We have separate release managers for 0.94, 0.98, 1.0, and 1.1, and we're
discussing preparations for 1.2. They enforce commits against their
respective branches.


   kulkarni.swar...@gmail.com
  May 22, 2015 at 11:41
 +1 on the new proposal. Feedback below:

  New features must be put into master.  Whether to put them into
 branch-1 is at the discretion of the developer.

 How about we change this to *All* features must be put into master.
 Whether to put them into branch-1 is at the discretion of the *committer*.
 The reason I think is going forward for us to sustain as a happy and
 healthy community, it's imperative for us to make it not only easy for the
 users, but also for developers and committers to contribute/commit patches.
 To me being a hive contributor would be hard to determine which branch my
 code belongs. Also IMO(and I might be wrong) but many committers have their
 own areas of expertise and it's also very hard for them to immediately
 determine what branch a patch should go to unless very well documented
 somewhere. Putting all code into the master would be an easy approach to
 follow and then cherry picking to other branches can be done. So even if
 people forget to do that, we can always go back to master and port the
 patches out to these branches. So we have a master branch, a branch-1 for
 stable code, branch-2 for experimental and bleeding edge code and so on.
 Once branch-2 is stable, we deprecate branch-1, create branch-3 and move on.

 Another reason I say this is because in my experience, a pretty
 significant amount of work is hive is still bug fixes and I think that is
 what the user cares most about(correctness above anything else). So with
 this approach, might be very obvious to what branches to commit this to.




 --
 Swarnim
Chris Drome cdr...@yahoo-inc.com.INVALID
  May 22, 2015 at 0:49
 I understand the motivation and benefits of creating a branch-2 where more
 disruptive work can go on without affecting branch-1. While not necessarily
 against this approach, from Yahoo's standpoint, I do have some questions
 (concerns).
 Upgrading to a new version of Hive requires a significant commitment of
 time and resources to stabilize and certify a build for deployment to our
 clusters. Given the size of our clusters and scale of datasets, we have to
 be particularly careful about adopting new functionality. However, at the
 same time we are interested in new testing and making available new
 features and functionality. That said, we would have to rely on branch-1
 for the immediate future.
 One concern is that branch-1 would be left to stagnate, at which point
 there would be no option but for users to move to branch-2 as branch-1
 would be effectively end-of-lifed. I'm not sure how long this would take,
 but it would eventually happen as a direct result of the 

[jira] [Created] (HIVE-10804) CBO: Calcite Operator To Hive Operator (Calcite Return Path): optimizer for limit 0 does not work

2015-05-22 Thread Pengcheng Xiong (JIRA)
Pengcheng Xiong created HIVE-10804:
--

 Summary: CBO: Calcite Operator To Hive Operator (Calcite Return 
Path): optimizer for limit 0 does not work
 Key: HIVE-10804
 URL: https://issues.apache.org/jira/browse/HIVE-10804
 Project: Hive
  Issue Type: Sub-task
Reporter: Pengcheng Xiong
Assignee: Pengcheng Xiong


{code}
explain
select key,value from src order by key limit 0
POSTHOOK: type: QUERY
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: src
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
Select Operator
  expressions: key (type: string), value (type: string)
  outputColumnNames: key, value
  Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
  Reduce Output Operator
key expressions: key (type: string)
sort order: +
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
value expressions: value (type: string)
  Reduce Operator Tree:
Select Operator
  expressions: KEY.reducesinkkey0 (type: string), VALUE.value (type: 
string)
  outputColumnNames: key, value
  Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
  Limit
Number of rows: 0
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-22 Thread Alan Gates
I agree with *All* features with the exception that some features might 
be branch-1 specific (if it's a feature on something no longer supported 
in master, like hadoop-1).  Without this we prevent new features for 
older technology, which doesn't strike me as reasonable.


I see your point on saying the contributor may not understand where best 
to put the patch, and thus the committer decides.  However, it would be 
very disappointing for a contributor who uses branch-1 to build a new 
feature only to have the committer put it only in master.  So I would 
modify your modification to say at the discretion of the contributor 
and Hive committers.


Alan.


kulkarni.swar...@gmail.com mailto:kulkarni.swar...@gmail.com
May 22, 2015 at 11:41
+1 on the new proposal. Feedback below:

 New features must be put into master.  Whether to put them into 
branch-1 is at the discretion of the developer.


How about we change this to *_All_* features must be put into master. 
Whether to put them into branch-1 is at the discretion of the 
*_committer_*. The reason I think is going forward for us to sustain 
as a happy and healthy community, it's imperative for us to make it 
not only easy for the users, but also for developers and committers to 
contribute/commit patches. To me being a hive contributor would be 
hard to determine which branch my code belongs. Also IMO(and I might 
be wrong) but many committers have their own areas of expertise and 
it's also very hard for them to immediately determine what branch a 
patch should go to unless very well documented somewhere. Putting all 
code into the master would be an easy approach to follow and then 
cherry picking to other branches can be done. So even if people forget 
to do that, we can always go back to master and port the patches out 
to these branches. So we have a master branch, a branch-1 for stable 
code, branch-2 for experimental and bleeding edge code and so on. 
Once branch-2 is stable, we deprecate branch-1, create branch-3 and 
move on.


Another reason I say this is because in my experience, a pretty 
significant amount of work is hive is still bug fixes and I think that 
is what the user cares most about(correctness above anything else). So 
with this approach, might be very obvious to what branches to commit 
this to.





--
Swarnim
Chris Drome mailto:cdr...@yahoo-inc.com.INVALID
May 22, 2015 at 0:49
I understand the motivation and benefits of creating a branch-2 where 
more disruptive work can go on without affecting branch-1. While not 
necessarily against this approach, from Yahoo's standpoint, I do have 
some questions (concerns).
Upgrading to a new version of Hive requires a significant commitment 
of time and resources to stabilize and certify a build for deployment 
to our clusters. Given the size of our clusters and scale of datasets, 
we have to be particularly careful about adopting new functionality. 
However, at the same time we are interested in new testing and making 
available new features and functionality. That said, we would have to 
rely on branch-1 for the immediate future.
One concern is that branch-1 would be left to stagnate, at which point 
there would be no option but for users to move to branch-2 as branch-1 
would be effectively end-of-lifed. I'm not sure how long this would 
take, but it would eventually happen as a direct result of the very 
reason for creating branch-2.
A related concern is how disruptive the code changes will be in 
branch-2. I imagine that changes in early in branch-2 will be easy to 
backport to branch-1, while this effort will become more difficult, if 
not impractical, as time goes. If the code bases diverge too much then 
this could lead to more pressure for users of branch-1 to add features 
just to branch-1, which has been mentioned as undesirable. By the same 
token, backporting any code in branch-2 will require an increasing 
amount of effort, which contributors to branch-2 may not be interested 
in committing to.
These questions affect us directly because, while we require a certain 
amount of stability, we also like to pull in new functionality that 
will be of value to our users. For example, our current 0.13 release 
is probably closer to 0.14 at this point. Given the lifespan of a 
release, it is often more palatable to backport features and bugfixes 
than to jump to a new version.


The good thing about this proposal is the opportunity to evaluate and 
clean up alot of the old code.

Thanks,
chris



On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin 
ser...@hortonworks.com wrote:



Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some
people are set in their ways or have practical considerations and don’t
care for new shiny stuff.





Sergey Shelukhin mailto:ser...@hortonworks.com
May 18, 2015 at 11:47
Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some
people are set in their ways or have practical considerations and don’t
care for new shiny 

[jira] [Created] (HIVE-10802) Table join query with some constant field in select fails

2015-05-22 Thread Aihua Xu (JIRA)
Aihua Xu created HIVE-10802:
---

 Summary: Table join query with some constant field in select fails
 Key: HIVE-10802
 URL: https://issues.apache.org/jira/browse/HIVE-10802
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Affects Versions: 1.2.0
Reporter: Aihua Xu


The following query fails:
{noformat}
create table tb1 (year string, month string);
create table tb2(month string);
select unix_timestamp(a.year) 
from (select * from tb1 where year='2001') a join tb2 b on (a.month=b.month);
{noformat}

with the exception {noformat}
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at 
org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.init(StandardStructObjectInspector.java:118)
at 
org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.init(StandardStructObjectInspector.java:109)
at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.getStandardStructObjectInspector(ObjectInspectorFactory.java:290)
at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.getStandardStructObjectInspector(ObjectInspectorFactory.java:275)
at 
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.getJoinOutputObjectInspector(CommonJoinOperator.java:175)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10803) document jdbc url format properly

2015-05-22 Thread Thejas M Nair (JIRA)
Thejas M Nair created HIVE-10803:


 Summary: document jdbc url format properly
 Key: HIVE-10803
 URL: https://issues.apache.org/jira/browse/HIVE-10803
 Project: Hive
  Issue Type: Bug
  Components: Documentation, HiveServer2
Reporter: Thejas M Nair


This is the format of the HS2 connection string; it needs to be documented in the 
wiki doc (taken from jdbc.Utils.java):
 
jdbc:hive2://host1:port1,host2:port2/dbName;sess_var_list?hive_conf_list#hive_var_list
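
For illustration, a minimal JDBC snippet using this URL format; the hosts, ports, 
database, and the session/conf/var values below are placeholders, not taken from 
the JIRA:

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Hs2UrlExample {
  public static void main(String[] args) throws Exception {
    // Explicit driver load; with a JDBC 4 driver on the classpath this is optional.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://host1:10000,host2:10000/default"
        + ";transportMode=binary"    // sess_var_list
        + "?hive.exec.parallel=true" // hive_conf_list
        + "#myvar=myvalue";          // hive_var_list
    try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT 1")) {
      while (rs.next()) {
        System.out.println(rs.getInt(1));
      }
    }
  }
}
{code}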




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10805) OOM in vectorized reduce

2015-05-22 Thread Matt McCline (JIRA)
Matt McCline created HIVE-10805:
---

 Summary: OOM in vectorized reduce
 Key: HIVE-10805
 URL: https://issues.apache.org/jira/browse/HIVE-10805
 Project: Hive
  Issue Type: Bug
Reporter: Matt McCline
Assignee: Matt McCline
Priority: Blocker
 Fix For: 1.2.1


Vectorized reduce does not release scratch byte space in BytesColumnVectors and 
runs out of memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10806) Incorrect example for exploding map function in hive wiki

2015-05-22 Thread anup b (JIRA)
anup b created HIVE-10806:
-

 Summary: Incorrect example for exploding map function in hive wiki
 Key: HIVE-10806
 URL: https://issues.apache.org/jira/browse/HIVE-10806
 Project: Hive
  Issue Type: Bug
  Components: Documentation
Affects Versions: 0.10.0
Reporter: anup b
Priority: Trivial


In the hive wiki, the example for exploding a map is wrong; it doesn't work in hive 0.10.

Example given in the wiki which doesn't work:

SELECT explode(myMap) AS myMapKey, myMapValue FROM myMapTable;

It should be updated to :
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;

Link : 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10809) HCat FileOutputCommitterContainer leaves behind empty _SCRATCH directories

2015-05-22 Thread Selina Zhang (JIRA)
Selina Zhang created HIVE-10809:
---

 Summary: HCat FileOutputCommitterContainer leaves behind empty 
_SCRATCH directories
 Key: HIVE-10809
 URL: https://issues.apache.org/jira/browse/HIVE-10809
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 1.2.0
Reporter: Selina Zhang
Assignee: Selina Zhang


When a static partition is added through HCatStorer or HCatWriter:

{code}
JoinedData = LOAD '/user/selinaz/data/part-r-0' USING JsonLoader();
STORE JoinedData INTO 'selina.joined_events_e' USING 
org.apache.hive.hcatalog.pig.HCatStorer('author=selina');
{code}

The table directory looks like
{noformat}
drwx--   - selinaz users  0 2015-05-22 21:19 
/user/selinaz/joined_events_e/_SCRATCH0.9157208938193798
drwx--   - selinaz users  0 2015-05-22 21:19 
/user/selinaz/joined_events_e/author=selina
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10808) Inner join on Null throwing Cast Exception

2015-05-22 Thread Naveen Gangam (JIRA)
Naveen Gangam created HIVE-10808:


 Summary: Inner join on Null throwing Cast Exception
 Key: HIVE-10808
 URL: https://issues.apache.org/jira/browse/HIVE-10808
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 0.13.1
Reporter: Naveen Gangam
Assignee: Naveen Gangam
Priority: Critical


select
 a.col1,
 a.col2,
 a.col3,
 a.col4
 from
 tab1 a
 inner join
 (
 select
 max(x) as x
 from
 tab1
 where
 x  20130327
 ) r
 on
 a.x = r.x
 where
 a.col1 = 'F'
 and a.col3 in ('A', 'S', 'G');

Failed Task log snippet:

2015-05-18 19:22:17,372 INFO [main] 
org.apache.hadoop.hive.ql.exec.mr.ObjectCache: Ignoring retrieval request: 
__MAP_PLAN__
2015-05-18 19:22:17,372 INFO [main] 
org.apache.hadoop.hive.ql.exec.mr.ObjectCache: Ignoring cache key: __MAP_PLAN__
2015-05-18 19:22:17,457 WARN [main] org.apache.hadoop.mapred.YarnChild: 
Exception running child : java.lang.RuntimeException: Error in configuring 
object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:446)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 17 more
Caused by: java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:157)
... 22 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.ClassCastException: 
org.apache.hadoop.hive.serde2.NullStructSerDe$NullStructSerDeObjectInspector 
cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
at 
org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:334)
at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:352)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:126)
... 22 more
Caused by: java.lang.ClassCastException: 
org.apache.hadoop.hive.serde2.NullStructSerDe$NullStructSerDeObjectInspector 
cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.isInstanceOfSettableOI(ObjectInspectorUtils.java:)
at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.hasAllFieldsSettable(ObjectInspectorUtils.java:1149)
at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConvertedOI(ObjectInspectorConverters.java:219)
at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConvertedOI(ObjectInspectorConverters.java:183)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:316)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10807) Invalidate basic stats for insert queries if autogather=false

2015-05-22 Thread Ashutosh Chauhan (JIRA)
Ashutosh Chauhan created HIVE-10807:
---

 Summary: Invalidate basic stats for insert queries if 
autogather=false
 Key: HIVE-10807
 URL: https://issues.apache.org/jira/browse/HIVE-10807
 Project: Hive
  Issue Type: Bug
  Components: Statistics
Affects Versions: 1.2.0
Reporter: Gopal V
Assignee: Ashutosh Chauhan


Setting stats.autogather=false leads to incorrect basic stats in the case of insert 
statements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10801) 'drop view' fails throwing java.lang.NullPointerException

2015-05-22 Thread Hari Sankar Sivarama Subramaniyan (JIRA)
Hari Sankar Sivarama Subramaniyan created HIVE-10801:


 Summary: 'drop view' fails throwing java.lang.NullPointerException
 Key: HIVE-10801
 URL: https://issues.apache.org/jira/browse/HIVE-10801
 Project: Hive
  Issue Type: Bug
Reporter: Hari Sankar Sivarama Subramaniyan
Assignee: Hari Sankar Sivarama Subramaniyan


When trying to drop a view, hive log shows:
{code}
2015-05-21 11:53:06,126 ERROR [HiveServer2-Background-Pool: Thread-197]: 
hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) - Could 
not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider 
!!
2015-05-21 11:53:06,134 ERROR [HiveServer2-Background-Pool: Thread-197]: 
metastore.RetryingHMSHandler (RetryingHMSHandler.java:invoke(155)) - 
MetaException(message:java.lang.NullPointerException)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:5379)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_table_with_environment_context(HiveMetaStore.java:1734)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
at com.sun.proxy.$Proxy7.drop_table_with_environment_context(Unknown 
Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.drop_table_with_environment_context(HiveMetaStoreClient.java:2056)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.drop_table_with_environment_context(SessionHiveMetaStoreClient.java:118)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:968)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:904)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy8.dropTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:1035)
at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:972)
at org.apache.hadoop.hive.ql.exec.DDLTask.dropTable(DDLTask.java:3836)
at 
org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3692)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:331)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1650)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1409)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1192)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1054)
at 
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
at 
org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
at 
org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1213)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_table_core(HiveMetaStore.java:1546)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_table_with_environment_context(HiveMetaStore.java:1723)
... 40 more


Build failed in Jenkins: HIVE-TRUNK-JAVA8 #72

2015-05-22 Thread hiveqa
See 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/HIVE-TRUNK-JAVA8/72/

--
Started by timer
Building in workspace 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/HIVE-TRUNK-JAVA8/ws/
  git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
  git config remote.origin.url 
  https://git-wip-us.apache.org/repos/asf/hive.git # timeout=10
Fetching upstream changes from https://git-wip-us.apache.org/repos/asf/hive.git
  git --version # timeout=10
  git fetch --tags --progress https://git-wip-us.apache.org/repos/asf/hive.git 
  +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
ERROR: Error fetching remote repo 'origin'
Archiving artifacts
Recording test results


Re: Review Request 34593: HIVE-10702 COUNT(*) over windowing 'x preceding and y preceding' doesn't work properly

2015-05-22 Thread Aihua Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34593/
---

(Updated May 22, 2015, 11:57 a.m.)


Review request for hive.


Repository: hive-git


Description (updated)
---

HIVE-10702 COUNT(*) over windowing 'x preceding and y preceding' doesn't work 
properly


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFRollingPartition.java 
e195c0a2815687ded15d186cfe6279fdbc212819 
  ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java 
d7817d90dce7c851affdf35aff65ce3de259c866 
  ql/src/test/queries/clientpositive/windowing_windowspec2.q 
3e8aa93494c0ad9119f475deca9edef74beb8a46 
  ql/src/test/results/clientpositive/windowing_windowspec2.q.out 
0879344a2364532c53ffc697ea402d99701d3723 

Diff: https://reviews.apache.org/r/34593/diff/


Testing
---

Test has been done here https://issues.apache.org/jira/browse/HIVE-10702

Seems one test failed for an unrelated reason.


Thanks,

Aihua Xu



Review Request 34593: HIVE-10702 COUNT(*) over windowing 'x preceding and y preceding' doesn't work properly

2015-05-22 Thread Aihua Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34593/
---

Review request for hive.


Repository: hive-git


Description
---

HIVE-10702 COUNT(*) over windowing 'x preceding and y preceding' doesn't work 
properly


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFRollingPartition.java 
e195c0a2815687ded15d186cfe6279fdbc212819 
  ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java 
d7817d90dce7c851affdf35aff65ce3de259c866 
  ql/src/test/queries/clientpositive/windowing_windowspec2.q 
3e8aa93494c0ad9119f475deca9edef74beb8a46 
  ql/src/test/results/clientpositive/windowing_windowspec2.q.out 
0879344a2364532c53ffc697ea402d99701d3723 

Diff: https://reviews.apache.org/r/34593/diff/


Testing
---

Test has been done here https://issues.apache.org/jira/browse/HIVE-10702

Seems one test failed for an unrelated reason.


Thanks,

Aihua Xu



[jira] [Created] (HIVE-10800) CBO (Calcite Return Path): Setup correct information if CBO succeeds

2015-05-22 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-10800:
--

 Summary: CBO (Calcite Return Path): Setup correct information if 
CBO succeeds
 Key: HIVE-10800
 URL: https://issues.apache.org/jira/browse/HIVE-10800
 Project: Hive
  Issue Type: Sub-task
Reporter: Jesus Camacho Rodriguez
Assignee: Jesus Camacho Rodriguez






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-22 Thread Sergey Shelukhin
I think branch-2 doesn’t need to be framed as particularly adventurous
(other than due to the general increase in the amount of work done in Hive by
the community).
All the new features that normally go on trunk/master will go to branch-2.
branch-2 is just trunk as it is now, in fact there will be no branch-2,
just master :) The difference is the dropped functionality, not the added one.
So you shouldn’t lose stability if you retain the same process as now by
just staying on versions off master.

Perhaps, as is usually the case in Apache projects, developing features on
older branches would be discouraged. Right now, all features usually go on
trunk/master, and are then back ported as needed and practical; so you
wouldn’t (in Apache) make a feature on Hive 0.14 to be released in 0.14.N,
and not back port to master.

On 15/5/22, 00:49, Chris Drome cdr...@yahoo-inc.com.INVALID wrote:

I understand the motivation and benefits of creating a branch-2 where
more disruptive work can go on without affecting branch-1. While not
necessarily against this approach, from Yahoo's standpoint, I do have
some questions (concerns).
Upgrading to a new version of Hive requires a significant commitment of
time and resources to stabilize and certify a build for deployment to our
clusters. Given the size of our clusters and scale of datasets, we have
to be particularly careful about adopting new functionality. However, at
the same time we are interested in new testing and making available new
features and functionality. That said, we would have to rely on branch-1
for the immediate future.
One concern is that branch-1 would be left to stagnate, at which point
there would be no option but for users to move to branch-2 as branch-1
would be effectively end-of-lifed. I'm not sure how long this would take,
but it would eventually happen as a direct result of the very reason for
creating branch-2.
A related concern is how disruptive the code changes will be in branch-2.
I imagine that changes in early in branch-2 will be easy to backport to
branch-1, while this effort will become more difficult, if not
impractical, as time goes. If the code bases diverge too much then this
could lead to more pressure for users of branch-1 to add features just to
branch-1, which has been mentioned as undesirable. By the same token,
backporting any code in branch-2 will require an increasing amount of
effort, which contributors to branch-2 may not be interested in
committing to.
These questions affect us directly because, while we require a certain
amount of stability, we also like to pull in new functionality that will
be of value to our users. For example, our current 0.13 release is
probably closer to 0.14 at this point. Given the lifespan of a release,
it is often more palatable to backport features and bugfixes than to jump
to a new version.

The good thing about this proposal is the opportunity to evaluate and
clean up alot of the old code.
Thanks,
chris
 


 On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin
ser...@hortonworks.com wrote:
   

 Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but
some
people are set in their ways or have practical considerations and don’t
care for new shiny stuff.

On 15/5/18, 11:46, Sergey Shelukhin ser...@hortonworks.com wrote:

I think we need some path for deprecating old Hadoop versions, the same
way we deprecate old Java version support or old RDBMS version support.
At some point the cost of supporting Hadoop 1 exceeds the benefit. Same
goes for stuff like MR; supporting it, esp. for perf work, becomes a
burden, and it’s outdated with 2 alternatives, one of which has been
around for 2 releases.
The branches are a graceful way to get rid of the legacy burden.

Alternatively, when sweeping changes are made, we can do what Hbase did
(which is not pretty imho), where 0.94 version had ~30 dot releases
because people cannot upgrade to 0.96 “singularity” release.


I posit that people who run Hadoop 1 and MR at this day and age (and more
so as time passes) are people who either don’t care about perf and new
features, only stability; so, stability-focused branch would be perfect
to
support them.


On 15/5/18, 10:04, Edward Capriolo edlinuxg...@gmail.com wrote:

Up until recently Hive supported numerous versions of Hadoop code base
with
a simple shim layer. I would rather we stick to the shim layer. I think
this was easily the best part about hive was that a single release
worked
well regardless of your hadoop version. It was also a key element to
hive's
success. I do not want to see us have multiple branches.

On Sat, May 16, 2015 at 1:29 AM, Xuefu Zhang xzh...@cloudera.com
wrote:

 Thanks for the explanation, Alan!

 While I have understood more on the proposal, I actually see more
problems
 than the confusion of two lines of releases. Essentially, this
proposal
 forces a user to make a hard choice between a stabler, legacy-aware
release
 line and an adventurous, pioneering release line. And once the 

Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-22 Thread kulkarni.swar...@gmail.com
+1 on the new proposal. Feedback below:

 New features must be put into master.  Whether to put them into branch-1
is at the discretion of the developer.

How about we change this to *All* features must be put into master.
Whether to put them into branch-1 is at the discretion of the *committer*.
The reason, I think, is that going forward, for us to sustain a happy and
healthy community, it's imperative for us to make it easy not only for the
users, but also for developers and committers to contribute/commit patches.
For me as a hive contributor, it would be hard to determine which branch my
code belongs to. Also, IMO (and I might be wrong), many committers have their
own areas of expertise and it's also very hard for them to immediately
determine what branch a patch should go to unless very well documented
somewhere. Putting all code into the master would be an easy approach to
follow and then cherry picking to other branches can be done. So even if
people forget to do that, we can always go back to master and port the
patches out to these branches. So we have a master branch, a branch-1 for
stable code, branch-2 for experimental and bleeding edge code and so on.
Once branch-2 is stable, we deprecate branch-1, create branch-3 and move on.

Another reason I say this is because in my experience, a pretty significant
amount of work in hive is still bug fixes, and I think that is what the user
cares most about (correctness above anything else). So with this approach, it
might be very obvious what branches to commit this to.

On Fri, May 22, 2015 at 1:11 PM, Alan Gates alanfga...@gmail.com wrote:

 Thanks for your feedback Chris.  It sounds like there are a couple of
 reasonable concerns being voiced repeatedly:
 1) Fragmentation, the two branches will drift too far apart.
 2) Stagnation, branch-1 will effectively become a dead-end.

 So I modify the proposal as follows to deal with those:

 1) New features must be put into master.  Whether to put them into
 branch-1 is at the discretion of the developer.  The exception would be
 features that would not apply in master (e.g. say someone developed a way
 to double the speed of map reduce jobs Hive produces).  For example, I
 might choose to put the materialized view work I'm doing in both branch-1
 and master, but the HBase metastore work only in master.  This should avoid
 fragmentation by keeping branch-1 a subset of master.

 2) For the next 12 months we will port critical bug fixes (crashes,
 security issues, wrong results) to branch-1 as well as fixing them on
 master.  We might choose to lengthen this time depending on how stable
 master is and how fast the uptake is.  This avoids branch-1 being
 immediately abandoned by developers while users are still depending on it.

 Alan.

   Chris Drome cdr...@yahoo-inc.com.INVALID
  May 22, 2015 at 0:49
 I understand the motivation and benefits of creating a branch-2 where more
 disruptive work can go on without affecting branch-1. While not necessarily
 against this approach, from Yahoo's standpoint, I do have some questions
 (concerns).
 Upgrading to a new version of Hive requires a significant commitment of
 time and resources to stabilize and certify a build for deployment to our
 clusters. Given the size of our clusters and scale of datasets, we have to
 be particularly careful about adopting new functionality. However, at the
 same time we are interested in new testing and making available new
 features and functionality. That said, we would have to rely on branch-1
 for the immediate future.
 One concern is that branch-1 would be left to stagnate, at which point
 there would be no option but for users to move to branch-2 as branch-1
 would be effectively end-of-lifed. I'm not sure how long this would take,
 but it would eventually happen as a direct result of the very reason for
 creating branch-2.
 A related concern is how disruptive the code changes will be in branch-2.
 I imagine that changes in early in branch-2 will be easy to backport to
 branch-1, while this effort will become more difficult, if not impractical,
 as time goes. If the code bases diverge too much then this could lead to
 more pressure for users of branch-1 to add features just to branch-1, which
 has been mentioned as undesirable. By the same token, backporting any code
 in branch-2 will require an increasing amount of effort, which contributors
 to branch-2 may not be interested in committing to.
 These questions affect us directly because, while we require a certain
 amount of stability, we also like to pull in new functionality that will be
 of value to our users. For example, our current 0.13 release is probably
 closer to 0.14 at this point. Given the lifespan of a release, it is often
 more palatable to backport features and bugfixes than to jump to a new
 version.

 The good thing about this proposal is the opportunity to evaluate and
 clean up alot of the old code.
 Thanks,
 chris



 On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin
 

Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-22 Thread Alexander Pivovarov
Looks like we discussing 3 options:

1. Support hadoop 1, 2 and 3 in master branch.

2. Support hadoop 1 in branch-1, hadoop 2 in branch-2, hadoop 3 in branch-3

3. Support hadoop 2 and 3 in master

I DO not think option 2 is a good solution because it is much more difficult
to manage 3 active prod branches rather than one master branch.

I think we should go with options 1 or 3.

+1 on Xuefu and Edward opinion
On May 22, 2015 9:09 AM, Sergey Shelukhin ser...@hortonworks.com wrote:

 I think branch-2 doesn’t need to be framed as particularly adventurous
 (other than due to general increase of the amount of work done in Hive by
 community).
 All the new features that normally go on trunk/master will go to branch-2.
 branch-2 is just trunk as it is now, in fact there will be no branch-2,
 just master :) The difference is the dropped functionality, not added one.
 So you shouldn’t lose stability if you retain the same process as now by
 just staying on versions off master.

 Perhaps, as is usually the case in Apache projects, developing features on
 older branches would be discouraged. Right now, all features usually go on
 trunk/master, and are then back ported as needed and practical; so you
 wouldn’t (in Apache) make a feature on Hive 0.14 to be released in 0.14.N,
 and not back port to master.

 On 15/5/22, 00:49, Chris Drome cdr...@yahoo-inc.com.INVALID wrote:

  On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin
 ser...@hortonworks.com wrote:
 
 
  Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but
 some
 people are set in their ways or have practical considerations and don’t
 care for new shiny stuff.
 
 On 15/5/18, 11:46, Sergey Shelukhin ser...@hortonworks.com wrote:
 
 I think we need some path for deprecating old Hadoop versions, the same
 way we deprecate old Java version support or old RDBMS version support.
 At some point the cost of supporting Hadoop 1 exceeds the benefit. Same
 goes for stuff like MR; supporting it, esp. for perf work, becomes a
 burden, and it’s outdated with 2 alternatives, one of which has been
 around for 2 releases.
 The branches are a graceful way to get rid of the legacy burden.
 
 Alternatively, when sweeping changes are made, we can do what HBase did
 (which is not pretty, imho), where the 0.94 version had ~30 dot releases
 because people could not upgrade to the 0.96 “singularity” release.
 
 
 I posit that people who run Hadoop 1 and MR in this day and age (and more
 so as time passes) are people who don’t care about perf and new features,
 only stability; so, a stability-focused branch would be perfect to
 support them.
 
 
 On 15/5/18, 10:04, Edward Capriolo edlinuxg...@gmail.com wrote:
 
 Up until recently Hive supported numerous versions of the Hadoop code base
 with a simple shim layer. I would rather we stick to the shim layer. I think
 this was easily the best part about 

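For readers unfamiliar with the shim layer Edward mentions above, the idea is
roughly the pattern sketched below. This is an illustrative sketch only; the
interface, class, and method names are hypothetical and do not reflect Hive's
actual ShimLoader/HadoopShims code.

    // Illustrative only: hypothetical names, not Hive's real shim classes.
    // Callers program against a small interface; a loader picks the
    // implementation that matches the Hadoop version detected at runtime.
    public interface HadoopVersionShims {
        String describeResourceManager(); // e.g. JobTracker vs. YARN ResourceManager
    }

    class Hadoop1Shims implements HadoopVersionShims {
        public String describeResourceManager() { return "JobTracker (Hadoop 1 / MR1)"; }
    }

    class Hadoop2Shims implements HadoopVersionShims {
        public String describeResourceManager() { return "YARN ResourceManager (Hadoop 2)"; }
    }

    final class ShimLoader {
        // In practice the version string would come from Hadoop's VersionInfo;
        // it is passed in here to keep the sketch self-contained.
        static HadoopVersionShims load(String hadoopVersion) {
            return hadoopVersion.startsWith("1.") ? new Hadoop1Shims() : new Hadoop2Shims();
        }
    }

Because only the shim implementations touch version-specific Hadoop classes,
the rest of the code base stays version-agnostic, which is what made supporting
several Hadoop releases from one source tree practical.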
Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-22 Thread Alexander Pivovarov
Alan, your email client is not compatible with the Gmail viewer. For some
reason your reply contains the whole thread of the discussion.
On May 22, 2015 10:58 AM, Alan Gates alanfga...@gmail.com wrote:


Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-22 Thread Alan Gates
I don't think anyone is advocating for option 2, as that would be 
disastrous.  Option 3 is closest to what I'm proposing, though again 
dropping support for Hadoop 1 is only a part of it.


Alan.



Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-22 Thread Alan Gates
Thanks for your feedback, Chris.  It sounds like there are a couple of 
reasonable concerns being voiced repeatedly:

1) Fragmentation: the two branches will drift too far apart.
2) Stagnation: branch-1 will effectively become a dead-end.

So I modify the proposal as follows to deal with those:

1) New features must be put into master.  Whether to put them into 
branch-1 is at the discretion of the developer.  The exception would be 
features that would not apply in master (e.g., say someone developed a 
way to double the speed of the MapReduce jobs Hive produces).  For example, 
I might choose to put the materialized view work I'm doing in both 
branch-1 and master, but the HBase metastore work only in master.  This 
should avoid fragmentation by keeping branch-1 a subset of master.


2) For the next 12 months we will port critical bug fixes (crashes, 
security issues, wrong results) to branch-1 as well as fixing them on 
master.  We might choose to lengthen this time depending on how stable 
master is and how fast the uptake is.  This avoids branch-1 being 
immediately abandoned by developers while users are still depending on it.


Alan.

