[jira] [Created] (HIVE-10798) Remove dependence on VectorizedBatchUtil from VectorizedOrcAcidRowReader
Owen O'Malley created HIVE-10798: Summary: Remove dependence on VectorizedBatchUtil from VectorizedOrcAcidRowReader Key: HIVE-10798 URL: https://issues.apache.org/jira/browse/HIVE-10798 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley VectorizedBatchUtil has a lot of dependencies that Orc should avoid, and the code should be refactored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 34473: HIVE-10749 Implement Insert statement for parquet
On May 21, 2015, 7:18 p.m., Sergio Pena wrote:

> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java, line 59
> https://reviews.apache.org/r/34473/diff/2/?file=966160#file966160line59
>
> Could you separate words with _? Like ENABLE_ACID_SCHEMA_INFO. It helps to read the constant more easily. Do we have to enable transactions exclusively for parquet? Isn't there another variable that enables transactions on Hive that we can use?

This variable is used for setting the schema for parquet. It's only related to whether you need to write data to the base file or not. So we have to use this way to append the original data with ACID info.

On May 21, 2015, 7:18 p.m., Sergio Pena wrote:

> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java, lines 98-103
> https://reviews.apache.org/r/34473/diff/2/?file=966160#file966160line98
>
> You can use this one line to return the column list: return (List<String>) StringUtils.getStringCollection(tableProperties.getProperty(IOConstants.COLUMNS)); It will return an empty list if COLUMNS is empty.

Great suggestion!

- cheng

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34473/#review84758
---

On May 22, 2015, 6:26 a.m., cheng xu wrote:

(Updated May 22, 2015, 6:26 a.m.)

Review request for hive, Alan Gates, Owen O'Malley, and Sergio Pena.

Bugs: HIVE-10749
    https://issues.apache.org/jira/browse/HIVE-10749

Repository: hive-git

Description
---

Implement the insert statement for parquet format.
Diffs
---

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java c6fb26c
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/acid/ParquetRecordUpdater.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java f513572
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ObjectArrayWritableObjectInspector.java 571f993
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/acid/TestParquetRecordUpdater.java PRE-CREATION
  ql/src/test/queries/clientpositive/acid_parquet_insert.q PRE-CREATION
  ql/src/test/results/clientpositive/acid_parquet_insert.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/34473/diff/

Testing
---

Newly added qtest and UT passed locally.

Thanks,
cheng xu
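The one-line refactor suggested in the review above relies on Hadoop's StringUtils.getStringCollection to split the comma-separated COLUMNS table property. For readers without the Hive/Hadoop tree at hand, here is a minimal, stdlib-only sketch of the behavior being proposed; the class and method names here are hypothetical stand-ins, not code from the patch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the suggested one-liner built on
// org.apache.hadoop.util.StringUtils.getStringCollection: split a
// comma-separated "columns" table property into a list, returning an
// empty list (never null) when the property is empty or missing.
public class ColumnsSplit {
    public static List<String> splitColumns(String columns) {
        if (columns == null || columns.isEmpty()) {
            return new ArrayList<>();   // empty list, matching the reviewer's note
        }
        return new ArrayList<>(Arrays.asList(columns.split(",")));
    }

    public static void main(String[] args) {
        System.out.println(splitColumns("id,name,email")); // [id, name, email]
        System.out.println(splitColumns(""));              // []
    }
}
```

The empty-list behavior mirrors the reviewer's point that an empty COLUMNS property yields an empty collection rather than null, so callers need no null check.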
[jira] [Created] (HIVE-10796) Remove dependencies on NumericHistogram and NumDistinctValueEstimator from JavaDataModel
Owen O'Malley created HIVE-10796: Summary: Remove dependencies on NumericHistogram and NumDistinctValueEstimator from JavaDataModel Key: HIVE-10796 URL: https://issues.apache.org/jira/browse/HIVE-10796 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley The JavaDataModel class is used in a lot of places and the non-general calculations are better done in the other classes.
Re: Review Request 34473: HIVE-10749 Implement Insert statement for parquet
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34473/
---

(Updated May 22, 2015, 6:26 a.m.)

Review request for hive, Alan Gates, Owen O'Malley, and Sergio Pena.

Changes
---

Summary:
1. use some utility to reduce LOC
2. remove *ParquetRecordReaderWrapper.java* and use *ObjectArrayWritableObjectInspector* instead

Bugs: HIVE-10749
    https://issues.apache.org/jira/browse/HIVE-10749

Repository: hive-git

Description
---

Implement the insert statement for parquet format.

Diffs (updated)
---

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java c6fb26c
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/acid/ParquetRecordUpdater.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java f513572
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ObjectArrayWritableObjectInspector.java 571f993
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/acid/TestParquetRecordUpdater.java PRE-CREATION
  ql/src/test/queries/clientpositive/acid_parquet_insert.q PRE-CREATION
  ql/src/test/results/clientpositive/acid_parquet_insert.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/34473/diff/

Testing
---

Newly added qtest and UT passed locally.

Thanks,
cheng xu
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/#review84876
---

ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java
https://reviews.apache.org/r/34455/#comment136299

    use 2 spaces for indent

ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java
https://reviews.apache.org/r/34455/#comment136300

    use 2 spaces for indent

- Alexander Pivovarov

On May 22, 2015, 6:18 a.m., chengxiang li wrote:

(Updated May 22, 2015, 6:18 a.m.)

Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.

Bugs: HIVE-10550
    https://issues.apache.org/jira/browse/HIVE-10550

Repository: hive-git

Description
---

see jira description

Diffs
---

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
  ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79

Diff: https://reviews.apache.org/r/34455/diff/

Testing
---

Thanks,
chengxiang li
[jira] [Created] (HIVE-10797) Simplify the test for vectorized input
Owen O'Malley created HIVE-10797: Summary: Simplify the test for vectorized input Key: HIVE-10797 URL: https://issues.apache.org/jira/browse/HIVE-10797 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley The call to Utilities.isVectorMode should be simplified for the readers.
[jira] [Created] (HIVE-10795) Remove use of PerfLogger from Orc
Owen O'Malley created HIVE-10795: Summary: Remove use of PerfLogger from Orc Key: HIVE-10795 URL: https://issues.apache.org/jira/browse/HIVE-10795 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley PerfLogger is yet another class with a huge dependency set that Orc doesn't need.
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34455/
---

(Updated May 22, 2015, 6:18 a.m.)

Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.

Changes
---

Keep all the previous multi-insert cache code.

Bugs: HIVE-10550
    https://issues.apache.org/jira/browse/HIVE-10550

Repository: hive-git

Description
---

see jira description

Diffs (updated)
---

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
  ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79

Diff: https://reviews.apache.org/r/34455/diff/

Testing
---

Thanks,
chengxiang li
[jira] [Created] (HIVE-10799) Refactor the SearchArgumentFactory to remove the dependence on ExprNodeGenericFuncDesc
Owen O'Malley created HIVE-10799: Summary: Refactor the SearchArgumentFactory to remove the dependence on ExprNodeGenericFuncDesc Key: HIVE-10799 URL: https://issues.apache.org/jira/browse/HIVE-10799 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley SearchArgumentFactory and SearchArgumentImpl are high level and shouldn't depend on the internals of Hive's AST model.
Re: [DISCUSS] Supporting Hadoop-1 and experimental features
I understand the motivation and benefits of creating a branch-2 where more disruptive work can go on without affecting branch-1. While not necessarily against this approach, from Yahoo's standpoint, I do have some questions (concerns).

Upgrading to a new version of Hive requires a significant commitment of time and resources to stabilize and certify a build for deployment to our clusters. Given the size of our clusters and the scale of our datasets, we have to be particularly careful about adopting new functionality. At the same time, however, we are interested in testing and making new features and functionality available. That said, we would have to rely on branch-1 for the immediate future.

One concern is that branch-1 would be left to stagnate, at which point there would be no option but for users to move to branch-2, as branch-1 would be effectively end-of-lifed. I'm not sure how long this would take, but it would eventually happen as a direct result of the very reason for creating branch-2.

A related concern is how disruptive the code changes will be in branch-2. I imagine that changes early in branch-2 will be easy to backport to branch-1, while this effort will become more difficult, if not impractical, as time goes on. If the code bases diverge too much, this could lead to more pressure for users of branch-1 to add features just to branch-1, which has been mentioned as undesirable. By the same token, backporting any code from branch-2 will require an increasing amount of effort, which contributors to branch-2 may not be interested in committing to.

These questions affect us directly because, while we require a certain amount of stability, we also like to pull in new functionality that will be of value to our users. For example, our current 0.13 release is probably closer to 0.14 at this point. Given the lifespan of a release, it is often more palatable to backport features and bugfixes than to jump to a new version.
The good thing about this proposal is the opportunity to evaluate and clean up a lot of the old code.

Thanks,
chris

On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin ser...@hortonworks.com wrote:

Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some people are set in their ways or have practical considerations and don’t care for new shiny stuff.

On 15/5/18, 11:46, Sergey Shelukhin ser...@hortonworks.com wrote:

I think we need some path for deprecating old Hadoop versions, the same way we deprecate old Java version support or old RDBMS version support. At some point the cost of supporting Hadoop 1 exceeds the benefit. The same goes for things like MR; supporting it, especially for perf work, becomes a burden, and it’s outdated, with two alternatives, one of which has been around for two releases. The branches are a graceful way to get rid of the legacy burden. Alternatively, when sweeping changes are made, we can do what HBase did (which is not pretty imho), where the 0.94 version had ~30 dot releases because people cannot upgrade to the 0.96 “singularity” release. I posit that the people who run Hadoop 1 and MR at this day and age (and more so as time passes) are people who don’t care about perf and new features, only stability; so a stability-focused branch would be perfect to support them.

On 15/5/18, 10:04, Edward Capriolo edlinuxg...@gmail.com wrote:

Up until recently, Hive supported numerous versions of the Hadoop code base with a simple shim layer. I would rather we stick to the shim layer. I think this was easily the best part about hive: a single release worked well regardless of your hadoop version. It was also a key element of hive's success. I do not want to see us have multiple branches.

On Sat, May 16, 2015 at 1:29 AM, Xuefu Zhang xzh...@cloudera.com wrote:

Thanks for the explanation, Alan! While I have understood more of the proposal, I actually see more problems than the confusion of two lines of releases.
Essentially, this proposal forces a user to make a hard choice between a more stable, legacy-aware release line and an adventurous, pioneering release line. And once the choice is made, there is no easy way back or forward. Here is my interpretation. Let's say we have two main branches as proposed. I develop a new feature which I think is useful for both branches, so I commit it to both. My feature requires additional schema support, so I provide upgrade scripts for both branches. The scripts are different because the two branches have already diverged in schema. Now the two branches evolve in a diverging fashion like this. This is all good as long as a user stays in his line. The moment the user considers a switch, most likely from branch-1 to branch-2, he is stuck. Why? Because there is no upgrade path from a release in branch-1 to a release in branch-2! If we want to provide an upgrade path, then there will be MxN paths, where M and N are the number of releases in the two branches,
Re: [DISCUSS] Supporting Hadoop-1 and experimental features
On Fri, May 22, 2015 at 1:19 PM, Alan Gates alanfga...@gmail.com wrote:

> I see your point on saying the contributor may not understand where best to put the patch, and thus the committer decides. However, it would be very disappointing for a contributor who uses branch-1 to build a new feature only to have the committer put it only in master. So I would modify your modification to say at the discretion of the contributor and Hive committers.

For what it's worth, this is more or less how HBase works. All features land first in master and then percolate backwards to open, active branches where it's acceptable to do so. Since our 1.0 release, we're trying to make 1.0+ follow semantic versioning more closely. This means that new features never land in a released minor branch. Bug fixes are applied to all applicable branches; sometimes this means older release branches and not master. Sometimes that means contributors are forced to upgrade in order to take advantage of their contribution in an Apache release (they're fine to run their own patched builds as they like; it's open source).

Right now we have:

master - (unreleased, development branch for eventual 2.0)
branch-1 - (unreleased, development branch for 1.x series, soon to be branch basis for 1.2)
branch-1.1 - (released branch, accepting only bug fixes for 1.1.x line)
branch-1.0 - (released branch, accepting only bug fixes for 1.0.x line)

When we're ready, branch-1.2 will fork from branch-1, and branch-1 will become the development branch for 1.3. Eventually we'll decide it's time for 2.0 and master will be branched, creating branch-2. branch-2 will follow the same process. We also maintain active branches for 0.98.x and 0.94.x. These branches are different, following our old model of receiving backward-compatible new features in .x versions. 0.94 is basically retired now, only getting bug fixes.
0.94 is hadoop-1 only; 0.98 supports both hadoop-1 and hadoop-2 (maybe we've retired hadoop-2 support here in the .12 release?); 1.x supports hadoop-2 only. 2.0 is undecided, but presumably will be hadoop-2 and hadoop-3 if we can extend our shim layer for it. We have separate release managers for 0.94, 0.98, 1.0, and 1.1, and we're discussing preparations for 1.2. They enforce commits against their respective branches.

kulkarni.swar...@gmail.com May 22, 2015 at 11:41

+1 on the new proposal. Feedback below:

> New features must be put into master. Whether to put them into branch-1 is at the discretion of the developer.

How about we change this to: *All* features must be put into master. Whether to put them into branch-1 is at the discretion of the *committer*. The reason, I think, is that going forward, for us to sustain as a happy and healthy community, it's imperative for us to make it easy not only for the users but also for developers and committers to contribute/commit patches. To me, as a hive contributor, it would be hard to determine which branch my code belongs in. Also, IMO (and I might be wrong), many committers have their own areas of expertise, and it's very hard for them to immediately determine which branch a patch should go to unless that is very well documented somewhere. Putting all code into master would be an easy approach to follow, and then cherry-picking to other branches can be done. So even if people forget to do that, we can always go back to master and port the patches out to these branches. So we have a master branch, a branch-1 for stable code, branch-2 for experimental and bleeding-edge code, and so on. Once branch-2 is stable, we deprecate branch-1, create branch-3, and move on.

Another reason I say this is because, in my experience, a pretty significant amount of work in hive is still bug fixes, and I think that is what the user cares most about (correctness above anything else).
So with this approach, it might be very obvious which branches to commit this to.

-- Swarnim

Chris Drome cdr...@yahoo-inc.com.INVALID May 22, 2015 at 0:49

I understand the motivation and benefits of creating a branch-2 where more disruptive work can go on without affecting branch-1. While not necessarily against this approach, from Yahoo's standpoint, I do have some questions (concerns). Upgrading to a new version of Hive requires a significant commitment of time and resources to stabilize and certify a build for deployment to our clusters. Given the size of our clusters and scale of datasets, we have to be particularly careful about adopting new functionality. However, at the same time we are interested in new testing and making available new features and functionality. That said, we would have to rely on branch-1 for the immediate future. One concern is that branch-1 would be left to stagnate, at which point there would be no option but for users to move to branch-2 as branch-1 would be effectively end-of-lifed. I'm not sure how long this would take, but it would eventually happen as a direct result of the
[jira] [Created] (HIVE-10804) CBO: Calcite Operator To Hive Operator (Calcite Return Path): optimizer for limit 0 does not work
Pengcheng Xiong created HIVE-10804: Summary: CBO: Calcite Operator To Hive Operator (Calcite Return Path): optimizer for limit 0 does not work Key: HIVE-10804 URL: https://issues.apache.org/jira/browse/HIVE-10804 Project: Hive Issue Type: Sub-task Reporter: Pengcheng Xiong Assignee: Pengcheng Xiong
{code}
explain select key,value from src order by key limit 0
POSTHOOK: type: QUERY
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: src
            Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: key (type: string), value (type: string)
              outputColumnNames: key, value
              Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: key (type: string)
                sort order: +
                Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
                value expressions: value (type: string)
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: string), VALUE.value (type: string)
          outputColumnNames: key, value
          Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 0
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
{code}
Re: [DISCUSS] Supporting Hadoop-1 and experimental features
I agree with *All* features with the exception that some features might be branch-1 specific (if it's a feature on something no longer supported in master, like hadoop-1). Without this we prevent new features for older technology, which doesn't strike me as reasonable. I see your point on saying the contributor may not understand where best to put the patch, and thus the committer decides. However, it would be very disappointing for a contributor who uses branch-1 to build a new feature only to have the committer put it only in master. So I would modify your modification to say at the discretion of the contributor and Hive committers. Alan. kulkarni.swar...@gmail.com mailto:kulkarni.swar...@gmail.com May 22, 2015 at 11:41 +1 on the new proposal. Feedback below: New features must be put into master. Whether to put them into branch-1 is at the discretion of the developer. How about we change this to *_All_* features must be put into master. Whether to put them into branch-1 is at the discretion of the *_committer_*. The reason I think is going forward for us to sustain as a happy and healthy community, it's imperative for us to make it not only easy for the users, but also for developers and committers to contribute/commit patches. To me being a hive contributor would be hard to determine which branch my code belongs. Also IMO(and I might be wrong) but many committers have their own areas of expertise and it's also very hard for them to immediately determine what branch a patch should go to unless very well documented somewhere. Putting all code into the master would be an easy approach to follow and then cherry picking to other branches can be done. So even if people forget to do that, we can always go back to master and port the patches out to these branches. So we have a master branch, a branch-1 for stable code, branch-2 for experimental and bleeding edge code and so on. Once branch-2 is stable, we deprecate branch-1, create branch-3 and move on. 
Another reason I say this is because in my experience, a pretty significant amount of work is hive is still bug fixes and I think that is what the user cares most about(correctness above anything else). So with this approach, might be very obvious to what branches to commit this to. -- Swarnim Chris Drome mailto:cdr...@yahoo-inc.com.INVALID May 22, 2015 at 0:49 I understand the motivation and benefits of creating a branch-2 where more disruptive work can go on without affecting branch-1. While not necessarily against this approach, from Yahoo's standpoint, I do have some questions (concerns). Upgrading to a new version of Hive requires a significant commitment of time and resources to stabilize and certify a build for deployment to our clusters. Given the size of our clusters and scale of datasets, we have to be particularly careful about adopting new functionality. However, at the same time we are interested in new testing and making available new features and functionality. That said, we would have to rely on branch-1 for the immediate future. One concern is that branch-1 would be left to stagnate, at which point there would be no option but for users to move to branch-2 as branch-1 would be effectively end-of-lifed. I'm not sure how long this would take, but it would eventually happen as a direct result of the very reason for creating branch-2. A related concern is how disruptive the code changes will be in branch-2. I imagine that changes in early in branch-2 will be easy to backport to branch-1, while this effort will become more difficult, if not impractical, as time goes. If the code bases diverge too much then this could lead to more pressure for users of branch-1 to add features just to branch-1, which has been mentioned as undesirable. By the same token, backporting any code in branch-2 will require an increasing amount of effort, which contributors to branch-2 may not be interested in committing to. 
These questions affect us directly because, while we require a certain amount of stability, we also like to pull in new functionality that will be of value to our users. For example, our current 0.13 release is probably closer to 0.14 at this point. Given the lifespan of a release, it is often more palatable to backport features and bugfixes than to jump to a new version. The good thing about this proposal is the opportunity to evaluate and clean up alot of the old code. Thanks, chris On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin ser...@hortonworks.com wrote: Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some people are set in their ways or have practical considerations and don’t care for new shiny stuff. Sergey Shelukhin mailto:ser...@hortonworks.com May 18, 2015 at 11:47 Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some people are set in their ways or have practical considerations and don’t care for new shiny
[jira] [Created] (HIVE-10802) Table join query with some constant field in select fails
Aihua Xu created HIVE-10802: Summary: Table join query with some constant field in select fails Key: HIVE-10802 URL: https://issues.apache.org/jira/browse/HIVE-10802 Project: Hive Issue Type: Bug Components: Query Planning Affects Versions: 1.2.0 Reporter: Aihua Xu

The following query fails:
{noformat}
create table tb1 (year string, month string);
create table tb2(month string);
select unix_timestamp(a.year) from (select * from tb1 where year='2001') a join tb2 b on (a.month=b.month);
{noformat}
with the exception
{noformat}
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
  at java.util.ArrayList.rangeCheck(ArrayList.java:635)
  at java.util.ArrayList.get(ArrayList.java:411)
  at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.init(StandardStructObjectInspector.java:118)
  at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.init(StandardStructObjectInspector.java:109)
  at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.getStandardStructObjectInspector(ObjectInspectorFactory.java:290)
  at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.getStandardStructObjectInspector(ObjectInspectorFactory.java:275)
  at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.getJoinOutputObjectInspector(CommonJoinOperator.java:175)
{noformat}
[jira] [Created] (HIVE-10803) document jdbc url format properly
Thejas M Nair created HIVE-10803: Summary: document jdbc url format properly Key: HIVE-10803 URL: https://issues.apache.org/jira/browse/HIVE-10803 Project: Hive Issue Type: Bug Components: Documentation, HiveServer2 Reporter: Thejas M Nair

This is the format of the HS2 connection string; it needs to be documented in the wiki doc (taken from jdbc.Utils.java):

jdbc:hive2://host1:port1,host2:port2/dbName;sess_var_list?hive_conf_list#hive_var_list
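HIVE-10803 above asks for this URL format to be documented. As a rough illustration of how the pieces fit together (host list, database, then `;` session variables, `?` hive conf variables, and `#` hive variables), here is a small stdlib-only Java sketch; the class name and parsing approach are hypothetical, not Hive's actual jdbc.Utils parser:

```java
// Illustrative (hypothetical) parser for the HS2 JDBC URL layout quoted above:
// jdbc:hive2://hosts/db;sess_var_list?hive_conf_list#hive_var_list
public class Hs2UrlParts {
    public final String hosts, db, sessVars, hiveConf, hiveVars;

    public Hs2UrlParts(String url) {
        String rest = url.substring("jdbc:hive2://".length());
        String[] hashSplit = rest.split("#", 2);          // trailing #hive_var_list
        hiveVars = hashSplit.length > 1 ? hashSplit[1] : "";
        String[] qSplit = hashSplit[0].split("\\?", 2);   // ?hive_conf_list
        hiveConf = qSplit.length > 1 ? qSplit[1] : "";
        String[] semiSplit = qSplit[0].split(";", 2);     // ;sess_var_list
        sessVars = semiSplit.length > 1 ? semiSplit[1] : "";
        int slash = semiSplit[0].indexOf('/');            // hosts/dbName
        hosts = slash >= 0 ? semiSplit[0].substring(0, slash) : semiSplit[0];
        db = slash >= 0 ? semiSplit[0].substring(slash + 1) : "";
    }

    public static void main(String[] args) {
        Hs2UrlParts p = new Hs2UrlParts(
            "jdbc:hive2://h1:10000,h2:10000/default;ssl=true?hive.exec.parallel=true#a=1");
        System.out.println(p.hosts);    // h1:10000,h2:10000
        System.out.println(p.db);       // default
        System.out.println(p.sessVars); // ssl=true
    }
}
```

Note the comma-separated host list before the database name, which is what enables HS2 high-availability style connection strings.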
[jira] [Created] (HIVE-10805) OOM in vectorized reduce
Matt McCline created HIVE-10805: --- Summary: OOM in vectorized reduce Key: HIVE-10805 URL: https://issues.apache.org/jira/browse/HIVE-10805 Project: Hive Issue Type: Bug Reporter: Matt McCline Assignee: Matt McCline Priority: Blocker Fix For: 1.2.1 Vectorized reduce does not release scratch byte space in BytesColumnVectors and runs out of memory.
[jira] [Created] (HIVE-10806) Incorrect example for exploding map function in hive wiki
anup b created HIVE-10806: Summary: Incorrect example for exploding map function in hive wiki Key: HIVE-10806 URL: https://issues.apache.org/jira/browse/HIVE-10806 Project: Hive Issue Type: Bug Components: Documentation Affects Versions: 0.10.0 Reporter: anup b Priority: Trivial

In the hive wiki, the example for exploding a map is wrong; it doesn't work in hive 0.10.

Example given in the wiki, which doesn't work:
SELECT explode(myMap) AS myMapKey, myMapValue FROM myMapTable;

It should be updated to:
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;

Link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
[jira] [Created] (HIVE-10809) HCat FileOutputCommitterContainer leaves behind empty _SCRATCH directories
Selina Zhang created HIVE-10809: Summary: HCat FileOutputCommitterContainer leaves behind empty _SCRATCH directories Key: HIVE-10809 URL: https://issues.apache.org/jira/browse/HIVE-10809 Project: Hive Issue Type: Bug Components: HCatalog Affects Versions: 1.2.0 Reporter: Selina Zhang Assignee: Selina Zhang

When a static partition is added through HCatStorer or HCatWriter:
{code}
JoinedData = LOAD '/user/selinaz/data/part-r-0' USING JsonLoader();
STORE JoinedData INTO 'selina.joined_events_e' USING org.apache.hive.hcatalog.pig.HCatStorer('author=selina');
{code}
the table directory looks like:
{noformat}
drwx-- - selinaz users 0 2015-05-22 21:19 /user/selinaz/joined_events_e/_SCRATCH0.9157208938193798
drwx-- - selinaz users 0 2015-05-22 21:19 /user/selinaz/joined_events_e/author=selina
{noformat}
[jira] [Created] (HIVE-10808) Inner join on Null throwing Cast Exception
Naveen Gangam created HIVE-10808: Summary: Inner join on Null throwing Cast Exception Key: HIVE-10808 URL: https://issues.apache.org/jira/browse/HIVE-10808 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 0.13.1 Reporter: Naveen Gangam Assignee: Naveen Gangam Priority: Critical select a.col1, a.col2, a.col3, a.col4 from tab1 a inner join ( select max(x) as x from tab1 where x 20130327 ) r on a.x = r.x where a.col1 = 'F' and a.col3 in ('A', 'S', 'G'); Failed Task log snippet: 2015-05-18 19:22:17,372 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ObjectCache: Ignoring retrieval request: __MAP_PLAN__ 2015-05-18 19:22:17,372 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ObjectCache: Ignoring cache key: __MAP_PLAN__ 2015-05-18 19:22:17,457 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:446) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106) ... 9 more Caused by: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38) ... 14 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106) ... 17 more Caused by: java.lang.RuntimeException: Map operator initialization failed at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:157) ... 22 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.NullStructSerDe$NullStructSerDeObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:334) at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:352) at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:126) ... 
22 more Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.NullStructSerDe$NullStructSerDeObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.isInstanceOfSettableOI(ObjectInspectorUtils.java:) at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.hasAllFieldsSettable(ObjectInspectorUtils.java:1149) at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConvertedOI(ObjectInspectorConverters.java:219) at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConvertedOI(ObjectInspectorConverters.java:183) at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:316) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
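The root cause above can be illustrated without Hive: a minimal sketch (Python, not Hive code) of standard SQL inner-join semantics, under which the NULL join key produced by the max(x) subquery should simply match nothing, so the query ought to return an empty result set rather than fail with the ClassCastException shown in the trace.

```python
# Illustrative sketch (not Hive code): under SQL semantics a NULL join key
# never satisfies an equality predicate, so `a.x = r.x` with r.x = NULL
# should match no rows -- an empty result, not a ClassCastException.

def inner_join_on_key(left, right):
    """Naive inner join on the 'x' key; NULL (None) keys never match."""
    return [
        (l, r)
        for l in left
        for r in right
        if l["x"] is not None and r["x"] is not None and l["x"] == r["x"]
    ]

# max(x) over a filtered-out set yields NULL (None here):
subquery_result = [{"x": None}]
rows = [{"x": 20130101}, {"x": 20130201}]
print(inner_join_on_key(rows, subquery_result))  # -> []
```

The crash in the trace happens during map-operator initialization, before this join semantics is ever exercised: the all-NULL subquery output is typed with a null-struct object inspector that the converter cannot cast to a primitive inspector.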
[jira] [Created] (HIVE-10807) Invalidate basic stats for insert queries if autogather=false
Ashutosh Chauhan created HIVE-10807: --- Summary: Invalidate basic stats for insert queries if autogather=false Key: HIVE-10807 URL: https://issues.apache.org/jira/browse/HIVE-10807 Project: Hive Issue Type: Bug Components: Statistics Affects Versions: 1.2.0 Reporter: Gopal V Assignee: Ashutosh Chauhan Setting stats.autogather=false leads to incorrect basic stats in the case of insert statements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
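A toy model (Python; TableStats, accurate, and insert are illustrative names, not Hive's metastore API) of the problem the summary describes: with autogather disabled, an insert that leaves the stored row count untouched produces stale, incorrect basic stats, and the invalidation the JIRA proposes would at least flag them as inaccurate.

```python
# Toy model (not Hive code) of the basic-stats problem: with autogather
# disabled, an INSERT adds rows but the stored numRows is left untouched,
# so any consumer (e.g. the optimizer) sees a stale, incorrect value.
# The behavior proposed in the JIRA is to invalidate the stats instead.

class TableStats:
    def __init__(self):
        self.num_rows = 0
        self.accurate = True  # analogous to a stats-accuracy flag

    def insert(self, rows_added, autogather):
        if autogather:
            self.num_rows += rows_added  # stats kept up to date
        else:
            self.accurate = False        # proposed behavior: invalidate

stats = TableStats()
stats.insert(100, autogather=False)
print(stats.num_rows, stats.accurate)  # stale count, but flagged inaccurate
```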
[jira] [Created] (HIVE-10801) 'drop view' fails throwing java.lang.NullPointerException
Hari Sankar Sivarama Subramaniyan created HIVE-10801: Summary: 'drop view' fails throwing java.lang.NullPointerException Key: HIVE-10801 URL: https://issues.apache.org/jira/browse/HIVE-10801 Project: Hive Issue Type: Bug Reporter: Hari Sankar Sivarama Subramaniyan Assignee: Hari Sankar Sivarama Subramaniyan When trying to drop a view, hive log shows: {code} 2015-05-21 11:53:06,126 ERROR [HiveServer2-Background-Pool: Thread-197]: hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) - Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !! 2015-05-21 11:53:06,134 ERROR [HiveServer2-Background-Pool: Thread-197]: metastore.RetryingHMSHandler (RetryingHMSHandler.java:invoke(155)) - MetaException(message:java.lang.NullPointerException) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:5379) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_table_with_environment_context(HiveMetaStore.java:1734) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) at com.sun.proxy.$Proxy7.drop_table_with_environment_context(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.drop_table_with_environment_context(HiveMetaStoreClient.java:2056) at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.drop_table_with_environment_context(SessionHiveMetaStoreClient.java:118) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:968) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:904) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156) at com.sun.proxy.$Proxy8.dropTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:1035) at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:972) at org.apache.hadoop.hive.ql.exec.DDLTask.dropTable(DDLTask.java:3836) at org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3692) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:331) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1650) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1409) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1192) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1054) at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154) at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71) at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1213) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_table_core(HiveMetaStore.java:1546) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_table_with_environment_context(HiveMetaStore.java:1723) ... 40 more
{code}
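For context, a minimal sketch (Python; hypothetical names, not the actual Hadoop23Shims code) of the failure mode the trace points at: the earlier log line shows no dfs.encryption.key.provider.uri was configured, so the shim's key provider is null, and calling through it unconditionally in isPathEncrypted raises the NPE; a guard that treats a missing key provider as "nothing is encrypted" avoids it.

```python
# Toy model (not the real shim): when no dfs.encryption.key.provider.uri is
# configured, the key provider is None, and dereferencing it unconditionally
# is the Python equivalent of the NPE above. A null guard that treats
# "no key provider" as "no encryption zones" sidesteps the crash.

class EncryptionShim:
    def __init__(self, key_provider=None):
        self.key_provider = key_provider  # None when the URI is not configured

    def is_path_encrypted(self, path):
        if self.key_provider is None:
            return False  # no KMS configured => nothing can be encrypted
        return path in self.key_provider.get("zones", [])

shim = EncryptionShim()  # no dfs.encryption.key.provider.uri set
print(shim.is_path_encrypted("/apps/hive/warehouse/myview"))  # -> False
```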
Build failed in Jenkins: HIVE-TRUNK-JAVA8 #72
See http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/HIVE-TRUNK-JAVA8/72/ -- Started by timer Building in workspace http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/HIVE-TRUNK-JAVA8/ws/ git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository git config remote.origin.url https://git-wip-us.apache.org/repos/asf/hive.git # timeout=10 Fetching upstream changes from https://git-wip-us.apache.org/repos/asf/hive.git git --version # timeout=10 git fetch --tags --progress https://git-wip-us.apache.org/repos/asf/hive.git +refs/heads/*:refs/remotes/origin/* ERROR: Error fetching remote repo 'origin' ERROR: Error fetching remote repo 'origin' Archiving artifacts Recording test results
Re: Review Request 34593: HIVE-10702 COUNT(*) over windowing 'x preceding and y preceding' doesn't work properly
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34593/ --- (Updated May 22, 2015, 11:57 a.m.) Review request for hive. Repository: hive-git Description (updated) --- HIVE-10702 COUNT(*) over windowing 'x preceding and y preceding' doesn't work properly Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/PTFRollingPartition.java e195c0a2815687ded15d186cfe6279fdbc212819 ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java d7817d90dce7c851affdf35aff65ce3de259c866 ql/src/test/queries/clientpositive/windowing_windowspec2.q 3e8aa93494c0ad9119f475deca9edef74beb8a46 ql/src/test/results/clientpositive/windowing_windowspec2.q.out 0879344a2364532c53ffc697ea402d99701d3723 Diff: https://reviews.apache.org/r/34593/diff/ Testing --- Test has been done here https://issues.apache.org/jira/browse/HIVE-10702 Seems one test failed for unrelated reason. Thanks, Aihua Xu
Review Request 34593: HIVE-10702 COUNT(*) over windowing 'x preceding and y preceding' doesn't work properly
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/34593/ --- Review request for hive. Repository: hive-git Description --- HIVE-10702 COUNT(*) over windowing 'x preceding and y preceding' doesn't work properly Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/PTFRollingPartition.java e195c0a2815687ded15d186cfe6279fdbc212819 ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java d7817d90dce7c851affdf35aff65ce3de259c866 ql/src/test/queries/clientpositive/windowing_windowspec2.q 3e8aa93494c0ad9119f475deca9edef74beb8a46 ql/src/test/results/clientpositive/windowing_windowspec2.q.out 0879344a2364532c53ffc697ea402d99701d3723 Diff: https://reviews.apache.org/r/34593/diff/ Testing --- Test has been done here https://issues.apache.org/jira/browse/HIVE-10702 Seems one test failed for unrelated reason. Thanks, Aihua Xu
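To make the semantics under review concrete, here is a small sketch (Python, not the Hive implementation) of what COUNT(*) over ROWS BETWEEN x PRECEDING AND y PRECEDING should produce: for row i the frame covers rows [i - x, i - y], clipped to the partition, so the count ramps up near the start of the partition.

```python
# Sketch (not the Hive code under review) of the frame semantics:
# ROWS BETWEEN x PRECEDING AND y PRECEDING for row i covers the row
# indices [i - x, i - y], clipped to the partition boundaries, so
# COUNT(*) is simply the number of in-range indices.

def count_over_preceding_frame(n_rows, x, y):
    counts = []
    for i in range(n_rows):
        start = max(0, i - x)
        end = i - y  # inclusive; a negative end means an empty frame
        counts.append(max(0, end - start + 1))
    return counts

# 5-row partition, frame '3 preceding and 1 preceding':
print(count_over_preceding_frame(5, 3, 1))  # -> [0, 1, 2, 3, 3]
```

The first row's count is 0 because no preceding rows exist yet, which is exactly the boundary behavior the bug report says was handled incorrectly.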
[jira] [Created] (HIVE-10800) CBO (Calcite Return Path): Setup correct information if CBO succeeds
Jesus Camacho Rodriguez created HIVE-10800: -- Summary: CBO (Calcite Return Path): Setup correct information if CBO succeeds Key: HIVE-10800 URL: https://issues.apache.org/jira/browse/HIVE-10800 Project: Hive Issue Type: Sub-task Reporter: Jesus Camacho Rodriguez Assignee: Jesus Camacho Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [DISCUSS] Supporting Hadoop-1 and experimental features
I think branch-2 doesn’t need to be framed as particularly adventurous (other than due to the general increase of the amount of work done in Hive by the community). All the new features that normally go on trunk/master will go to branch-2. branch-2 is just trunk as it is now, in fact there will be no branch-2, just master :) The difference is the dropped functionality, not the added one. So you shouldn’t lose stability if you retain the same process as now by just staying on versions off master. Perhaps, as is usually the case in Apache projects, developing features on older branches would be discouraged. Right now, all features usually go on trunk/master, and are then back ported as needed and practical; so you wouldn’t (in Apache) make a feature on Hive 0.14 to be released in 0.14.N, and not back port to master. On 15/5/22, 00:49, Chris Drome cdr...@yahoo-inc.com.INVALID wrote: I understand the motivation and benefits of creating a branch-2 where more disruptive work can go on without affecting branch-1. While not necessarily against this approach, from Yahoo's standpoint, I do have some questions (concerns). Upgrading to a new version of Hive requires a significant commitment of time and resources to stabilize and certify a build for deployment to our clusters. Given the size of our clusters and scale of datasets, we have to be particularly careful about adopting new functionality. However, at the same time we are interested in new testing and making available new features and functionality. That said, we would have to rely on branch-1 for the immediate future. One concern is that branch-1 would be left to stagnate, at which point there would be no option but for users to move to branch-2 as branch-1 would be effectively end-of-lifed. I'm not sure how long this would take, but it would eventually happen as a direct result of the very reason for creating branch-2. A related concern is how disruptive the code changes will be in branch-2. 
I imagine that changes early in branch-2 will be easy to backport to branch-1, while this effort will become more difficult, if not impractical, as time goes on. If the code bases diverge too much then this could lead to more pressure for users of branch-1 to add features just to branch-1, which has been mentioned as undesirable. By the same token, backporting any code in branch-2 will require an increasing amount of effort, which contributors to branch-2 may not be interested in committing to. These questions affect us directly because, while we require a certain amount of stability, we also like to pull in new functionality that will be of value to our users. For example, our current 0.13 release is probably closer to 0.14 at this point. Given the lifespan of a release, it is often more palatable to backport features and bugfixes than to jump to a new version. The good thing about this proposal is the opportunity to evaluate and clean up a lot of the old code. Thanks, chris On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin ser...@hortonworks.com wrote: Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some people are set in their ways or have practical considerations and don’t care for new shiny stuff. On 15/5/18, 11:46, Sergey Shelukhin ser...@hortonworks.com wrote: I think we need some path for deprecating old Hadoop versions, the same way we deprecate old Java version support or old RDBMS version support. At some point the cost of supporting Hadoop 1 exceeds the benefit. Same goes for stuff like MR; supporting it, esp. for perf work, becomes a burden, and it’s outdated with 2 alternatives, one of which has been around for 2 releases. The branches are a graceful way to get rid of the legacy burden. Alternatively, when sweeping changes are made, we can do what Hbase did (which is not pretty imho), where 0.94 version had ~30 dot releases because people cannot upgrade to 0.96 “singularity” release. 
I posit that people who run Hadoop 1 and MR in this day and age (and more so as time passes) are people who either don’t care about perf and new features, only stability; so, a stability-focused branch would be perfect to support them. On 15/5/18, 10:04, Edward Capriolo edlinuxg...@gmail.com wrote: Up until recently Hive supported numerous versions of the Hadoop code base with a simple shim layer. I would rather we stick to the shim layer. I think this was easily the best part about hive: a single release worked well regardless of your hadoop version. It was also a key element to hive's success. I do not want to see us have multiple branches. On Sat, May 16, 2015 at 1:29 AM, Xuefu Zhang xzh...@cloudera.com wrote: Thanks for the explanation, Alan! While I have understood more on the proposal, I actually see more problems than the confusion of two lines of releases. Essentially, this proposal forces a user to make a hard choice between a stabler, legacy-aware release line and an adventurous, pioneering release line. And once the
Re: [DISCUSS] Supporting Hadoop-1 and experimental features
+1 on the new proposal. Feedback below: New features must be put into master. Whether to put them into branch-1 is at the discretion of the developer. How about we change this to *All* features must be put into master. Whether to put them into branch-1 is at the discretion of the *committer*. The reason, I think, is that going forward, for us to sustain as a happy and healthy community, it's imperative for us to make it not only easy for the users, but also for developers and committers to contribute/commit patches. To me as a hive contributor it would be hard to determine which branch my code belongs to. Also IMO (and I might be wrong) many committers have their own areas of expertise and it's also very hard for them to immediately determine what branch a patch should go to unless very well documented somewhere. Putting all code into the master would be an easy approach to follow and then cherry picking to other branches can be done. So even if people forget to do that, we can always go back to master and port the patches out to these branches. So we have a master branch, a branch-1 for stable code, branch-2 for experimental and bleeding edge code and so on. Once branch-2 is stable, we deprecate branch-1, create branch-3 and move on. Another reason I say this is because in my experience, a pretty significant amount of work in hive is still bug fixes and I think that is what the user cares most about (correctness above anything else). So with this approach, it might be very obvious what branches to commit these to. On Fri, May 22, 2015 at 1:11 PM, Alan Gates alanfga...@gmail.com wrote: Thanks for your feedback Chris. It sounds like there are a couple of reasonable concerns being voiced repeatedly: 1) Fragmentation, the two branches will drift too far apart. 2) Stagnation, branch-1 will effectively become a dead-end. So I modify the proposal as follows to deal with those: 1) New features must be put into master. 
Whether to put them into branch-1 is at the discretion of the developer. The exception would be features that would not apply in master (e.g. say someone developed a way to double the speed of map reduce jobs Hive produces). For example, I might choose to put the materialized view work I'm doing in both branch-1 and master, but the HBase metastore work only in master. This should avoid fragmentation by keeping branch-1 a subset of master. 2) For the next 12 months we will port critical bug fixes (crashes, security issues, wrong results) to branch-1 as well as fixing them on master. We might choose to lengthen this time depending on how stable master is and how fast the uptake is. This avoids branch-1 being immediately abandoned by developers while users are still depending on it. Alan. Chris Drome cdr...@yahoo-inc.com.INVALID May 22, 2015 at 0:49 I understand the motivation and benefits of creating a branch-2 where more disruptive work can go on without affecting branch-1. While not necessarily against this approach, from Yahoo's standpoint, I do have some questions (concerns). Upgrading to a new version of Hive requires a significant commitment of time and resources to stabilize and certify a build for deployment to our clusters. Given the size of our clusters and scale of datasets, we have to be particularly careful about adopting new functionality. However, at the same time we are interested in new testing and making available new features and functionality. That said, we would have to rely on branch-1 for the immediate future. One concern is that branch-1 would be left to stagnate, at which point there would be no option but for users to move to branch-2 as branch-1 would be effectively end-of-lifed. I'm not sure how long this would take, but it would eventually happen as a direct result of the very reason for creating branch-2. A related concern is how disruptive the code changes will be in branch-2. 
I imagine that changes early in branch-2 will be easy to backport to branch-1, while this effort will become more difficult, if not impractical, as time goes on. If the code bases diverge too much then this could lead to more pressure for users of branch-1 to add features just to branch-1, which has been mentioned as undesirable. By the same token, backporting any code in branch-2 will require an increasing amount of effort, which contributors to branch-2 may not be interested in committing to. These questions affect us directly because, while we require a certain amount of stability, we also like to pull in new functionality that will be of value to our users. For example, our current 0.13 release is probably closer to 0.14 at this point. Given the lifespan of a release, it is often more palatable to backport features and bugfixes than to jump to a new version. The good thing about this proposal is the opportunity to evaluate and clean up a lot of the old code. Thanks, chris On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin
Re: [DISCUSS] Supporting Hadoop-1 and experimental features
Looks like we are discussing 3 options: 1. Support hadoop 1, 2 and 3 in master branch. 2. Support hadoop 1 in branch-1, hadoop 2 in branch-2, hadoop 3 in branch-3 3. Support hadoop 2 and 3 in master I do NOT think option 2 is a good solution because it is much more difficult to manage 3 active prod branches rather than one master branch. I think we should go with options 1 or 3. +1 on Xuefu's and Edward's opinions On May 22, 2015 9:09 AM, Sergey Shelukhin ser...@hortonworks.com wrote: I think branch-2 doesn’t need to be framed as particularly adventurous (other than due to the general increase of the amount of work done in Hive by the community). All the new features that normally go on trunk/master will go to branch-2. branch-2 is just trunk as it is now, in fact there will be no branch-2, just master :) The difference is the dropped functionality, not the added one. So you shouldn’t lose stability if you retain the same process as now by just staying on versions off master. Perhaps, as is usually the case in Apache projects, developing features on older branches would be discouraged. Right now, all features usually go on trunk/master, and are then back ported as needed and practical; so you wouldn’t (in Apache) make a feature on Hive 0.14 to be released in 0.14.N, and not back port to master. On 15/5/22, 00:49, Chris Drome cdr...@yahoo-inc.com.INVALID wrote: I understand the motivation and benefits of creating a branch-2 where more disruptive work can go on without affecting branch-1. While not necessarily against this approach, from Yahoo's standpoint, I do have some questions (concerns). Upgrading to a new version of Hive requires a significant commitment of time and resources to stabilize and certify a build for deployment to our clusters. Given the size of our clusters and scale of datasets, we have to be particularly careful about adopting new functionality. However, at the same time we are interested in new testing and making available new features and functionality. 
That said, we would have to rely on branch-1 for the immediate future. One concern is that branch-1 would be left to stagnate, at which point there would be no option but for users to move to branch-2 as branch-1 would be effectively end-of-lifed. I'm not sure how long this would take, but it would eventually happen as a direct result of the very reason for creating branch-2. A related concern is how disruptive the code changes will be in branch-2. I imagine that changes early in branch-2 will be easy to backport to branch-1, while this effort will become more difficult, if not impractical, as time goes on. If the code bases diverge too much then this could lead to more pressure for users of branch-1 to add features just to branch-1, which has been mentioned as undesirable. By the same token, backporting any code in branch-2 will require an increasing amount of effort, which contributors to branch-2 may not be interested in committing to. These questions affect us directly because, while we require a certain amount of stability, we also like to pull in new functionality that will be of value to our users. For example, our current 0.13 release is probably closer to 0.14 at this point. Given the lifespan of a release, it is often more palatable to backport features and bugfixes than to jump to a new version. The good thing about this proposal is the opportunity to evaluate and clean up a lot of the old code. Thanks, chris On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin ser...@hortonworks.com wrote: Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some people are set in their ways or have practical considerations and don’t care for new shiny stuff. On 15/5/18, 11:46, Sergey Shelukhin ser...@hortonworks.com wrote: I think we need some path for deprecating old Hadoop versions, the same way we deprecate old Java version support or old RDBMS version support. At some point the cost of supporting Hadoop 1 exceeds the benefit. 
Same goes for stuff like MR; supporting it, esp. for perf work, becomes a burden, and it’s outdated with 2 alternatives, one of which has been around for 2 releases. The branches are a graceful way to get rid of the legacy burden. Alternatively, when sweeping changes are made, we can do what Hbase did (which is not pretty imho), where 0.94 version had ~30 dot releases because people cannot upgrade to 0.96 “singularity” release. I posit that people who run Hadoop 1 and MR in this day and age (and more so as time passes) are people who either don’t care about perf and new features, only stability; so, a stability-focused branch would be perfect to support them. On 15/5/18, 10:04, Edward Capriolo edlinuxg...@gmail.com wrote: Up until recently Hive supported numerous versions of the Hadoop code base with a simple shim layer. I would rather we stick to the shim layer. I think this was easily the best part about
Re: [DISCUSS] Supporting Hadoop-1 and experimental features
Alan, your email client is not compatible with gmail viewer. For some reason your reply contains the whole thread of the discussion. On May 22, 2015 10:58 AM, Alan Gates alanfga...@gmail.com wrote: I don't think anyone is advocating for option 2, as that would be disastrous. Option 3 is closest to what I'm proposing, though again dropping support for Hadoop 1 is only a part of it. Alan. Alexander Pivovarov apivova...@gmail.com May 22, 2015 at 10:03 Looks like we are discussing 3 options: 1. Support hadoop 1, 2 and 3 in master branch. 2. Support hadoop 1 in branch-1, hadoop 2 in branch-2, hadoop 3 in branch-3 3. Support hadoop 2 and 3 in master I do NOT think option 2 is a good solution because it is much more difficult to manage 3 active prod branches rather than one master branch. I think we should go with options 1 or 3. +1 on Xuefu's and Edward's opinions Sergey Shelukhin ser...@hortonworks.com May 22, 2015 at 9:08 I think branch-2 doesn’t need to be framed as particularly adventurous (other than due to the general increase of the amount of work done in Hive by the community). All the new features that normally go on trunk/master will go to branch-2. branch-2 is just trunk as it is now, in fact there will be no branch-2, just master :) The difference is the dropped functionality, not the added one. So you shouldn’t lose stability if you retain the same process as now by just staying on versions off master. Perhaps, as is usually the case in Apache projects, developing features on older branches would be discouraged. Right now, all features usually go on trunk/master, and are then back ported as needed and practical; so you wouldn’t (in Apache) make a feature on Hive 0.14 to be released in 0.14.N, and not back port to master. Chris Drome cdr...@yahoo-inc.com.INVALID May 22, 2015 at 0:49 I understand the motivation and benefits of creating a branch-2 where more disruptive work can go on without affecting branch-1. 
While not necessarily against this approach, from Yahoo's standpoint, I do have some questions (concerns). Upgrading to a new version of Hive requires a significant commitment of time and resources to stabilize and certify a build for deployment to our clusters. Given the size of our clusters and scale of datasets, we have to be particularly careful about adopting new functionality. However, at the same time we are interested in new testing and making available new features and functionality. That said, we would have to rely on branch-1 for the immediate future. One concern is that branch-1 would be left to stagnate, at which point there would be no option but for users to move to branch-2 as branch-1 would be effectively end-of-lifed. I'm not sure how long this would take, but it would eventually happen as a direct result of the very reason for creating branch-2. A related concern is how disruptive the code changes will be in branch-2. I imagine that changes early in branch-2 will be easy to backport to branch-1, while this effort will become more difficult, if not impractical, as time goes on. If the code bases diverge too much then this could lead to more pressure for users of branch-1 to add features just to branch-1, which has been mentioned as undesirable. By the same token, backporting any code in branch-2 will require an increasing amount of effort, which contributors to branch-2 may not be interested in committing to. These questions affect us directly because, while we require a certain amount of stability, we also like to pull in new functionality that will be of value to our users. For example, our current 0.13 release is probably closer to 0.14 at this point. Given the lifespan of a release, it is often more palatable to backport features and bugfixes than to jump to a new version. The good thing about this proposal is the opportunity to evaluate and clean up a lot of the old code. 
Thanks, chris On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin ser...@hortonworks.com wrote: Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some people are set in their ways or have practical considerations and don’t care for new shiny stuff. Sergey Shelukhin ser...@hortonworks.com May 18, 2015 at 11:47 Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some people are set in their ways or have practical considerations and don’t care for new shiny stuff. Sergey Shelukhin ser...@hortonworks.com May 18, 2015 at 11:46 I think we need some path for deprecating old Hadoop versions, the same way we deprecate old Java version support or old RDBMS version support. At some point the cost of supporting Hadoop 1 exceeds the benefit. Same goes for stuff like MR; supporting it, esp. for perf work, becomes a burden, and it’s outdated with 2 alternatives, one of which has been around for 2 releases. The branches are a graceful way to get rid of the
Re: [DISCUSS] Supporting Hadoop-1 and experimental features
I don't think anyone is advocating for option 2, as that would be disastrous. Option 3 is closest to what I'm proposing, though again dropping support for Hadoop 1 is only a part of it. Alan. Alexander Pivovarov apivova...@gmail.com May 22, 2015 at 10:03 Looks like we are discussing 3 options: 1. Support hadoop 1, 2 and 3 in master branch. 2. Support hadoop 1 in branch-1, hadoop 2 in branch-2, hadoop 3 in branch-3 3. Support hadoop 2 and 3 in master I do NOT think option 2 is a good solution because it is much more difficult to manage 3 active prod branches rather than one master branch. I think we should go with options 1 or 3. +1 on Xuefu's and Edward's opinions Sergey Shelukhin ser...@hortonworks.com May 22, 2015 at 9:08 I think branch-2 doesn’t need to be framed as particularly adventurous (other than due to the general increase of the amount of work done in Hive by the community). All the new features that normally go on trunk/master will go to branch-2. branch-2 is just trunk as it is now, in fact there will be no branch-2, just master :) The difference is the dropped functionality, not the added one. So you shouldn’t lose stability if you retain the same process as now by just staying on versions off master. Perhaps, as is usually the case in Apache projects, developing features on older branches would be discouraged. Right now, all features usually go on trunk/master, and are then back ported as needed and practical; so you wouldn’t (in Apache) make a feature on Hive 0.14 to be released in 0.14.N, and not back port to master. Chris Drome cdr...@yahoo-inc.com.INVALID May 22, 2015 at 0:49 I understand the motivation and benefits of creating a branch-2 where more disruptive work can go on without affecting branch-1. While not necessarily against this approach, from Yahoo's standpoint, I do have some questions (concerns). 
Upgrading to a new version of Hive requires a significant commitment of time and resources to stabilize and certify a build for deployment to our clusters. Given the size of our clusters and scale of datasets, we have to be particularly careful about adopting new functionality. However, at the same time we are interested in testing and making available new features and functionality. That said, we would have to rely on branch-1 for the immediate future. One concern is that branch-1 would be left to stagnate, at which point there would be no option but for users to move to branch-2, as branch-1 would be effectively end-of-lifed. I'm not sure how long this would take, but it would eventually happen as a direct result of the very reason for creating branch-2. A related concern is how disruptive the code changes will be in branch-2. I imagine that changes made early in branch-2 will be easy to backport to branch-1, while this effort will become more difficult, if not impractical, as time goes on. If the code bases diverge too much then this could lead to more pressure for users of branch-1 to add features just to branch-1, which has been mentioned as undesirable. By the same token, backporting any code from branch-2 will require an increasing amount of effort, which contributors to branch-2 may not be interested in committing to. These questions affect us directly because, while we require a certain amount of stability, we also like to pull in new functionality that will be of value to our users. For example, our current 0.13 release is probably closer to 0.14 at this point. Given the lifespan of a release, it is often more palatable to backport features and bugfixes than to jump to a new version. The good thing about this proposal is the opportunity to evaluate and clean up a lot of the old code. 
Thanks, chris
Re: [DISCUSS] Supporting Hadoop-1 and experimental features
Thanks for your feedback, Chris. It sounds like there are a couple of reasonable concerns being voiced repeatedly: 1) Fragmentation: the two branches will drift too far apart. 2) Stagnation: branch-1 will effectively become a dead end. So I'd modify the proposal as follows to deal with those: 1) New features must be put into master. Whether to put them into branch-1 is at the discretion of the developer. The exception would be features that would not apply in master (e.g. say someone developed a way to double the speed of the MapReduce jobs Hive produces). For example, I might choose to put the materialized view work I'm doing in both branch-1 and master, but the HBase metastore work only in master. This should avoid fragmentation by keeping branch-1 a subset of master. 2) For the next 12 months we will port critical bug fixes (crashes, security issues, wrong results) to branch-1 as well as fixing them on master. We might choose to lengthen this time depending on how stable master is and how fast the uptake is. This avoids branch-1 being immediately abandoned by developers while users are still depending on it. Alan. Chris Drome cdr...@yahoo-inc.com.INVALID May 22, 2015 at 0:49 I understand the motivation and benefits of creating a branch-2 where more disruptive work can go on without affecting branch-1. While not necessarily against this approach, from Yahoo's standpoint, I do have some questions (concerns). Upgrading to a new version of Hive requires a significant commitment of time and resources to stabilize and certify a build for deployment to our clusters. Given the size of our clusters and scale of datasets, we have to be particularly careful about adopting new functionality. However, at the same time we are interested in testing and making available new features and functionality. That said, we would have to rely on branch-1 for the immediate future. 
Thanks, chris On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin ser...@hortonworks.com wrote: Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some people are set in their ways or have practical considerations and don’t care for new shiny stuff. Sergey Shelukhin ser...@hortonworks.com May 18, 2015 at 11:47 Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some people are set in their ways or have practical considerations and don’t care for new shiny stuff. 
Sergey Shelukhin ser...@hortonworks.com May 18, 2015 at 11:46 I think we need some path for deprecating old Hadoop versions, the same way we deprecate old Java version support or old RDBMS version support. At some point the cost of supporting Hadoop 1 exceeds the benefit. The same goes for stuff like MR; supporting it, esp. for perf work, becomes a burden, and it’s outdated, with 2 alternatives, one of which has been around for 2 releases. The branches are a graceful way to get rid of the legacy burden. Alternatively, when sweeping changes are made, we can do what HBase did (which is not pretty imho), where the 0.94 version had ~30 dot releases because people cannot upgrade to the 0.96 “singularity” release. I posit that people who run Hadoop 1 and MR in this day and age (and more so as time passes) are people who don’t care about perf and new features, only stability; so a stability-focused branch would be perfect to support them. Edward Capriolo edlinuxg...@gmail.com May 18, 2015 at 10:04 Up until recently Hive supported numerous versions of the Hadoop code base with a simple shim layer. I would rather we stick to the shim layer. I
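The shim layer Edward refers to isolates version-specific Hadoop calls behind a single interface, so the rest of Hive compiles against one API regardless of which Hadoop version is on the classpath. The following is a minimal sketch of that pattern for illustration only; the names `HadoopShim`, `ShimLoader`, and `jobTrackerConfKey` are hypothetical placeholders, not Hive's actual shim API.

```java
// Illustrative shim pattern (hypothetical names, not Hive's real API):
// version-specific details live behind one interface, and a loader
// picks the right implementation at runtime.
interface HadoopShim {
    // Example of a detail that differs across Hadoop versions:
    // the config key naming the cluster's job-scheduling endpoint.
    String jobTrackerConfKey();
}

class Hadoop1Shim implements HadoopShim {
    public String jobTrackerConfKey() { return "mapred.job.tracker"; }
}

class Hadoop2Shim implements HadoopShim {
    public String jobTrackerConfKey() { return "yarn.resourcemanager.address"; }
}

final class ShimLoader {
    // Select the shim once, based on the detected Hadoop major version;
    // callers never reference Hadoop1Shim/Hadoop2Shim directly.
    static HadoopShim load(String hadoopVersion) {
        return hadoopVersion.startsWith("1.") ? new Hadoop1Shim()
                                              : new Hadoop2Shim();
    }
}

public class ShimDemo {
    public static void main(String[] args) {
        HadoopShim shim = ShimLoader.load("2.6.0");
        System.out.println(shim.jobTrackerConfKey());
    }
}
```

The trade-off being debated in the thread follows directly from this pattern: the shim keeps one code base working across Hadoop 1 and 2, but every new feature must be expressible through the shim interface, which is the maintenance burden Sergey describes.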