[jira] [Created] (HIVE-25150) Tab characters are not removed before decimal conversion, similar to the space character which was fixed as part of HIVE-24378
Taraka Rama Rao Lethavadla created HIVE-25150:
-
Summary: Tab characters are not removed before decimal conversion, similar to the space character which was fixed as part of HIVE-24378
Key: HIVE-25150
URL: https://issues.apache.org/jira/browse/HIVE-25150
Project: Hive
Issue Type: Bug
Components: Hive
Affects Versions: 4.0.0
Reporter: Taraka Rama Rao Lethavadla

Test case: column values padded with a space (row 2) and a tab (row 3)
{noformat}
bash-4.2$ cat data/files/test_dec_space.csv
1,0
2, 1
3,	2
{noformat}
{noformat}
create external table test_dec_space (id int, value decimal) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location '/tmp/test_dec_space';
{noformat}
The output of select * from test_dec_space would be
{noformat}
1	0
2	1
3	NULL
{noformat}
The behaviour in MySQL/MariaDB when there are tab & space characters in decimal values:
{noformat}
bash-4.2$ cat /tmp/insert.csv
"1","aa",11.88
"2","bb", 99.88
"4","dd", 209.88

MariaDB [test]> load data local infile '/tmp/insert.csv' into table t2 fields terminated by ',' ENCLOSED BY '"' LINES TERMINATED BY '\n';
Query OK, 3 rows affected, 3 warnings (0.00 sec)
Records: 3  Deleted: 0  Skipped: 0  Warnings: 3

MariaDB [test]> select * from t2;
+----+------+-------+
| id | name | score |
+----+------+-------+
|  1 | aa   |    12 |
|  2 | bb   |   100 |
|  4 | dd   |   210 |
+----+------+-------+
3 rows in set (0.00 sec)
{noformat}
So Hive should not return NULL here.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
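The expected conversion semantics can be sketched outside Hive. This is an illustrative Python model (not Hive code) of what HIVE-24378 did for the space character, extended to all leading/trailing whitespace including tabs:

```python
from decimal import Decimal, InvalidOperation

def to_decimal(field):
    """Convert a delimited-text field to a decimal, stripping leading and
    trailing whitespace (spaces and tabs alike) first; return None (NULL)
    if the trimmed value still does not parse."""
    try:
        return Decimal(field.strip())  # str.strip() removes spaces and tabs
    except InvalidOperation:
        return None

# With trimming, the space-padded " 1" and the tab-padded "\t2" both
# convert, instead of the tab case falling back to NULL.
```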
[jira] [Created] (HIVE-26368) DISTINCT keyword from Count UDF is removed from query plan when CBO is enabled
Taraka Rama Rao Lethavadla created HIVE-26368:
-
Summary: DISTINCT keyword from Count UDF is removed from query plan when CBO is enabled
Key: HIVE-26368
URL: https://issues.apache.org/jira/browse/HIVE-26368
Project: Hive
Issue Type: Bug
Reporter: Taraka Rama Rao Lethavadla

{*}Reproduction steps{*}:
_cat ql/src/test/queries/clientpositive/ctas_distinct.q_
{noformat}
create database test_db;
create table test_db.test_tb (col1 string, col2 int, col3 int, col4 date);
insert into test_db.test_tb values('a',1,2,'2022-01-01');
insert into test_db.test_tb values('a',11,2,'2022-01-02');
insert into test_db.test_tb values('a',1,2,'2022-01-01');
insert into test_db.test_tb values('a',1,22,'2022-01-02');
insert into test_db.test_tb values('a',11,2,'2022-01-01');
set hive.log.explain.output=true;
create table test as
SELECT col1, col2, COUNT(DISTINCT col3, col4) AS susp_visit_count
FROM test_db.test_tb
GROUP BY col1, col2;
{noformat}
When we run the above test case with
{noformat}
mvn install -Pitests -pl itests/qtest -Dtest=TestMiniLlapLocalCliDriver -Dqfile=ctas_distinct.q -Dtest.output.overwrite
{noformat}
the below exception is thrown:
{noformat}
2022-07-04T09:22:02,949 ERROR [76039186-5579-4a9b-b787-6d92083f1bb9 main] parse.CalcitePlanner: CBO failed, skipping CBO.
org.apache.hadoop.hive.ql.exec.UDFArgumentException: DISTINCT keyword must be specified
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount.getEvaluator(GenericUDAFCount.java:73) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getGenericUDAFEvaluator(FunctionRegistry.java:1255) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator2(SemanticAnalyzer.java:4974) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getGenericUDAFEvaluator(SemanticAnalyzer.java:4966) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:5651) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggrNoSkew(SemanticAnalyzer.java:6578) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:11077) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12009) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11875) ~[hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:631) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12714) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:460) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:317) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:224) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:106) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:507) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:459) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:424) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:418) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:121) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:227) [hive-exec-4.0.0-alpha-2-SNAPSHOT.jar:4.0.0-alpha-2-SNAPSHOT]
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255) [hive-cli-4.0.0-alpha-2-SNAPSHOT.jar:?] at
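For reference, the result the failing aggregate should produce (independent of CBO) can be modeled directly. A small Python sketch of COUNT(DISTINCT col3, col4) per GROUP BY col1, col2 over the five inserted rows:

```python
from collections import defaultdict

rows = [  # (col1, col2, col3, col4) as inserted by the .q file
    ("a", 1, 2, "2022-01-01"),
    ("a", 11, 2, "2022-01-02"),
    ("a", 1, 2, "2022-01-01"),   # exact duplicate of the first row
    ("a", 1, 22, "2022-01-02"),
    ("a", 11, 2, "2022-01-01"),
]

groups = defaultdict(set)
for col1, col2, col3, col4 in rows:
    # COUNT(DISTINCT col3, col4) counts unique (col3, col4) pairs per group
    groups[(col1, col2)].add((col3, col4))

susp_visit_count = {k: len(v) for k, v in sorted(groups.items())}
# {('a', 1): 2, ('a', 11): 2}
```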
[jira] [Created] (HIVE-26388) ClassCastException when there is decimal type column in source table of CTAS query
Taraka Rama Rao Lethavadla created HIVE-26388:
-
Summary: ClassCastException when there is decimal type column in source table of CTAS query
Key: HIVE-26388
URL: https://issues.apache.org/jira/browse/HIVE-26388
Project: Hive
Issue Type: Bug
Reporter: Taraka Rama Rao Lethavadla

Steps to reproduce:
_cat ql/src/test/queries/clientpositive/ctas_open_csv_serde.q_
{noformat}
create table T1(abc decimal(10,0));
insert into table T1 values(1.25);
create table T2
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ("separatorChar" = ',' , "quoteChar" = '"')
stored as textfile
as select * from T1;
{noformat}
Then execute the test case with the below command:
{noformat}
mvn install -Pitests -pl itests/qtest -Dtest=TestMiniLlapLocalCliDriver -Dqfile=ctas_open_csv_serde.q -Dtest.output.overwrite
{noformat}
The exception trace looks like below:
{noformat}
[ERROR] TestMiniLlapLocalCliDriver.testCliDriver:62 Client execution failed with error code = 2 running create table T2 row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties ("separatorChar" = ',' , "quoteChar" = '"') stored as textfile as select * from T1 fname=ctas_open_csv_serde.q
See ./ql/target/tmp/log/hive.log or ./itests/qtest/target/tmp/log/hive.log, or check ./ql/target/surefire-reports or ./itests/qtest/target/surefire-reports/ for specific test cases logs.
org.apache.hadoop.hive.ql.metadata.HiveException: Vertex failed, vertexName=Map 1, vertexId=vertex_1657718574697_0001_2_00, diagnostics=[Task failed, taskId=task_1657718574697_0001_2_00_00, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1657718574697_0001_2_00_00_0:java.lang.RuntimeException: java.lang.RuntimeException: Hive Runtime Error while closing operators
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:276)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381)
	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82)
	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69)
	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Hive Runtime Error while closing operators
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:483)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:310)
	... 15 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableHiveDecimalObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector
	at org.apache.hadoop.hive.serde2.OpenCSVSerde.serialize(OpenCSVSerde.java:119)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:1116)
	at org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator.process(VectorFileSinkOperator.java:111)
	at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:931)
	at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:158)
	at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:171)
	at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.closeOp(VectorMapOperator.java:1010)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:686)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:459)
	... 16 more
], TaskAttempt 1 failed, info=[Error: Error while running task ( failure ) : attempt_1657718574697_0001_2_00_00_1:java.lang.RuntimeException: java.lang.RuntimeException: Hive Runtime Error while closing
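The cast in the trace reflects OpenCSVSerde's contract: it serializes every column through a string ObjectInspector, so a CTAS that hands it a decimal column fails. A minimal Python analogue of that contract (illustrative, not Hive code; a plausible workaround is casting non-string columns to string in the SELECT):

```python
from decimal import Decimal

def csv_serialize(row):
    """Mimic a serializer that, like OpenCSVSerde, accepts only string
    fields; anything else is the analogue of the ClassCastException."""
    for field in row:
        if not isinstance(field, str):
            raise TypeError(f"expected string field, got {type(field).__name__}")
    return ",".join(row)

# csv_serialize([Decimal("1.25")]) raises TypeError, while
# csv_serialize([str(Decimal("1.25"))]) succeeds.
```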
[jira] [Created] (HIVE-26889) Implement array_join udf to concatenate the elements of an array with a specified delimiter
Taraka Rama Rao Lethavadla created HIVE-26889: - Summary: Implement array_join udf to concatenate the elements of an array with a specified delimiter Key: HIVE-26889 URL: https://issues.apache.org/jira/browse/HIVE-26889 Project: Hive Issue Type: Sub-task Components: Hive Reporter: Taraka Rama Rao Lethavadla -- This message was sent by Atlassian Jira (v8.20.10#820010)
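The description does not spell out the semantics; assuming behavior in line with the equivalent function elsewhere (e.g. Spark's array_join, where NULL elements are dropped unless a replacement is supplied), a Python sketch of the intended contract:

```python
def array_join(arr, delimiter, null_replacement=None):
    """Illustrative semantics for the proposed array_join UDF: join the
    elements of an array with a delimiter. NULL (None) elements are
    skipped unless a null_replacement is given (assumed behavior)."""
    parts = []
    for element in arr:
        if element is None:
            if null_replacement is not None:
                parts.append(str(null_replacement))
        else:
            parts.append(str(element))
    return delimiter.join(parts)

# array_join(["a", None, "b"], ",")         -> "a,b"
# array_join(["a", None, "b"], ",", "NULL") -> "a,NULL,b"
```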
[jira] [Created] (HIVE-26908) Disable the Initiator on an HMS instance while enabling the Cleaner thread
Taraka Rama Rao Lethavadla created HIVE-26908:
-
Summary: Disable the Initiator on an HMS instance while enabling the Cleaner thread
Key: HIVE-26908
URL: https://issues.apache.org/jira/browse/HIVE-26908
Project: Hive
Issue Type: New Feature
Components: Standalone Metastore
Reporter: Taraka Rama Rao Lethavadla
Assignee: Taraka Rama Rao Lethavadla

In the current implementation, both the Initiator and the Cleaner are enabled or disabled together using the same config:
{noformat}
hive.compactor.initiator.on
{noformat}
So there is no way to selectively disable the Initiator and enable the Cleaner, or vice versa. Proposing to introduce another config to control the Cleaner thread alone, like hive.compactor.cleaner.on.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
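A sketch of the proposed split, in Python for illustration (hive.compactor.cleaner.on is the config proposed here, not an existing Hive property):

```python
def should_run(thread, conf):
    """Decide whether a compactor background thread starts. Today both
    threads key off hive.compactor.initiator.on; the proposal gives the
    Cleaner its own switch."""
    key = {
        "initiator": "hive.compactor.initiator.on",
        "cleaner": "hive.compactor.cleaner.on",   # proposed new config
    }[thread]
    return conf.get(key, "false").lower() == "true"

conf = {
    "hive.compactor.initiator.on": "false",
    "hive.compactor.cleaner.on": "true",
}
# With the split config, the Initiator stays off while the Cleaner runs.
```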
[jira] [Created] (HIVE-26898) Split Notification logging so that busy clusters can have better performance
Taraka Rama Rao Lethavadla created HIVE-26898:
-
Summary: Split Notification logging so that busy clusters can have better performance
Key: HIVE-26898
URL: https://issues.apache.org/jira/browse/HIVE-26898
Project: Hive
Issue Type: New Feature
Reporter: Taraka Rama Rao Lethavadla

DDL & DML events are logged into the notification log table and are cleaned as soon as their TTL expires. In most busy clusters, the notification log keeps growing even though the cleaner is running and keeps cleaning events: the rate of Hive DB operations is much higher than the rate at which cleaning happens. So any query on this table becomes a bottleneck at the backend DB, causing slow responses.

The proposal is to split the notification log table into multiple tables, like
notification_log_dml - for all DML queries
notification_log_insert - for all insert queries
... etc.
so that the load on that single table is reduced, improving the performance of the backend DB as well as Hive.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
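The proposed split amounts to routing each event type to its own table. An illustrative Python sketch (the two table names come from the examples above; the rest of the mapping is an assumption):

```python
def notification_table(event_type):
    """Pick the per-category notification log table for an event.
    Illustrative routing only: categories beyond the two named in the
    proposal are hypothetical, with a catch-all fallback."""
    routing = {
        "INSERT": "notification_log_insert",
        "UPDATE": "notification_log_dml",
        "DELETE": "notification_log_dml",
    }
    return routing.get(event_type, "notification_log")
```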
[jira] [Created] (HIVE-26897) Provide a command/tool to recover data in ACID table when table data got corrupted with invalid/junk delta/delete_delta folders
Taraka Rama Rao Lethavadla created HIVE-26897:
-
Summary: Provide a command/tool to recover data in ACID table when table data got corrupted with invalid/junk delta/delete_delta folders
Key: HIVE-26897
URL: https://issues.apache.org/jira/browse/HIVE-26897
Project: Hive
Issue Type: New Feature
Reporter: Taraka Rama Rao Lethavadla

Example: a table has the below directories
{noformat}
drwx-- - hive hive               0 2022-11-05 19:43 /data/warehouse/tbl/delete_delta_0080483_0087704_v0973185
drwx-- - hive pdl_prod_nosh_jsin 0 2022-12-05 00:18 /data/warehouse/tbl/delete_delta_0080483_0088384_v507
{noformat}
When we read data from this table, we get the below error:
{noformat}
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Duplicate key null (attempted merging values org.apache.hadoop.hive.ql.io.AcidInputFormat$DeltaFileMetaData@41776cd9 and org.apache.hadoop.hive.ql.io.AcidInputFormat$DeltaFileMetaData@1404a054)
{noformat}
delete_delta_0080483_0087704_v0973185 and delete_delta_0080483_0088384_v507 were created as part of minor compaction. In general, once a minor compaction completes, the next minor compaction picks a min_writeId greater than the max_writeId of the previously compacted directory. In this case, both minor-compacted directories have the same min_writeId (i.e. 0080483).

To mitigate the issue, we had to remove those directories manually from HDFS, create a fresh table from the remaining data, drop the actual table, and rename the fresh table to the actual table.

*Proposal*
Create a tool/command that reads the data from the corrupted ACID table to recover it before we make any changes to the underlying data, so that we can work around the problem by creating another table with the same data.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
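The corruption signature above (two delete_delta directories whose writeId ranges overlap) is mechanically detectable. A hypothetical Python check a recovery tool could start from:

```python
import re

def overlapping_deltas(dirs):
    """Flag delete_delta directories whose [min_writeId, max_writeId]
    ranges overlap -- the corruption signature described above.
    Illustrative check only, not part of any Hive tool."""
    pat = re.compile(r"delete_delta_(\d+)_(\d+)")
    ranges = []
    for d in dirs:
        m = pat.search(d)
        if m:
            ranges.append((int(m.group(1)), int(m.group(2)), d))
    ranges.sort()
    bad = []
    for (lo1, hi1, d1), (lo2, hi2, d2) in zip(ranges, ranges[1:]):
        if lo2 <= hi1:  # next range starts before the previous one ends
            bad.append((d1, d2))
    return bad

suspects = overlapping_deltas([
    "/data/warehouse/tbl/delete_delta_0080483_0087704_v0973185",
    "/data/warehouse/tbl/delete_delta_0080483_0088384_v507",
])
# -> one overlapping pair, matching the broken table in this issue
```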
[jira] [Created] (HIVE-26827) Add configs to workaround predicate issue with Parquet on TIMESTAMP data type
Taraka Rama Rao Lethavadla created HIVE-26827:
-
Summary: Add configs to workaround predicate issue with Parquet on TIMESTAMP data type
Key: HIVE-26827
URL: https://issues.apache.org/jira/browse/HIVE-26827
Project: Hive
Issue Type: Improvement
Reporter: Taraka Rama Rao Lethavadla
Assignee: Taraka Rama Rao Lethavadla

The below query fails:
{noformat}
select * from db.parquet_table_with_timestamp where created_date_utc between '2022-11-05 00:01:01' and '2022-11-08 23:59:59'
{noformat}
with the error
{noformat}
2022-11-10 06:43:36,751 [ERROR] [TezChild] |read.ParquetFilterPredicateConverter|: fail to build predicate filter leaf with errors org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for TIMESTAMP
{noformat}
We can work around the issue by setting the below configs at session level:
# set hive.optimize.index.filter=false;
# set hive.optimize.ppd=false;

As part of this Jira, proposing to add this config info to the above error message so that whoever encounters this problem can try the workaround.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26833) Update error message related to dynamic partitioning with workaround
Taraka Rama Rao Lethavadla created HIVE-26833:
-
Summary: Update error message related to dynamic partitioning with workaround
Key: HIVE-26833
URL: https://issues.apache.org/jira/browse/HIVE-26833
Project: Hive
Issue Type: Improvement
Reporter: Taraka Rama Rao Lethavadla
Assignee: Taraka Rama Rao Lethavadla

We see the below error message as part of an insert query where dynamic partitioning is expected to happen:
{noformat}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to 2000 partitions per node, number of dynamic partitions on this node: 2001
{noformat}
As part of this Jira, planning to add another workaround to the message, like splitting the query into multiple queries based on a filter condition.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26774) Implement array_slice UDF to get the subset of elements from an array (subarray)
Taraka Rama Rao Lethavadla created HIVE-26774: - Summary: Implement array_slice UDF to get the subset of elements from an array (subarray) Key: HIVE-26774 URL: https://issues.apache.org/jira/browse/HIVE-26774 Project: Hive Issue Type: Sub-task Components: Hive Reporter: Taraka Rama Rao Lethavadla Assignee: Taraka Rama Rao Lethavadla -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26792) Add index to PARTITION_KEY_VALS table to speed up partition operations
Taraka Rama Rao Lethavadla created HIVE-26792:
-
Summary: Add index to PARTITION_KEY_VALS table to speed up partition operations
Key: HIVE-26792
URL: https://issues.apache.org/jira/browse/HIVE-26792
Project: Hive
Issue Type: Improvement
Components: Hive
Reporter: Taraka Rama Rao Lethavadla
Assignee: Taraka Rama Rao Lethavadla

Better performance can be achieved for partition-related queries when we have an index like KEY `idx_PART_KEY_VAL` (`PART_KEY_VAL`)

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26744) Add array_max and array_min UDFs
Taraka Rama Rao Lethavadla created HIVE-26744: - Summary: Add array_max and array_min UDFs Key: HIVE-26744 URL: https://issues.apache.org/jira/browse/HIVE-26744 Project: Hive Issue Type: Sub-task Components: Hive Reporter: Taraka Rama Rao Lethavadla Assignee: Taraka Rama Rao Lethavadla -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26754) Implement array_distinct UDF to return an array after removing duplicates in it
Taraka Rama Rao Lethavadla created HIVE-26754: - Summary: Implement array_distinct UDF to return an array after removing duplicates in it Key: HIVE-26754 URL: https://issues.apache.org/jira/browse/HIVE-26754 Project: Hive Issue Type: Sub-task Components: Hive Reporter: Taraka Rama Rao Lethavadla Assignee: Taraka Rama Rao Lethavadla -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26784) Provide an option to skip malformed JSON entries and process the rest of the data in case of the JSON SerDe
Taraka Rama Rao Lethavadla created HIVE-26784:
-
Summary: Provide an option to skip malformed JSON entries and process the rest of the data in case of the JSON SerDe
Key: HIVE-26784
URL: https://issues.apache.org/jira/browse/HIVE-26784
Project: Hive
Issue Type: Improvement
Components: Hive
Reporter: Taraka Rama Rao Lethavadla

We may want to create Hive tables on top of audit data generated by Ranger (not only HDFS audit but other services as well) to be able to analyze it. In this case, some audit records may become corrupt. Improving the Hive built-in JSON SerDe to provide an option (in tblproperties and via session configurations, for example) to allow ignoring malformed records (or to produce empty rows) would help users achieve this goal without using third-party/custom SerDes. For example, see ignore.malformed.json: [https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/#:~:text=you%20can%20ignore%20malformed%20records]

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26952) set the value of metastore.storage.schema.reader.impl to org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader as default
Taraka Rama Rao Lethavadla created HIVE-26952:
-
Summary: set the value of metastore.storage.schema.reader.impl to org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader as default
Key: HIVE-26952
URL: https://issues.apache.org/jira/browse/HIVE-26952
Project: Hive
Issue Type: Improvement
Components: Standalone Metastore
Reporter: Taraka Rama Rao Lethavadla
Assignee: Taraka Rama Rao Lethavadla

With the default value of
{code:java}
DefaultStorageSchemaReader.class.getName()
{code}
for the Metastore config *metastore.storage.schema.reader.impl*, the below exception is thrown when trying to read an Avro schema:
{noformat}
Caused by: org.apache.hive.service.cli.HiveSQLException: MetaException (message:java.lang.UnsupportedOperationException: Storage schema reading not supported)
	at org.apache.hive.service.cli.operation.GetColumnsOperation.runInternal(GetColumnsOperation.java:213)
	at org.apache.hive.service.cli.operation.Operation.run(Operation.java:247)
	at org.apache.hive.service.cli.session.HiveSessionImpl.getColumns(HiveSessionImpl.java:729)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
	at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
	at org.apache.hive.service.cli.session.HiveSessionProxy.run(HiveSessionProxy.java:63)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
	at com.sun.proxy..getColumns(Unknown Source)
	at org.apache.hive.service.cli.CLIService.getColumns(CLIService.java:390)
{noformat}
Setting the above config to *org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader* resolves the issue.

Proposing to make this value the default in the code base, so that in upcoming versions we don't have to set it manually.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26955) Alter table change column data type of a Parquet table throws exception
Taraka Rama Rao Lethavadla created HIVE-26955:
-
Summary: Alter table change column data type of a Parquet table throws exception
Key: HIVE-26955
URL: https://issues.apache.org/jira/browse/HIVE-26955
Project: Hive
Issue Type: Bug
Components: HiveServer2
Reporter: Taraka Rama Rao Lethavadla

Steps to reproduce:
{noformat}
create table test_parquet (id decimal) stored as parquet;
insert into test_parquet values(238);
alter table test_parquet change id id string;
select * from test_parquet;

Error: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs:/namenode:8020/warehouse/tablespace/managed/hive/test_parquet/delta_001_001_/00_0 (state=,code=0)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:624)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:531)
	at org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:194)
	... 55 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/home/centos/Apache-Hive-Tarak/itests/qtest/target/localfs/warehouse/test_parquet/00_0
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:255)
	at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:87)
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89)
	at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:771)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:335)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:562)
	... 57 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo cannot be cast to org.apache.hadoop.hive.serde2.typeinfo.DecimalTypeInfo
	at org.apache.hadoop.hive.ql.io.parquet.convert.ETypeConverter$8$5.convert(ETypeConverter.java:669)
	at org.apache.hadoop.hive.ql.io.parquet.convert.ETypeConverter$8$5.convert(ETypeConverter.java:664)
	at org.apache.hadoop.hive.ql.io.parquet.convert.ETypeConverter$BinaryConverter.addBinary(ETypeConverter.java:977)
	at org.apache.parquet.column.impl.ColumnReaderBase$2$6.writeValue(ColumnReaderBase.java:360)
	at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:410)
	at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
	at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
	... 63 more
{noformat}
However the same works as expected with an ORC table:
{noformat}
create table test_orc (id decimal) stored as orc;
insert into test_orc values(238);
alter table test_orc change id id string;
select * from test_orc;
+--------------+
| test_orc.id  |
+--------------+
| 238          |
+--------------+
{noformat}
As well as with a text table:
{noformat}
create table test_text (id decimal) stored as textfile;
insert into test_text values(238);
alter table test_text change id id string;
select * from test_text;
+---------------+
| test_text.id  |
+---------------+
| 238           |
+---------------+
{noformat}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26700) alter view set owner is throwing exception
Taraka Rama Rao Lethavadla created HIVE-26700:
-
Summary: alter view set owner is throwing exception
Key: HIVE-26700
URL: https://issues.apache.org/jira/browse/HIVE-26700
Project: Hive
Issue Type: Bug
Components: HiveServer2
Reporter: Taraka Rama Rao Lethavadla

Tested in 3.1; we can alter the view with the alter table command itself.

desc formatted test_view_owner;
{noformat}
+-------------------------------+-------------------------------------------------------------------------------------------+----------+
| col_name                      | data_type                                                                                 | comment  |
+-------------------------------+-------------------------------------------------------------------------------------------+----------+
| # col_name                    | data_type                                                                                 | comment  |
| id                            | int                                                                                       |          |
| name                          | string                                                                                    |          |
|                               | NULL                                                                                      | NULL     |
| # Detailed Table Information  | NULL                                                                                      | NULL     |
| Database:                     | default                                                                                   | NULL     |
| OwnerType:                    | USER                                                                                      | NULL     |
| Owner:                        | hive                                                                                      | NULL     |
| CreateTime:                   | Tue Oct 25 20:40:47 GMT+08:00 2022                                                        | NULL     |
| LastAccessTime:               | UNKNOWN                                                                                   | NULL     |
| Retention:                    | 0                                                                                         | NULL     |
| Table Type:                   | VIRTUAL_VIEW                                                                              | NULL     |
| Table Parameters:             | NULL                                                                                      | NULL     |
|                               | bucketing_version                                                                         | 2        |
|                               | transient_lastDdlTime                                                                     | 1666701647 |
|                               | NULL                                                                                      | NULL     |
| # Storage Information         | NULL                                                                                      | NULL     |
| SerDe Library:                | null                                                                                      | NULL     |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat                                           | NULL     |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat                                          | NULL     |
| Compressed:                   | No                                                                                        | NULL     |
| Num Buckets:                  | -1                                                                                        | NULL     |
| Bucket Columns:               | []                                                                                        | NULL     |
| Sort Columns:                 | []                                                                                        | NULL     |
|                               | NULL                                                                                      | NULL     |
| # View Information            | NULL                                                                                      | NULL     |
| Original Query:               | select * from test_owner_hive                                                             | NULL     |
| Expanded Query:               | select `test_owner_hive`.`id`, `test_owner_hive`.`name` from `default`.`test_owner_hive`  | NULL     |
+-------------------------------+-------------------------------------------------------------------------------------------+----------+
{noformat}
alter table test_view_owner set owner user tarak;
desc formatted test_view_owner;
{noformat}
| col_name                      | data_type                          | comment  |
| # col_name                    | data_type                          | comment  |
| id                            | int                                |          |
| name                          | string                             |          |
|                               | NULL                               |
[jira] [Created] (HIVE-26703) Group queries between HMS and backend database and make them available
Taraka Rama Rao Lethavadla created HIVE-26703:
-
Summary: Group queries between HMS and backend database and make them available
Key: HIVE-26703
URL: https://issues.apache.org/jira/browse/HIVE-26703
Project: Hive
Issue Type: New Feature
Components: Standalone Metastore
Reporter: Taraka Rama Rao Lethavadla

While troubleshooting HMS slowness issues, we have to rely on logs to see which component (compaction/stats/client connections) is causing the issue. The proposal is to group the queries executed by HMS by component and expose the number of queries each component is making at a given point in time, like
Compaction - 15 queries
Client/HS2 - 4 queries
Msck - 10 queries
Stats - 80 queries
From the above numbers one could see that stats is consuming the most connections/resources and take action to reduce it.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
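The grouping itself is simple once each backend query is tagged with its originating component. An illustrative Python sketch of the exposed counts (component tags mirror the example breakdown above; the SQL strings are placeholders):

```python
from collections import Counter

def queries_by_component(active):
    """Count in-flight HMS backend queries per originating component,
    given (component, sql) pairs. Illustrative only."""
    return Counter(component for component, _sql in active)

snapshot = queries_by_component([
    ("stats", "UPDATE TAB_COL_STATS ..."),
    ("stats", "SELECT ..."),
    ("compaction", "SELECT ..."),
    ("client/hs2", "SELECT ..."),
])
# snapshot["stats"] == 2 -> stats is the heaviest consumer in this sample
```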
[jira] [Created] (HIVE-26701) Enable metrics for Database connection pools(1 & 2) used by ObjectStore in HMS
Taraka Rama Rao Lethavadla created HIVE-26701:
-
Summary: Enable metrics for Database connection pools (1 & 2) used by ObjectStore in HMS
Key: HIVE-26701
URL: https://issues.apache.org/jira/browse/HIVE-26701
Project: Hive
Issue Type: Bug
Components: Standalone Metastore
Reporter: Taraka Rama Rao Lethavadla

We have metrics enabled for the database connection pools (3 & 4) used in TxnHandler, but not for the pools (1 & 2) used by ObjectStore.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-26729) Add new UDFs to process Array type of data
Taraka Rama Rao Lethavadla created HIVE-26729: - Summary: Add new UDFs to process Array type of data Key: HIVE-26729 URL: https://issues.apache.org/jira/browse/HIVE-26729 Project: Hive Issue Type: Improvement Reporter: Taraka Rama Rao Lethavadla Assignee: Taraka Rama Rao Lethavadla Umbrella Jira to track new UDFs related to Arrays -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27126) Display Yarn resources availability in real time along with Tez job progress
Taraka Rama Rao Lethavadla created HIVE-27126:
-
Summary: Display Yarn resources availability in real time along with Tez job progress
Key: HIVE-27126
URL: https://issues.apache.org/jira/browse/HIVE-27126
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Reporter: Taraka Rama Rao Lethavadla

Right now we don't know whether the queue used for Hive query execution has enough resources. We already display Tez job details like the number of tasks per vertex and how they are progressing. If the available resources are not enough to execute the query in parallel, or a query that used to finish quickly is now slow, this new detail would help us understand that the queue is busy and resources are insufficient.
{noformat}
----------------------------------------------------------------------------------------------
VERTICES      MODE        STATUS   TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1         container   RUNNING    230          0       10      230       0       1
Reducer 2     container   INITED       1          0        0        1       0       0
----------------------------------------------------------------------------------------------
VERTICES: 00/02  [>>--]  0%  ELAPSED TIME: 58.33 s
----------------------------------------------------------------------------------------------
{noformat}
So displaying current resource availability for the specific queue, in real time along with the above info, would help explain why a query is slow.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27163) Column stats not getting published after an insert query into an external table with custom location
Taraka Rama Rao Lethavadla created HIVE-27163: - Summary: Column stats not getting published after an insert query into an external table with custom location Key: HIVE-27163 URL: https://issues.apache.org/jira/browse/HIVE-27163 Project: Hive Issue Type: Bug Components: Hive Reporter: Taraka Rama Rao Lethavadla Test case details are below
*test.q*
{noformat}
set hive.stats.column.autogather=true;
set hive.stats.autogather=true;
dfs ${system:test.dfs.mkdir} ${system:test.tmp.dir}/test;
create external table test_custom(age int, name string) stored as orc location '/tmp/test';
insert into test_custom select 1, 'test';
desc formatted test_custom age;
{noformat}
*test.q.out*
{noformat}
A masked pattern was here
PREHOOK: type: CREATETABLE
A masked pattern was here
PREHOOK: Output: database:default
PREHOOK: Output: default@test_custom
A masked pattern was here
POSTHOOK: type: CREATETABLE
A masked pattern was here
POSTHOOK: Output: database:default
POSTHOOK: Output: default@test_custom
PREHOOK: query: insert into test_custom select 1, 'test'
PREHOOK: type: QUERY
PREHOOK: Input: _dummy_database@_dummy_table
PREHOOK: Output: default@test_custom
POSTHOOK: query: insert into test_custom select 1, 'test'
POSTHOOK: type: QUERY
POSTHOOK: Input: _dummy_database@_dummy_table
POSTHOOK: Output: default@test_custom
POSTHOOK: Lineage: test_custom.age SIMPLE []
POSTHOOK: Lineage: test_custom.name SIMPLE []
PREHOOK: query: desc formatted test_custom age
PREHOOK: type: DESCTABLE
PREHOOK: Input: default@test_custom
POSTHOOK: query: desc formatted test_custom age
POSTHOOK: type: DESCTABLE
POSTHOOK: Input: default@test_custom
col_name            age
data_type           int
min
max
num_nulls
distinct_count
avg_col_len
max_col_len
num_trues
num_falses
bit_vector
comment             from deserializer
{noformat}
As we can see from the desc formatted output, the column stats were not populated
[jira] [Created] (HIVE-27112) implement array_except UDF in Hive
Taraka Rama Rao Lethavadla created HIVE-27112: - Summary: implement array_except UDF in Hive Key: HIVE-27112 URL: https://issues.apache.org/jira/browse/HIVE-27112 Project: Hive Issue Type: Sub-task Reporter: Taraka Rama Rao Lethavadla Assignee: Taraka Rama Rao Lethavadla *array_except(array1, array2)* Returns an array of the elements in {{array1}} but not in {{array2}}.
{noformat}
> SELECT array_except(array(1, 2, 2, 3), array(1, 1, 3, 5));
[2]
{noformat}
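The intended semantics (elements of the first array that are absent from the second, with duplicates removed, as the example above shows) can be sketched in Python. This is an illustration of the expected behaviour, not the actual Hive UDF implementation:

```python
def array_except(arr1, arr2):
    """Elements of arr1 not present in arr2, duplicates removed,
    first-occurrence order preserved (mirrors the ticket's example)."""
    exclude = set(arr2)
    seen = set()
    result = []
    for x in arr1:
        if x not in exclude and x not in seen:
            seen.add(x)
            result.append(x)
    return result

# array_except([1, 2, 2, 3], [1, 1, 3, 5]) -> [2]
```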
[jira] [Created] (HIVE-27118) implement array_intersect UDF in Hive
Taraka Rama Rao Lethavadla created HIVE-27118: - Summary: implement array_intersect UDF in Hive Key: HIVE-27118 URL: https://issues.apache.org/jira/browse/HIVE-27118 Project: Hive Issue Type: Sub-task Components: Hive Reporter: Taraka Rama Rao Lethavadla Assignee: Taraka Rama Rao Lethavadla *array_intersect(array1, array2)* Returns an array of the elements in the intersection of {{array1}} and {{array2}}, without duplicates.
{noformat}
> SELECT array_intersect(array(1, 2, 2, 3), array(1, 1, 3, 5));
[1,3]
{noformat}
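Again, the expected semantics (elements present in both arrays, without duplicates) can be sketched in Python as an illustration; the ordering by first occurrence in the first array is an assumption based on the example above, not a confirmed detail of the Hive implementation:

```python
def array_intersect(arr1, arr2):
    """Elements present in both arrays, without duplicates,
    ordered by first occurrence in arr1 (assumed from the example)."""
    other = set(arr2)
    seen = set()
    result = []
    for x in arr1:
        if x in other and x not in seen:
            seen.add(x)
            result.append(x)
    return result

# array_intersect([1, 2, 2, 3], [1, 1, 3, 5]) -> [1, 3]
```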
[jira] [Created] (HIVE-27227) Provide config to re-enable partitions discovery on external tables
Taraka Rama Rao Lethavadla created HIVE-27227: - Summary: Provide config to re-enable partitions discovery on external tables Key: HIVE-27227 URL: https://issues.apache.org/jira/browse/HIVE-27227 Project: Hive Issue Type: Improvement Components: Hive Reporter: Taraka Rama Rao Lethavadla As part of HIVE-25039, the discovery.partitions config was disabled for external tables by default. Now if someone wants to turn the feature on (knowing the risk), they have to set the config to true for every newly created table. Another use case: if a user wants to enable this feature (knowing the risk) on all existing tables, they have to execute an alter table command for every table, which is very difficult when there are a lot of tables
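For illustration, the per-table workaround today looks something like the following (my_table is a placeholder; the property name follows the discovery.partitions feature referenced above):

{noformat}
-- per-table workaround: re-enable discovery on one table at a time
alter table my_table set tblproperties ('discover.partitions' = 'true');
{noformat}

The proposal is a single config that applies this behaviour globally instead of requiring one such statement per table.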
[jira] [Created] (HIVE-27224) Enhance drop table/partition command
Taraka Rama Rao Lethavadla created HIVE-27224: - Summary: Enhance drop table/partition command Key: HIVE-27224 URL: https://issues.apache.org/jira/browse/HIVE-27224 Project: Hive Issue Type: Improvement Components: Hive, Standalone Metastore Reporter: Taraka Rama Rao Lethavadla {*}Problem Statement{*}: If a table has a large number of partitions, the drop table command takes a long time to finish. To improve the command we have the following proposals
* Perform all the queries (HMS->DB) in drop table in batches (not just on the partitions table) so that a query will not fail with exceptions like "transaction id not found" or other timeout issues, as this is directly proportional to backend database performance
* Display what action is happening as part of drop table, so that the user knows which step is taking more time and how many steps have completed so far. We should have loggers (at least at DEBUG level) in clients reporting how many partitions/batches are being processed and the current iteration, to estimate an approximate timeout for such a large HMS operation
* Support a retry option: if the drop table command fails partway through, the next run should proceed with the remaining operations instead of failing due to missing/stale entries
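The batching idea in the first bullet can be sketched generically in Python. This is a minimal sketch of the proposal, not HMS code: drop_batch, the batch size, and the progress message are all hypothetical:

```python
def drop_in_batches(items, drop_batch, batch_size=1000):
    """Drop items in fixed-size batches so that no single backend-DB
    transaction grows with the total partition count.
    `drop_batch` is a hypothetical callable that deletes one batch."""
    total = len(items)
    for start in range(0, total, batch_size):
        batch = items[start:start + batch_size]
        drop_batch(batch)
        # progress reporting of the kind proposed in the second bullet
        print(f"dropped {min(start + batch_size, total)}/{total}")
```

A retry wrapper (third bullet) would then only need to re-run this loop over whatever items remain, since each completed batch is already gone.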
[jira] [Created] (HIVE-27222) New functionality to show compactions information for a specific table/partition of a given database
Taraka Rama Rao Lethavadla created HIVE-27222: - Summary: New functionality to show compactions information for a specific table/partition of a given database Key: HIVE-27222 URL: https://issues.apache.org/jira/browse/HIVE-27222 Project: Hive Issue Type: Improvement Components: Hive Reporter: Taraka Rama Rao Lethavadla As per the current implementation, the show compactions command lists compaction details of all the partitions and tables of all the databases in a single output. If a user happens to have hundreds or even thousands of databases/tables/partitions, parsing the show compactions output to check the details of a specific table/partition is difficult. So the proposal is to support something like
{noformat}
show compactions `db`.`table`[.`partition`]
{noformat}
[jira] [Created] (HIVE-27072) create an sql query to validate a given table for partitions and list out any discrepancies in files/folders, list out empty files etc
Taraka Rama Rao Lethavadla created HIVE-27072: - Summary: create an sql query to validate a given table for partitions and list out any discrepancies in files/folders, list out empty files etc Key: HIVE-27072 URL: https://issues.apache.org/jira/browse/HIVE-27072 Project: Hive Issue Type: New Feature Components: HiveServer2 Reporter: Taraka Rama Rao Lethavadla There are a couple of issues when partitions are corrupted or have additional unwanted files that interfere with query execution and make it fail. If we run a query like "validate table table_name [partition(partition=a,..)]", the output should list
* any unwanted files, like empty or metadata files (e.g. _SUCCESS)
* any unwanted folders not conforming to the partition naming convention, like test_folder where an actual partition name looks like test=23
* too many staging directories; if we find many, then cleanup is not happening properly after query execution
* any file permission related issues, like the table having one owner and a partition another (optional)
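The naming-convention check in the second bullet could be sketched as a simple key=value pattern test. A minimal illustration, with hypothetical function and folder names; the real validation would run against the table's actual filesystem listing:

```python
import re

# Hive-style partition directories look like key=value (e.g. test=23)
PARTITION_DIR = re.compile(r"^[^/=]+=[^/]*$")

def invalid_partition_dirs(dir_names):
    """Return folder names that do not conform to the key=value
    partition naming convention (e.g. a stray 'test_folder')."""
    return [d for d in dir_names if not PARTITION_DIR.match(d)]

# invalid_partition_dirs(["test=23", "test_folder"]) -> ["test_folder"]
```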
[jira] [Created] (HIVE-27015) Add support to accept client connections with X-CSRF-Token as part of header in http transport mode
Taraka Rama Rao Lethavadla created HIVE-27015: - Summary: Add support to accept client connections with X-CSRF-Token as part of header in http transport mode Key: HIVE-27015 URL: https://issues.apache.org/jira/browse/HIVE-27015 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 4.0.0 Reporter: Taraka Rama Rao Lethavadla Assignee: Taraka Rama Rao Lethavadla Today in HTTP transport mode, clients need to send the *X-XSRF-HEADER* introduced as part of HIVE-13853. This Jira is about adding support to accept connections from clients (e.g. Hue) which send *X-CSRF-Token*
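A minimal sketch of accepting either header: the two header names come from the ticket, but the check itself is illustrative and does not reflect HiveServer2's actual HTTP filter code:

```python
def has_csrf_header(headers):
    """Accept a request if it carries either the existing
    X-XSRF-HEADER (HIVE-13853) or the proposed X-CSRF-Token.
    `headers` is a hypothetical name->value mapping."""
    normalized = {name.lower() for name in headers}
    return "x-xsrf-header" in normalized or "x-csrf-token" in normalized

# has_csrf_header({"X-CSRF-Token": "abc"}) -> True
```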