[jira] [Created] (HIVE-26394) Query based compaction fails for table with more than 6 columns
mahesh kumar behera created HIVE-26394: -- Summary: Query based compaction fails for table with more than 6 columns Key: HIVE-26394 URL: https://issues.apache.org/jira/browse/HIVE-26394 Project: Hive Issue Type: Bug Components: Hive, HiveServer2 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Query-based compaction creates a temp external table whose location points to the location of the table being compacted, so the external table's location contains files in ACID format. When a query is run on this table, the table type is decided by reading the files present at the table location. Because the location holds ACID-compatible files, the table is wrongly assumed to be an ACID table. This causes a failure while generating the SARG columns, as the column count does not match the schema. {code:java} Error doing query based minor compaction org.apache.hadoop.hive.ql.metadata.HiveException: Failed to run INSERT into table delta_cara_pn_tmp_compactor_clean_1656061070392_result select `operation`, `originalTransaction`, `bucket`, `rowId`, `currentTransaction`, `row` from delta_clean_1656061070392 where `originalTransaction` not in (749,750,766,768,779,783,796,799,818,1145,1149,1150,1158,1159,1160,1165,1166,1169,1173,1175,1176,1871,9631) at org.apache.hadoop.hive.ql.DriverUtils.runOnDriver(DriverUtils.java:73) at org.apache.hadoop.hive.ql.txn.compactor.QueryCompactor.runCompactionQueries(QueryCompactor.java:138) at org.apache.hadoop.hive.ql.txn.compactor.MinorQueryCompactor.runCompaction(MinorQueryCompactor.java:70) at org.apache.hadoop.hive.ql.txn.compactor.Worker.findNextCompactionAndExecute(Worker.java:498) at org.apache.hadoop.hive.ql.txn.compactor.Worker.lambda$run$0(Worker.java:120) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: (responseCode = 
2, errorMessage = FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1656061159324__1_00, diagnostics=[Task failed, taskId=task_1656061159324__1_00_00, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1656061159324__1_00_00_0:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 6 at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:277) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.RuntimeException: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 6 at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206) at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.(TezGroupedSplitsInputFormat.java:145) at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111) at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:164) at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83) at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:706)
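{code}

The misclassification described above can be sketched in a few lines (the function below is an illustrative stand-in, not Hive's actual detection code): any reader that infers "ACID-ness" purely from the directory layout will treat the compactor's temp external table as ACID, because it points at the same location as the table being compacted.

```python
import os
import tempfile

def looks_like_acid_layout(table_location):
    # Hypothetical stand-in for layout-based table-type detection: a
    # directory containing base_*/delta_* subdirectories is taken to be
    # an ACID table, regardless of the declared table schema.
    return any(name.startswith(("base_", "delta_"))
               for name in os.listdir(table_location))

acid_table_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(acid_table_dir, "delta_0000001_0000001"))

# The temp external table created by query-based compaction points at the
# SAME location, so it is classified as ACID even though it was declared
# as a plain external table -- the schema/column-count mismatch that later
# breaks SARG generation.
temp_external_table_location = acid_table_dir
assert looks_like_acid_layout(temp_external_table_location)
```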
[jira] [Created] (HIVE-26382) Stats generation fails during CTAS for external partitioned table.
mahesh kumar behera created HIVE-26382: -- Summary: Stats generation fails during CTAS for external partitioned table. Key: HIVE-26382 URL: https://issues.apache.org/jira/browse/HIVE-26382 Project: Hive Issue Type: Bug Components: Hive, HiveServer2 Affects Versions: 4.0.0-alpha-1 Reporter: mahesh kumar behera Assignee: mahesh kumar behera As part of HIVE-25990, a manifest file is generated listing the files to be moved; the files are then moved in the move task by referring to the manifest. For the partitioned-table flow, the move is not done. This prevents the dynamic partition creation, as the target path will be empty. Since the stats task needs the partition information, the stats task fails. {code:java} class="metastore.RetryingHMSHandler" level="ERROR" thread="pool-10-thread-144"] MetaException(message:Unable to update Column stats for ext_par due to: The IN list is empty!) org.apache.hadoop.hive.metastore.DirectSqlUpdateStat.updatePartitionColumnStatistics(DirectSqlUpdateStat.java:634) org.apache.hadoop.hive.metastore.MetaStoreDirectSql.updatePartitionColumnStatisticsBatch(MetaStoreDirectSql.java:2803) org.apache.hadoop.hive.metastore.ObjectStore.updatePartitionColumnStatisticsInBatch(ObjectStore.java:10001) sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:498) org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97) com.sun.proxy.$Proxy33.updatePartitionColumnStatisticsInBatch(Unknown Source) org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updatePartitionColStatsForOneBatch(HiveMetaStore.java:7124) org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updatePartitionColStatsInBatch(HiveMetaStore.java:7109) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
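The manifest/move interaction can be sketched as follows (file names and helpers are illustrative, not the HIVE-25990 implementation): if the move step is skipped, the target directory stays empty, so there is nothing from which to derive dynamic partitions for the stats task.

```python
import os
import tempfile

def write_manifest(manifest_path, files):
    # Record the files the move task is expected to relocate.
    with open(manifest_path, "w") as f:
        f.write("\n".join(files))

def run_move_task(manifest_path, target_dir):
    # Move every file listed in the manifest into the target directory.
    with open(manifest_path) as f:
        for src in f.read().splitlines():
            os.rename(src, os.path.join(target_dir, os.path.basename(src)))

staging_dir = tempfile.mkdtemp()
target_dir = tempfile.mkdtemp()
data_file = os.path.join(staging_dir, "part-00000")
with open(data_file, "w") as f:
    f.write("rows")

manifest = os.path.join(staging_dir, "_manifest")
write_manifest(manifest, [data_file])

# Bug: for partitioned tables the move is not done, so the target path is
# empty and no dynamic partitions can be created for the stats task.
assert os.listdir(target_dir) == []

run_move_task(manifest, target_dir)  # the missing step
assert os.listdir(target_dir) == ["part-00000"]
```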
[jira] [Created] (HIVE-26222) Native GeoSpatial Support in Hive
mahesh kumar behera created HIVE-26222: -- Summary: Native GeoSpatial Support in Hive Key: HIVE-26222 URL: https://issues.apache.org/jira/browse/HIVE-26222 Project: Hive Issue Type: Task Components: Hive, HiveServer2 Reporter: mahesh kumar behera Assignee: mahesh kumar behera This is an epic Jira to support geospatial datatypes natively in Hive, catering to applications that query large volumes of spatial data. The support will be added in a phased manner. To start with, we plan to make use of the framework developed by Esri ([https://github.com/Esri/spatial-framework-for-hadoop]). That project is not very active, and no release has been published to Maven Central, so it is not easy to pull the jars directly via a pom dependency. Also, the UDFs are based on an older version of Hive. So we have decided to make a copy of the repo and maintain it inside Hive. This will make it easier to do improvements and manage dependencies. As of now, data loading is done only on a binary data type; we need to enhance this to make it more user friendly. In the next phase, a native Geometry/Geography datatype will be supported, so users can directly create a geometry type and operate on it. Apart from these, we can start adding support for different indices (quad tree, R-tree), ORC/Parquet/Iceberg support, etc. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HIVE-26105) Show columns shows extra values if column comments contains specific Chinese character
mahesh kumar behera created HIVE-26105: -- Summary: Show columns shows extra values if column comments contains specific Chinese character Key: HIVE-26105 URL: https://issues.apache.org/jira/browse/HIVE-26105 Project: Hive Issue Type: Bug Components: Hive, HiveServer2 Reporter: mahesh kumar behera Assignee: mahesh kumar behera The issue happens because the encoded form of one of the Chinese characters contains the byte value of '\r' (CR). Because of this, the Hadoop line reader (used by the fetch task in Hive) treats whatever follows that byte as a new value, and an extra junk value is displayed. The problem character is 名 (0x540D): its last byte is 0x0D, i.e. 13, which the Hadoop line reader interprets as CR ('\r'), so an extra junk value appears in the output. For show columns, the comments are not needed, so only the column names should be written to the file. [https://github.com/apache/hadoop/blob/0fbd96a2449ec49f840d93e1c7d290c5218ef4ea/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L238] {code:java} create table tbl_test (fld0 string COMMENT '期 ' , fld string COMMENT '期末日期', fld1 string COMMENT '班次名称', fld2 string COMMENT '排班人数');
show columns from tbl_test;
+--------+
| field  |
+--------+
| fld    |
| fld0   |
| fld1   |
| �      |
| fld2   |
+--------+
5 rows selected (171.809 seconds) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
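The byte-level collision the report describes can be checked directly with a minimal Python sketch: the 16-bit code unit for 名 is 0x540D, and its low byte is 0x0D, i.e. CR.

```python
# U+540D (名) has the 16-bit code unit 0x540D; its low byte is 0x0D,
# which is the carriage-return byte ('\r') that line readers split on.
encoded = "名".encode("utf-16-be")
assert encoded == b"\x54\x0d"
assert b"\r" in encoded

# A byte-oriented reader that treats every 0x0D byte as a line break
# therefore cuts the value in two, producing the extra junk row seen in
# the `show columns` output above.
pieces = encoded.split(b"\r")
assert len(pieces) == 2
```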
[jira] [Created] (HIVE-26098) Duplicate path/Jar in hive.aux.jars.path or hive.reloadable.aux.jars.path causing IllegalArgumentException
mahesh kumar behera created HIVE-26098: -- Summary: Duplicate path/Jar in hive.aux.jars.path or hive.reloadable.aux.jars.path causing IllegalArgumentException Key: HIVE-26098 URL: https://issues.apache.org/jira/browse/HIVE-26098 Project: Hive Issue Type: Bug Components: Hive, HiveServer2 Reporter: mahesh kumar behera Assignee: mahesh kumar behera hive.aux.jars.path and hive.reloadable.aux.jars.path are used for providing auxiliary jars used during query processing. These jars are copied to the Tez temp path so that Tez jobs have access to them while processing the job. There is a duplicate check to avoid copying the same jar multiple times, but it assumes the jar is on the local file system. In reality the jar path can be on any filesystem, so the duplicate check fails when the source path is not a local path. {code:java} ERROR : Failed to execute tez graph. java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:53877/tmp/test_jar/identity_udf.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781) ~[hadoop-common-3.1.0.jar:?] at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86) ~[hadoop-common-3.1.0.jar:?] at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:636) ~[hadoop-common-3.1.0.jar:?] at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:930) ~[hadoop-common-3.1.0.jar:?] at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:631) ~[hadoop-common-3.1.0.jar:?] at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454) ~[hadoop-common-3.1.0.jar:?] 
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.checkPreExisting(DagUtils.java:1392) ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeResource(DagUtils.java:1411) ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.exec.tez.DagUtils.addTempResources(DagUtils.java:1295) ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeTempFilesFromConf(DagUtils.java:1177) ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.ensureLocalResources(TezSessionState.java:636) ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.openInternal(TezSessionState.java:283) ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolSession.openInternal(TezSessionPoolSession.java:124) ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:241) ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.exec.tez.TezTask.ensureSessionHasResources(TezTask.java:448) ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:215) [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212) [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:361) [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:334) [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:245) [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:106) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:348) [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:204) [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.Driver.run(Driver.java:153) [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.Driver.run(Driver.java:148) [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185) [hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:233) [hive-service-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hive.service.cli.operation.SQLOperation.access$500(SQLOperation.java:88) [hive-service-4.0.0-alpha-1.jar:4.0.0-alpha-1] at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:336) [hive-service-4.0.0-alpha-1.jar:4.0.0-alpha-1] at
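{code}

The fix direction can be sketched as follows (the helper names are illustrative, not DagUtils' API): resolve each jar path against the filesystem its own scheme names instead of assuming the local one.

```python
from urllib.parse import urlparse

def scheme_of(path):
    # Pick the filesystem by the path's own scheme; a bare path such as
    # /tmp/a.jar defaults to the local filesystem.
    return urlparse(path).scheme or "file"

def is_duplicate(candidate, already_localized):
    # The duplicate check must compare paths on the SAME filesystem;
    # probing an hdfs:// path through the local FS raises "Wrong FS".
    return any(scheme_of(p) == scheme_of(candidate) and p == candidate
               for p in already_localized)

assert scheme_of("hdfs://localhost:53877/tmp/test_jar/identity_udf.jar") == "hdfs"
assert scheme_of("/tmp/test_jar/identity_udf.jar") == "file"
assert is_duplicate("hdfs://nn/tmp/a.jar", ["hdfs://nn/tmp/a.jar"])
assert not is_duplicate("hdfs://nn/tmp/a.jar", ["/tmp/a.jar"])
```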
[jira] [Created] (HIVE-26017) Insert with partition value containing colon and space is creating partition having wrong value
mahesh kumar behera created HIVE-26017: -- Summary: Insert with partition value containing colon and space is creating partition having wrong value Key: HIVE-26017 URL: https://issues.apache.org/jira/browse/HIVE-26017 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera The path used for generating the dynamic partition value is obtained from the URI. This causes the serialised (percent-encoded) value to be used for partition name generation, so wrong names are generated. The path value should be used, not the URI. -- This message was sent by Atlassian Jira (v8.20.1#820001)
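The effect is easy to reproduce with standard URL encoding (a minimal sketch; the exact Hive code path is not shown): a partition value with a colon and a space survives as a path but not as a URI string.

```python
from urllib.parse import quote, unquote

# A dynamic-partition value containing a space and colons, e.g. a timestamp.
value = "2022-03-08 10:30:00"

# Reading the value back from a URI yields the percent-encoded form ...
encoded = quote(value)
assert encoded == "2022-03-08%2010%3A30%3A00"

# ... so a partition name built from the URI embeds %20/%3A instead of the
# real characters. Using the path value (or decoding first) avoids this.
assert unquote(encoded) == value
```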
[jira] [Created] (HIVE-25877) Load table from concurrent thread causes FileNotFoundException
mahesh kumar behera created HIVE-25877: -- Summary: Load table from concurrent thread causes FileNotFoundException Key: HIVE-25877 URL: https://issues.apache.org/jira/browse/HIVE-25877 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera As part of the direct insert optimisation (the same issue exists for MM tables, even without the direct insert optimisation), the files from Tez jobs are moved to the table directory for ACID tables, and then duplicate removal is done. Each session scans through the table and cleans up the files related to that specific session, but the iterator is created over all the files. So a FileNotFoundException is thrown when multiple sessions act on the same table and the first session cleans up its data while it is being read by the second session. {code:java} Caused by: java.io.FileNotFoundException: File hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/_tmp.delta_981_981_ does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275) ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249) ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194) ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190) ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208) ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?] 
at org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidatesRecursive(Utilities.java:4447) ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] at org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidates(Utilities.java:4413) ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] at org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2816) ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] {code} {code:java} Caused by: java.io.FileNotFoundException: File hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/.hive-staging_hive_2022-01-19_05-18-38_933_1683918321120508074-54 does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275) ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249) ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194) ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190) ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208) ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?] 
at org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?] at org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidatesRecursive(Utilities.java:4447) ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] at org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidates(Utilities.java:4413) ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] at org.apache.hadoop.hive.ql.exec.Utilities.getFullDPSpecs(Utilities.java:2971) ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] {code} -- This
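The race can be handled per-entry, as in this minimal sketch (helper names are illustrative, not Hive's actual code): tolerate entries that a concurrent session deletes between listing and inspection, instead of letting the iterator fail.

```python
import os
import tempfile

def list_session_files(table_dir, session_id):
    # Iterate a directory shared by several sessions; another session may
    # delete its temp files while we walk, so treat "vanished" entries as
    # skippable rather than fatal.
    found = []
    for name in os.listdir(table_dir):
        try:
            os.stat(os.path.join(table_dir, name))
        except FileNotFoundError:
            continue  # cleaned up concurrently by another session
        if session_id in name:
            found.append(name)
    return found

table_dir = tempfile.mkdtemp()
for name in ("_tmp.delta_981_981_sessionA", "_tmp.delta_982_982_sessionB"):
    open(os.path.join(table_dir, name), "w").close()

assert list_session_files(table_dir, "sessionA") == ["_tmp.delta_981_981_sessionA"]
```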
[jira] [Created] (HIVE-25868) AcidHouseKeeperService fails to purgeCompactionHistory if the COMPLETED_COMPACTIONS table has too many entries
mahesh kumar behera created HIVE-25868: -- Summary: AcidHouseKeeperService fails to purgeCompactionHistory if the COMPLETED_COMPACTIONS table has too many entries Key: HIVE-25868 URL: https://issues.apache.org/jira/browse/HIVE-25868 Project: Hive Issue Type: Bug Components: Hive, Metastore, Standalone Metastore Reporter: mahesh kumar behera Assignee: mahesh kumar behera To purge the entries, a prepared statement is created. If the number of entries bound into the prepared statement exceeds the backend DB's limit (for Postgres, around 32k), the operation fails. -- This message was sent by Atlassian Jira (v8.20.1#820001)
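The fix amounts to chunking the IN list, sketched below (the limit value and helper names are illustrative): never bind more parameters into one prepared statement than the backend allows.

```python
def chunked(ids, limit):
    # Split ids so that no single prepared statement exceeds the backend's
    # parameter limit (around 32k for Postgres).
    for i in range(0, len(ids), limit):
        yield ids[i:i + limit]

def purge_in_batches(ids, limit, execute):
    # One DELETE ... IN (?, ...) per chunk instead of one giant statement.
    for batch in chunked(ids, limit):
        placeholders = ",".join("?" * len(batch))
        execute("DELETE FROM COMPLETED_COMPACTIONS WHERE CC_ID IN (%s)"
                % placeholders, batch)

issued = []
purge_in_batches(list(range(70000)), 32000,
                 lambda sql, params: issued.append(len(params)))
assert issued == [32000, 32000, 6000]
```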
[jira] [Created] (HIVE-25864) Hive query optimisation creates wrong plan for predicate pushdown with windowing function
mahesh kumar behera created HIVE-25864: -- Summary: Hive query optimisation creates wrong plan for predicate pushdown with windowing function Key: HIVE-25864 URL: https://issues.apache.org/jira/browse/HIVE-25864 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera In the case of a query with a windowing function, the deterministic predicates are pushed down below the window function. Before pushing down, the predicate is converted to refer to the project operator's values. But the same conversion is done again while creating the project, thus generating a wrong plan. -- This message was sent by Atlassian Jira (v8.20.1#820001)
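The double conversion can be illustrated with plain index remapping (a hypothetical sketch, not the optimizer's actual data structures):

```python
def remap(predicate_cols, project_mapping):
    # Rewrite predicate column references in terms of the project's inputs.
    return [project_mapping[c] for c in predicate_cols]

# The project emits output 0 from input 2, output 1 from input 3, etc.
mapping = {0: 2, 1: 3, 2: 4, 3: 5}
predicate = [0, 1]        # predicate written against the project's outputs

once = remap(predicate, mapping)
assert once == [2, 3]     # correct: predicate now refers to the inputs

# The bug: converting AGAIN while creating the project shifts the
# references a second time, so the pushed-down predicate reads the wrong
# columns and the resulting plan is wrong.
twice = remap(once, mapping)
assert twice == [4, 5]
```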
[jira] [Created] (HIVE-25808) Analyse table does not fail for non existing partitions
mahesh kumar behera created HIVE-25808: -- Summary: Analyse table does not fail for non existing partitions Key: HIVE-25808 URL: https://issues.apache.org/jira/browse/HIVE-25808 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera If all the partition column values are given in the analyze command and the partition does not exist, the query fails. But if only some of the partition column values are given, it does not fail. analyze table tbl partition *(fld1 = 2, fld2 = 3)* COMPUTE STATISTICS FOR COLUMNS – this fails with a SemanticException if the partition corresponding to fld1 = 2, fld2 = 3 does not exist. But analyze table tbl partition *(fld1 = 2)* COMPUTE STATISTICS FOR COLUMNS does not fail, and it computes stats for the whole table. -- This message was sent by Atlassian Jira (v8.20.1#820001)
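The expected check can be sketched uniformly for full and partial specs (a minimal sketch, not Hive's analyzer code): a spec, partial or not, that matches no existing partition should fail.

```python
def matching_partitions(partitions, spec):
    # Keep only partitions that agree with every (column, value) pair in
    # the (possibly partial) partition spec.
    return [p for p in partitions
            if all(p.get(col) == val for col, val in spec.items())]

existing = [{"fld1": "1", "fld2": "3"}, {"fld1": "1", "fld2": "4"}]

# Fully specified, non-existing partition: nothing matches, and today
# this case already raises a SemanticException.
assert matching_partitions(existing, {"fld1": "2", "fld2": "3"}) == []

# Partially specified, non-existing fld1=2: also matches nothing, so the
# analyze command should fail here as well instead of computing stats for
# the whole table.
assert matching_partitions(existing, {"fld1": "2"}) == []
```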
[jira] [Created] (HIVE-25778) Hive DB creation is failing when MANAGEDLOCATION is specified with existing location
mahesh kumar behera created HIVE-25778: -- Summary: Hive DB creation is failing when MANAGEDLOCATION is specified with existing location Key: HIVE-25778 URL: https://issues.apache.org/jira/browse/HIVE-25778 Project: Hive Issue Type: Bug Components: HiveServer2, Metastore Reporter: mahesh kumar behera Assignee: mahesh kumar behera As part of HIVE-23387, a check was added to restrict the user from creating a database with a managed-table location if the location is already present. This was not the behaviour earlier. As this causes a backward-compatibility issue, the check needs to be removed. {code:java}
if (madeManagedDir) {
  LOG.info("Created database path in managed directory " + dbMgdPath);
} else {
  throw new MetaException("Unable to create database managed directory " + dbMgdPath
      + ", failed to create database " + db.getName());
}
{code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HIVE-25638) Select returns the deleted records in Hive ACID table
mahesh kumar behera created HIVE-25638: -- Summary: Select returns the deleted records in Hive ACID table Key: HIVE-25638 URL: https://issues.apache.org/jira/browse/HIVE-25638 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Hive stores the stripe stats in the ORC files. During select, these stats are used to create the SARG. The SARG is used to reduce the records read from the delete-delta files. Currently, in the case where the number of stripes is more than one, the generated SARG is incorrect because it uses the first stripe's index for both the min and max key intervals. The max key interval should be obtained from the last stripe's index. -- This message was sent by Atlassian Jira (v8.3.4#803005)
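The intended key interval can be shown with mock stripe statistics (a sketch with made-up key tuples, not ORC's reader API):

```python
# Mock per-stripe row-id key ranges as recorded in ORC stripe statistics:
# (writeId, bucket, rowId) tuples for the first and last keys of a stripe.
stripe_keys = [
    {"min": (1, 0, 0), "max": (10, 0, 5)},    # first stripe
    {"min": (11, 0, 0), "max": (20, 0, 9)},   # last stripe
]

# Buggy SARG: both ends taken from the FIRST stripe, so deleted rows whose
# keys live in later stripes fall outside the interval and are never
# matched against the delete-delta files.
buggy = (stripe_keys[0]["min"], stripe_keys[0]["max"])
assert buggy == ((1, 0, 0), (10, 0, 5))

# Correct SARG: min from the first stripe, max from the LAST stripe.
fixed = (stripe_keys[0]["min"], stripe_keys[-1]["max"])
assert fixed == ((1, 0, 0), (20, 0, 9))
```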
[jira] [Created] (HIVE-25540) Enable batch update of column stats only for MySQL and Postgres
mahesh kumar behera created HIVE-25540: -- Summary: Enable batch update of column stats only for MySQL and Postgres Key: HIVE-25540 URL: https://issues.apache.org/jira/browse/HIVE-25540 Project: Hive Issue Type: Sub-task Reporter: mahesh kumar behera Assignee: mahesh kumar behera The batch update of partition column stats using direct SQL has been tested only for MySQL and Postgres, so it should be enabled only for those backends. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down.
mahesh kumar behera created HIVE-25527: -- Summary: LLAP Scheduler task exits with fatal error if the executor node is down. Key: HIVE-25527 URL: https://issues.apache.org/jira/browse/HIVE-25527 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: mahesh kumar behera Assignee: mahesh kumar behera If the executor host has gone down, activeInstances is updated with null, so empty/null values must be checked for before accessing it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-25438) Update partition column stats fails with invalid syntax error for MySQL
mahesh kumar behera created HIVE-25438: -- Summary: Update partition column stats fails with invalid syntax error for MySQL Key: HIVE-25438 URL: https://issues.apache.org/jira/browse/HIVE-25438 Project: Hive Issue Type: Sub-task Reporter: mahesh kumar behera Assignee: mahesh kumar behera Double-quoted identifiers are not supported by MySQL unless the ANSI_QUOTES SQL mode is set. -- This message was sent by Atlassian Jira (v8.3.4#803005)
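The dialect difference can be captured in one helper (an illustrative sketch, not the metastore's SQL generator): MySQL needs backticks unless ANSI_QUOTES is enabled, while Postgres accepts ANSI double quotes.

```python
def quote_identifier(name, dbtype):
    # MySQL treats "..." as a string literal unless the ANSI_QUOTES SQL
    # mode is set, so identifiers must use backticks there; Postgres (and
    # ANSI SQL generally) use double quotes for identifiers.
    if dbtype == "mysql":
        return "`%s`" % name
    return '"%s"' % name

assert quote_identifier("PART_ID", "mysql") == "`PART_ID`"
assert quote_identifier("PART_ID", "postgres") == '"PART_ID"'
```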
[jira] [Created] (HIVE-25432) Support Join reordering for null safe equality operator.
mahesh kumar behera created HIVE-25432: -- Summary: Support Join reordering for null safe equality operator. Key: HIVE-25432 URL: https://issues.apache.org/jira/browse/HIVE-25432 Project: Hive Issue Type: Sub-task Components: Hive, HiveServer2 Reporter: mahesh kumar behera Support Join reordering for null safe equality operator. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-25431) Enable CBO for null safe equality operator.
mahesh kumar behera created HIVE-25431: -- Summary: Enable CBO for null safe equality operator. Key: HIVE-25431 URL: https://issues.apache.org/jira/browse/HIVE-25431 Project: Hive Issue Type: Bug Components: Hive, HiveServer2 Reporter: mahesh kumar behera Assignee: mahesh kumar behera The CBO is disabled for the null safe equality (<=>) operator. This causes suboptimal join execution for some queries. As null safe equality is supported by joins, the CBO can be enabled for it. There will still be issues with join reordering, as Hive does not support join reordering for the null safe equality operator, but with CBO enabled the join plan will be better. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-25417) Null bit vector is not handled while getting the stats for Postgres backend
mahesh kumar behera created HIVE-25417: -- Summary: Null bit vector is not handled while getting the stats for Postgres backend Key: HIVE-25417 URL: https://issues.apache.org/jira/browse/HIVE-25417 Project: Hive Issue Type: Sub-task Components: HiveServer2 Reporter: mahesh kumar behera Assignee: mahesh kumar behera While adding stats with a null bit vector, a special string "HL" is stored, as Postgres does not support null values for byte columns. But while getting the stats, the conversion back to null is not done. This causes a failure during deserialisation of the bit vector field if the existing stats are used for a merge. {code:java} The input stream is not a HyperLogLog stream. 7276-1 instead of 727676 or 7077
 at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.checkMagicString(HyperLogLogUtils.java:349)
 at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:139)
 at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:213)
 at org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:227)
 at org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimatorFactory.getNumDistinctValueEstimator(NumDistinctValueEstimatorFactory.java:53)
 at org.apache.hadoop.hive.metastore.columnstats.cache.LongColumnStatsDataInspector.updateNdvEstimator(LongColumnStatsDataInspector.java:124)
 at org.apache.hadoop.hive.metastore.columnstats.cache.LongColumnStatsDataInspector.getNdvEstimator(LongColumnStatsDataInspector.java:107)
 at org.apache.hadoop.hive.metastore.columnstats.merge.LongColumnStatsMerger.merge(LongColumnStatsMerger.java:36)
 at org.apache.hadoop.hive.metastore.utils.MetaStoreUtils.mergeColStats(MetaStoreUtils.java:1174)
 at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updateTableColumnStatsWithMerge(HiveMetaStore.java:8934)
 at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.set_aggr_stats_for(HiveMetaStore.java:8800)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:160)
 at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:121)
 at com.sun.proxy.$Proxy35.set_aggr_stats_for(Unknown Source)
 at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$set_aggr_stats_for.getResult(ThriftHiveMetastore.java:20489)
 at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$set_aggr_stats_for.getResult(ThriftHiveMetastore.java:20473)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:643)
 at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:638)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
 at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:638)
 at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
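The missing symmetric conversion can be sketched as follows ("HL" and the helper names below are illustrative, taken from the report's description rather than the actual code):

```python
SENTINEL = b"HL"  # stand-in stored when the bit vector is null

def store_bit_vector(bit_vector):
    # On write, null bit vectors are replaced by the sentinel because the
    # backend cannot store a null value for this byte column.
    return SENTINEL if bit_vector is None else bit_vector

def load_bit_vector(stored):
    # The missing step on read: map the sentinel back to null instead of
    # handing it to the HyperLogLog deserializer, which rejects it.
    return None if stored == SENTINEL else stored

assert load_bit_vector(store_bit_vector(None)) is None
assert load_bit_vector(store_bit_vector(b"real-hll-bytes")) == b"real-hll-bytes"
```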
[jira] [Created] (HIVE-25373) Modify buildColumnStatsDesc to send a configured number of stats for update
mahesh kumar behera created HIVE-25373: -- Summary: Modify buildColumnStatsDesc to send a configured number of stats for update Key: HIVE-25373 URL: https://issues.apache.org/jira/browse/HIVE-25373 Project: Hive Issue Type: Sub-task Reporter: mahesh kumar behera Assignee: mahesh kumar behera The number of stats sent in one update request should be controlled to avoid a Thrift error when the payload size exceeds the limit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-25342) Optimize set_aggr_stats_for for mergeColStats path.
mahesh kumar behera created HIVE-25342: -- Summary: Optimize set_aggr_stats_for for mergeColStats path. Key: HIVE-25342 URL: https://issues.apache.org/jira/browse/HIVE-25342 Project: Hive Issue Type: Sub-task Reporter: mahesh kumar behera Assignee: mahesh kumar behera The direct-SQL optimisation used in the normal path can also be applied to the mergeColStats path: the stats to be updated can be accumulated in a temporary list, and that list can then be used to update the stats in a single batch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-25251) Reduce overhead of adding partitions during batch loading of partitions.
mahesh kumar behera created HIVE-25251: -- Summary: Reduce overhead of adding partitions during batch loading of partitions. Key: HIVE-25251 URL: https://issues.apache.org/jira/browse/HIVE-25251 Project: Hive Issue Type: Sub-task Reporter: mahesh kumar behera Assignee: mahesh kumar behera The add-partitions call to HMS executes the DataNucleus calls serially to add the partitions to the backend DB. This can be optimised further by batching those SQL statements.
[jira] [Created] (HIVE-25225) Update column stat throws NPE if direct sql is disabled
mahesh kumar behera created HIVE-25225: -- Summary: Update column stat throws NPE if direct sql is disabled Key: HIVE-25225 URL: https://issues.apache.org/jira/browse/HIVE-25225 Project: Hive Issue Type: Sub-task Reporter: mahesh kumar behera Assignee: mahesh kumar behera When direct SQL is disabled, the MetaStoreDirectSql object is not initialised, which causes an NPE.
[jira] [Created] (HIVE-25205) Reduce overhead of adding write notification log during batch loading of partition.
mahesh kumar behera created HIVE-25205: -- Summary: Reduce overhead of adding write notification log during batch loading of partition. Key: HIVE-25205 URL: https://issues.apache.org/jira/browse/HIVE-25205 Project: Hive Issue Type: Sub-task Components: Hive, HiveServer2 Reporter: mahesh kumar behera Assignee: mahesh kumar behera During batch loading of partitions, a write notification log entry is added for each partition added. This delays execution, as a call to HMS is made per partition. It can be optimised by adding a new HMS API that accepts a batch of partitions, so the whole batch can be added to the backend database together. Once HMS has a batch of notification logs, the code can be optimised to add them with a single call to the backend RDBMS.
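The batching idea above can be sketched as follows. This is an illustrative Python sketch, not the actual HMS API: the function name `add_write_notification_batch` and the `NOTIFICATION_LOG` statement shape are hypothetical stand-ins for the proposed batch API.

```python
def chunks(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def add_write_notification_batch(events, batch_size=1000):
    """Build one multi-row INSERT per batch instead of one statement
    (and one HMS round trip) per partition event."""
    statements = []
    for batch in chunks(events, batch_size):
        values = ", ".join("(%r)" % e for e in batch)
        statements.append("INSERT INTO NOTIFICATION_LOG VALUES " + values)
    return statements

# 2500 partition events with batch size 1000 -> 3 statements instead of 2500
stmts = add_write_notification_batch(["p=%d" % i for i in range(2500)])
```

The same chunking shape applies to the other batching tickets in this digest (add-partitions, column-stats updates): accumulate, chunk, and issue one backend statement per chunk.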
[jira] [Created] (HIVE-25204) Reduce overhead of adding notification log for update partition column statistics
mahesh kumar behera created HIVE-25204: -- Summary: Reduce overhead of adding notification log for update partition column statistics Key: HIVE-25204 URL: https://issues.apache.org/jira/browse/HIVE-25204 Project: Hive Issue Type: Sub-task Components: Hive, HiveServer2 Reporter: mahesh kumar behera Assignee: mahesh kumar behera The notification logs for partition column statistics can be optimised by adding them in a batch. In the current implementation they are added one by one, causing multiple SQL executions in the backend RDBMS. These SQL executions can be batched to reduce the execution time.
[jira] [Created] (HIVE-25181) Analyse and optimise execution time for batch loading of partitions.
mahesh kumar behera created HIVE-25181: -- Summary: Analyse and optimise execution time for batch loading of partitions. Key: HIVE-25181 URL: https://issues.apache.org/jira/browse/HIVE-25181 Project: Hive Issue Type: Task Reporter: mahesh kumar behera Assignee: mahesh kumar behera When partitions are loaded in batches of more than 10k, the execution time can exceed hours. This may be an issue for ETL-type workloads. This task tracks those issues and their fixes.
[jira] [Created] (HIVE-25142) Rehashing in map join fast hash table causing corruption for large keys
mahesh kumar behera created HIVE-25142: -- Summary: Rehashing in map join fast hash table causing corruption for large keys Key: HIVE-25142 URL: https://issues.apache.org/jira/browse/HIVE-25142 Project: Hive Issue Type: Bug Components: Hive Reporter: mahesh kumar behera Assignee: mahesh kumar behera In map join, the hash table is created from the keys. To support rehashing, the keys are stored in a write buffer, and the hash table holds the offset of each key along with its hash code. When rehashing is done, the offset is read from the hash table and the hash code is generated again. For large keys (size greater than 255), the key length is also stored along with the key. In the fast hash table implementation the key is not extracted correctly in this case: a code bug causes the wrong key bytes to be read, producing a wrong hash code and corrupting the hash table.
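The long-key layout described above can be illustrated with a small sketch. This is not Hive's actual write-buffer format; it only demonstrates the idea that keys over 255 bytes carry an explicit length prefix which must be honoured when the key is re-read during rehashing.

```python
def write_key(buf: bytearray, key: bytes) -> int:
    """Append a key to the write buffer and return its offset.
    Long keys get a 0xFF marker byte followed by a 4-byte length."""
    offset = len(buf)
    if len(key) < 255:
        buf.append(len(key))
    else:
        buf.append(255)
        buf.extend(len(key).to_bytes(4, "big"))
    buf.extend(key)
    return offset

def read_key(buf: bytearray, offset: int) -> bytes:
    """Read a key back, honouring the long-key length prefix; skipping
    the prefix incorrectly is exactly the corruption described above."""
    marker = buf[offset]
    if marker < 255:
        start, length = offset + 1, marker
    else:
        length = int.from_bytes(buf[offset + 1:offset + 5], "big")
        start = offset + 5
    return bytes(buf[start:start + length])

buf = bytearray()
small = write_key(buf, b"k" * 10)
large = write_key(buf, b"x" * 300)
assert read_key(buf, small) == b"k" * 10
assert read_key(buf, large) == b"x" * 300
```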
[jira] [Created] (HIVE-25042) Add support for map data type in Common merge join and SMB Join
mahesh kumar behera created HIVE-25042: -- Summary: Add support for map data type in Common merge join and SMB Join Key: HIVE-25042 URL: https://issues.apache.org/jira/browse/HIVE-25042 Project: Hive Issue Type: Sub-task Components: Hive, HiveServer2 Reporter: mahesh kumar behera Merge join results depend on the underlying sorter used by the mapper task, as the direction must be judged after each key comparison. So the comparison done during the join has to match the way the records were sorted by the mapper. With the sorter used by the mapper task (PipelinedSorter), hash maps containing the same key-value pairs in a different order are not equal, and the merge join behaves the same way; map join, however, treats them as equal. We have to modify the PipelinedSorter code to handle the map data type, and then add support for map types in the join code.
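The order-sensitivity problem can be shown with a minimal sketch: if map entries are normalised into sorted-key order before comparison (one possible fix, not necessarily the one Hive adopts), two maps holding the same pairs in different insertion order compare equal, matching map-join semantics.

```python
def normalise(m: dict):
    """Canonical, order-insensitive form of a map for key comparison."""
    return tuple(sorted(m.items()))

a = {"k1": 1, "k2": 2}
b = {"k2": 2, "k1": 1}

# Raw insertion-order serialisations differ (the merge-join view)...
assert list(a.items()) != list(b.items())
# ...but the sorted forms used for comparison agree (the map-join view).
assert normalise(a) == normalise(b)
```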
[jira] [Created] (HIVE-24996) Conversion of PIG script with multiple store causing the merging of multiple sql statements
mahesh kumar behera created HIVE-24996: -- Summary: Conversion of PIG script with multiple store causing the merging of multiple sql statements Key: HIVE-24996 URL: https://issues.apache.org/jira/browse/HIVE-24996 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera The SQL writer is not reset after a SQL statement is converted. This causes the next SQL statements to be merged with the previous one.
[jira] [Created] (HIVE-24995) Add support for complex type operator in Join with non equality condition
mahesh kumar behera created HIVE-24995: -- Summary: Add support for complex type operator in Join with non equality condition Key: HIVE-24995 URL: https://issues.apache.org/jira/browse/HIVE-24995 Project: Hive Issue Type: Sub-task Reporter: mahesh kumar behera Assignee: mahesh kumar behera This sub-task is specifically to support non-equality comparisons, such as greater-than and less-than, as join conditions.
[jira] [Created] (HIVE-24989) Support vectorisation of join with key columns of complex types
mahesh kumar behera created HIVE-24989: -- Summary: Support vectorisation of join with key columns of complex types Key: HIVE-24989 URL: https://issues.apache.org/jira/browse/HIVE-24989 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 Hive fails to execute joins on array-type columns, as the comparison functions cannot handle them.
[jira] [Created] (HIVE-24988) Add support for complex types columns for Dynamic Partition pruning Optimisation
mahesh kumar behera created HIVE-24988: -- Summary: Add support for complex types columns for Dynamic Partition pruning Optimisation Key: HIVE-24988 URL: https://issues.apache.org/jira/browse/HIVE-24988 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 Hive fails to execute joins on array-type columns, as the comparison functions cannot handle them.
[jira] [Created] (HIVE-24977) Query compilation failing with NPE during reduce sink deduplication
mahesh kumar behera created HIVE-24977: -- Summary: Query compilation failing with NPE during reduce sink deduplication Key: HIVE-24977 URL: https://issues.apache.org/jira/browse/HIVE-24977 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera During reduce sink deduplication, if some columns from the RS cannot be backtracked to a terminal operator, null is returned. The null check is present in some cases but missing in others.
[jira] [Created] (HIVE-24883) Add support for array type columns in Hive Joins
mahesh kumar behera created HIVE-24883: -- Summary: Add support for array type columns in Hive Joins Key: HIVE-24883 URL: https://issues.apache.org/jira/browse/HIVE-24883 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 Hive fails to execute joins on array-type columns, as the comparison functions cannot handle them.
[jira] [Created] (HIVE-24589) Drop catalog failing with deadlock error for Oracle backend dbms.
mahesh kumar behera created HIVE-24589: -- Summary: Drop catalog failing with deadlock error for Oracle backend dbms. Key: HIVE-24589 URL: https://issues.apache.org/jira/browse/HIVE-24589 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera When we drop a catalog we delete it from the CTLGS table. The DBS table has a foreign key reference on CTLGS for CTLG_NAME. This causes the DBS table to be locked exclusively, leading to deadlocks. It can be avoided by creating an index on CTLG_NAME in the DBS table. {code:java} CREATE INDEX CTLG_NAME_DBS ON DBS(CTLG_NAME); {code} {code:java} Oracle Database maximizes the concurrency control of parent keys in relation to dependent foreign keys. Locking behaviour depends on whether foreign key columns are indexed. If foreign keys are not indexed, then the child table will probably be locked more frequently, deadlocks will occur, and concurrency will be decreased. For this reason foreign keys should almost always be indexed. The only exception is when the matching unique or primary key is never updated or deleted.{code}
[jira] [Created] (HIVE-24580) Add support for combiner in hash mode group aggregation (Support for distinct)
mahesh kumar behera created HIVE-24580: -- Summary: Add support for combiner in hash mode group aggregation (Support for distinct) Key: HIVE-24580 URL: https://issues.apache.org/jira/browse/HIVE-24580 Project: Hive Issue Type: Bug Components: Hive Reporter: mahesh kumar behera Assignee: mahesh kumar behera In map-side group aggregation, a partial grouped aggregation is calculated to reduce the data written to disk by the map task. For hash aggregation, where the input data is not sorted, a hash table is used (with sorting also performed before flushing). If the hash table size grows beyond a configurable limit, the data is flushed to disk and a new hash table is created. If the reduction achieved by the hash table is less than the minimum hash aggregation reduction estimated at compile time, map-side aggregation is converted to streaming mode. So if the first few batches of records do not yield a significant reduction, the mode is switched to streaming, which may hurt performance if subsequent batches have fewer distinct values. To improve performance in both hash and streaming mode, a combiner can be added to the map task after the keys are sorted. This ensures that aggregation is done where possible and reduces the data written to disk.
[jira] [Created] (HIVE-24503) Optimize vector row serde to avoid type check at run time
mahesh kumar behera created HIVE-24503: -- Summary: Optimize vector row serde to avoid type check at run time Key: HIVE-24503 URL: https://issues.apache.org/jira/browse/HIVE-24503 Project: Hive Issue Type: Bug Components: Hive Reporter: mahesh kumar behera Assignee: mahesh kumar behera Serialization/deserialization of vectorized batches in VectorSerializeRow and VectorDeserializeRow performs a type check for each column of each row. This becomes very costly when there are billions of rows to read/write. It can be optimized by doing the type check once at init time and creating type-specific reader/writer classes.
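The init-time dispatch idea can be sketched as follows; the `make_reader` helper and the string type names are illustrative stand-ins, not Hive's actual VectorSerializeRow/VectorDeserializeRow machinery.

```python
def make_reader(col_type):
    """Select a conversion function once, at init time, so the hot loop
    never re-checks the column type per row."""
    if col_type == "int":
        return int
    if col_type == "double":
        return float
    if col_type == "string":
        return str
    raise ValueError("unsupported type: " + col_type)

schema = ["int", "double", "string"]
readers = [make_reader(t) for t in schema]  # type dispatch done exactly once

rows = [["1", "2.5", "a"], ["7", "0.5", "b"]]
# Hot loop: plain function calls, no per-value type branching.
decoded = [[r(v) for r, v in zip(readers, row)] for row in rows]
assert decoded == [[1, 2.5, "a"], [7, 0.5, "b"]]
```

In Java the equivalent design is a small class (or lambda) per type instantiated during init, replacing a per-row switch over the type enum.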
[jira] [Created] (HIVE-24471) Add support for combiner in hash mode group aggregation
mahesh kumar behera created HIVE-24471: -- Summary: Add support for combiner in hash mode group aggregation Key: HIVE-24471 URL: https://issues.apache.org/jira/browse/HIVE-24471 Project: Hive Issue Type: Bug Components: Hive Reporter: mahesh kumar behera Assignee: mahesh kumar behera In map-side group aggregation, a partial grouped aggregation is calculated to reduce the data written to disk by the map task. For hash aggregation, where the input data is not sorted, a hash table is used. If the hash table size grows beyond a configurable limit, the data is flushed to disk and a new hash table is created. If the reduction achieved by the hash table is less than the minimum hash aggregation reduction estimated at compile time, map-side aggregation is converted to streaming mode. So if the first few batches of records do not yield a significant reduction, the mode is switched to streaming, which may hurt performance if subsequent batches have fewer distinct values. To mitigate this, a combiner can be added to the map task after the keys are sorted. This ensures that aggregation is done where possible and reduces the data written to disk.
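A toy model of the hash-mode/streaming-mode decision described above; the thresholds, names, and sum aggregation are illustrative, not Hive's actual configuration or operator code.

```python
def map_side_aggregate(records, max_entries=4, min_reduction=0.5):
    """Partial map-side SUM aggregation: keep a hash table of partial
    sums, flush when it exceeds max_entries, and fall back to streaming
    mode when the observed row reduction is below min_reduction."""
    table, seen, flushed, streaming = {}, 0, [], False
    for key, val in records:
        if streaming:
            flushed.append((key, val))  # no partial aggregation any more
            continue
        seen += 1
        table[key] = table.get(key, 0) + val
        if len(table) >= max_entries:
            # reduction = fraction of input rows eliminated by grouping
            if 1 - len(table) / seen < min_reduction:
                streaming = True
            flushed.extend(table.items())
            table, seen = {}, 0
    flushed.extend(table.items())
    return flushed, streaming

# Mostly-distinct keys give poor reduction, so the mode switches to streaming.
_, switched = map_side_aggregate([(i, 1) for i in range(10)])
assert switched
# Heavily repeated keys keep hash mode.
_, switched = map_side_aggregate([(i % 2, 1) for i in range(10)])
assert not switched
```

The combiner proposed in the ticket would re-aggregate the flushed (sorted) records before they are written out, so even the streaming fallback still benefits from partial aggregation.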
[jira] [Created] (HIVE-24378) Leading and trailing spaces are not removed before decimal conversion
mahesh kumar behera created HIVE-24378: -- Summary: Leading and trailing spaces are not removed before decimal conversion Key: HIVE-24378 URL: https://issues.apache.org/jira/browse/HIVE-24378 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera In some scenarios the decimal conversion does not remove the extra leading and trailing spaces. Because of this, such numbers are converted to null.
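A minimal illustration of the fix, assuming the conversion should trim surrounding whitespace before parsing; this mirrors, but does not reproduce, Hive's decimal handling.

```python
import re
from decimal import Decimal

_NUM = re.compile(r"-?\d+(\.\d+)?")

def to_decimal(s):
    """Convert a string to Decimal, trimming surrounding spaces first;
    return None (Hive would produce NULL) when the value is not numeric."""
    s = s.strip()  # the fix: drop leading/trailing spaces before parsing
    return Decimal(s) if _NUM.fullmatch(s) else None

assert to_decimal("  12.5  ") == Decimal("12.5")
assert to_decimal("abc") is None
```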
[jira] [Created] (HIVE-24373) Wrong predicate is pushed down for view with constant value projection.
mahesh kumar behera created HIVE-24373: -- Summary: Wrong predicate is pushed down for view with constant value projection. Key: HIVE-24373 URL: https://issues.apache.org/jira/browse/HIVE-24373 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera For the query below, the predicate pushed down for one of the table scans is incorrect. {code:java} set hive.explain.user=false;
set hive.cbo.enable=false;
set hive.optimize.ppd=true;
DROP TABLE arc;
CREATE table arc(`dt_from` string, `dt_to` string);
CREATE table loc1(`dt_from` string, `dt_to` string);
CREATE VIEW view AS SELECT '' as DT_FROM, uuid() as DT_TO FROM loc1 UNION ALL SELECT dt_from as DT_FROM, uuid() as DT_TO FROM arc;
EXPLAIN SELECT dt_from, dt_to FROM view WHERE '2020' between dt_from and dt_to; {code} For table loc1, DT_FROM is projected as '' so the predicate "predicate: '2020' BETWEEN '' AND _col1 (type: boolean)" is correct. But for table arc the column itself is projected, so the predicate should be "predicate: '2020' BETWEEN _col0 (type: boolean) AND _col1 (type: boolean)". This happens because the predicates are stored in a map keyed by expression; here the expression is "_col0". When the predicate is pushed down through the union, the same predicate object is used to create the filter expression in both branches. Later, when constant replacement is done, the first filter overwrites the second one. So we should create a clone (as done at other places) before using the cached predicate for the filter; this way the overwrite can be avoided.
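The overwrite can be reproduced in miniature. In the sketch below a plain dict stands in for the predicate expression tree (Hive's real ExprNodeDesc is Java), showing why a clone per union branch is needed:

```python
import copy

pred = {"op": "BETWEEN", "args": ["_col0", "_col1"]}

# Without cloning, both union branches share the same predicate object:
branches = [pred, pred]
branches[0]["args"][0] = "''"          # constant replacement in branch 1...
assert branches[1]["args"][0] == "''"  # ...silently rewrites branch 2 too

# With a clone per branch (the proposed fix), each filter stays independent:
pred = {"op": "BETWEEN", "args": ["_col0", "_col1"]}
branches = [copy.deepcopy(pred), copy.deepcopy(pred)]
branches[0]["args"][0] = "''"
assert branches[1]["args"][0] == "_col0"
```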
[jira] [Created] (HIVE-24362) AST tree processing is suboptimal for tree with large number of nodes
mahesh kumar behera created HIVE-24362: -- Summary: AST tree processing is suboptimal for tree with large number of nodes Key: HIVE-24362 URL: https://issues.apache.org/jira/browse/HIVE-24362 Project: Hive Issue Type: Bug Components: Hive Reporter: mahesh kumar behera Assignee: mahesh kumar behera In Hive, the children of an AST node are stored as a list of objects. During processing of a node's children, that list of objects is converted to a list of Nodes. This can cause long compilation times when the number of children is large. The children list can be cached in the AST node to avoid this recomputation.
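A sketch of the proposed caching; the class below is hypothetical (Hive's real ASTNode is Java, backed by ANTLR), but it shows the memoisation pattern:

```python
class ASTNode:
    """Toy AST node that converts its raw children list to typed
    children once and caches the result for all later traversals."""
    def __init__(self, raw_children=None):
        self.raw_children = raw_children or []
        self._children = None  # cache, filled lazily

    @property
    def children(self):
        if self._children is None:
            # expensive conversion done once instead of per traversal
            self._children = [c for c in self.raw_children
                              if isinstance(c, ASTNode)]
        return self._children

root = ASTNode([ASTNode(), ASTNode(), "token"])
first = root.children
assert root.children is first  # the cached list object is reused
assert len(first) == 2
```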
[jira] [Created] (HIVE-24284) NPE when parsing druid logs using Hive
mahesh kumar behera created HIVE-24284: -- Summary: NPE when parsing druid logs using Hive Key: HIVE-24284 URL: https://issues.apache.org/jira/browse/HIVE-24284 Project: Hive Issue Type: Bug Components: Hive Reporter: mahesh kumar behera Assignee: mahesh kumar behera The current syslog parser always expects a valid proc id. But per RFC 3164 and RFC 5424, the proc id can be omitted. So Hive should handle this by using the NILVALUE/an empty string when the proc id is null. {code:java} Caused by: java.lang.NullPointerException: null
 at java.lang.String.(String.java:566)
 at org.apache.hadoop.hive.ql.log.syslog.SyslogParser.createEvent(SyslogParser.java:361)
 at org.apache.hadoop.hive.ql.log.syslog.SyslogParser.readEvent(SyslogParser.java:326)
 at org.apache.hadoop.hive.ql.log.syslog.SyslogSerDe.deserialize(SyslogSerDe.java:95) {code}
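The proposed handling could look like this sketch; `parse_procid` is a hypothetical helper, not Hive's API. RFC 5424 uses a single dash ("-") as the NILVALUE for an absent PROCID field.

```python
NILVALUE = "-"  # RFC 5424: a lone dash marks an absent header field

def parse_procid(field):
    """Return '' when PROCID is absent (None) or the NILVALUE, so callers
    never construct a String from a null proc id."""
    if field is None or field == NILVALUE:
        return ""
    return field

assert parse_procid("-") == ""
assert parse_procid(None) == ""
assert parse_procid("1234") == "1234"
```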
[jira] [Created] (HIVE-24198) Map side SMB join producing wrong result
mahesh kumar behera created HIVE-24198: -- Summary: Map side SMB join producing wrong result Key: HIVE-24198 URL: https://issues.apache.org/jira/browse/HIVE-24198 Project: Hive Issue Type: Bug Components: Hive Reporter: mahesh kumar behera Assignee: mahesh kumar behera
CREATE TABLE tbl1_n5(key int, value string) CLUSTERED BY (key) SORTED BY (key) INTO 2 BUCKETS;
CREATE TABLE tbl2_n4(key int, value string) CLUSTERED BY (key) SORTED BY (key) INTO 2 BUCKETS;
set hive.auto.convert.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.to.mapjoin=false;
set hive.auto.convert.join.noconditionaltask.size=1;
set hive.optimize.semijoin.conversion = false;
insert into tbl2_n4 values (2, 'val_2'), (0, 'val_0'), (0, 'val_0'), (0, 'val_0'), (4, 'val_4'), (5, 'val_5'), (5, 'val_5'), (5, 'val_5'), (8, 'val_8'), (9, 'val_9');
insert into tbl1_n5 values (2, 'val_2'), (0, 'val_0'), (0, 'val_0'), (0, 'val_0'), (4, 'val_4'), (5, 'val_5'), (5, 'val_5'), (5, 'val_5'), (8, 'val_8'), (9, 'val_9');
Select * from (select b.key as key, count(*) as value from tbl1_n5 b where key < 6 group by b.key) subq1 join (select a.key as key, a.value as value from tbl2_n4 a where key < 6) subq2 on subq1.key = subq2.key;
The above select produces 0,0,0,2,4,5,5,5,5,5,5 instead of 0,0,0,2,4,5,5,5.
[jira] [Created] (HIVE-24013) Move anti join conversion after join reordering rule
mahesh kumar behera created HIVE-24013: -- Summary: Move anti join conversion after join reordering rule Key: HIVE-24013 URL: https://issues.apache.org/jira/browse/HIVE-24013 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera The current anti join conversion does not check for null filters on the right side of the join when they appear within OR conditions; only filters separated by AND conditions are supported. For example, queries like "select t1.fld from tbl1 t1 left join tbl2 t2 on t1.fld = t2.fld where t2.fld is null or t2.fld1 is null" are not converted to anti join.
[jira] [Created] (HIVE-23992) Support null filter within or clause for Anti Join
mahesh kumar behera created HIVE-23992: -- Summary: Support null filter within or clause for Anti Join Key: HIVE-23992 URL: https://issues.apache.org/jira/browse/HIVE-23992 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera The current anti join conversion does not support a join condition that is always true. Queries like "select * from tbl t1 where not exists (select 1 from t2)" are not converted to anti join.
[jira] [Created] (HIVE-23991) Support isAlwaysTrue for Anti Join
mahesh kumar behera created HIVE-23991: -- Summary: Support isAlwaysTrue for Anti Join Key: HIVE-23991 URL: https://issues.apache.org/jira/browse/HIVE-23991 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera The current anti join conversion does not support direct conversion of not-exists to anti join. The not-exists sub-query is first converted to a left outer join and then converted to anti join. This may cause some optimization rules to be skipped.
[jira] [Created] (HIVE-23981) Use task counter enum to get the approximate counter value
mahesh kumar behera created HIVE-23981: -- Summary: Use task counter enum to get the approximate counter value Key: HIVE-23981 URL: https://issues.apache.org/jira/browse/HIVE-23981 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera There are cases where the compiler misestimates the key count, which results in a number of hash table resizes at runtime. [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java#L128] In such cases, it would be good to get the "approximate_input_records" (TEZ-4207) counter from upstream to compute the key count more accurately at runtime.
[jira] [Created] (HIVE-23933) Add getRowCountInt support for anti join in calcite.
mahesh kumar behera created HIVE-23933: -- Summary: Add getRowCountInt support for anti join in calcite. Key: HIVE-23933 URL: https://issues.apache.org/jira/browse/HIVE-23933 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera The current anti join conversion does not support direct conversion of not-exists to anti join. The not-exists sub-query is first converted to a left outer join and then converted to anti join. This may cause some optimization rules to be skipped.
[jira] [Created] (HIVE-23928) Support conversion of not-exists to Anti join directly
mahesh kumar behera created HIVE-23928: -- Summary: Support conversion of not-exists to Anti join directly Key: HIVE-23928 URL: https://issues.apache.org/jira/browse/HIVE-23928 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera Support HiveJoinProjectTransposeRule for Anti Join
[jira] [Created] (HIVE-23921) Support HiveJoinProjectTransposeRule for Anti Join
mahesh kumar behera created HIVE-23921: -- Summary: Support HiveJoinProjectTransposeRule for Anti Join Key: HIVE-23921 URL: https://issues.apache.org/jira/browse/HIVE-23921 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera If we have a PK-FK join that is only appending columns to the FK side, it basically means it is not filtering anything (everything matches). In that case, the ANTIJOIN result would be empty. We could detect this at planning time and trigger the rewriting.
[jira] [Created] (HIVE-23920) Need to handle HiveJoinConstraintsRule for Anti Join
mahesh kumar behera created HIVE-23920: -- Summary: Need to handle HiveJoinConstraintsRule for Anti Join Key: HIVE-23920 URL: https://issues.apache.org/jira/browse/HIVE-23920 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera Currently in Hive we create a different operator for each kind of join. In Calcite, it all seems to be based on a single Join class in newer releases, so classes like HiveAntiJoin and HiveSemiJoin can be merged into one.
[jira] [Created] (HIVE-23919) Merge all kind of Join operator variants (Semi, Anti, Normal) into one.
mahesh kumar behera created HIVE-23919: -- Summary: Merge all kind of Join operator variants (Semi, Anti, Normal) into one. Key: HIVE-23919 URL: https://issues.apache.org/jira/browse/HIVE-23919 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera For anti join, we emit the records when the join condition is not satisfied. For the PK-FK rule, we have to explore whether this can be exploited to speed up anti join processing.
[jira] [Created] (HIVE-23907) Hash table type should be considered for calculating the Map join table size
mahesh kumar behera created HIVE-23907: -- Summary: Hash table type should be considered for calculating the Map join table size Key: HIVE-23907 URL: https://issues.apache.org/jira/browse/HIVE-23907 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera For anti join, we emit the records when the join condition is not satisfied. For the PK-FK rule, we have to explore whether this can be exploited to speed up anti join processing.
[jira] [Created] (HIVE-23906) Analyze and implement PK-FK based optimization for Anti join
mahesh kumar behera created HIVE-23906: -- Summary: Analyze and implement PK-FK based optimization for Anti join Key: HIVE-23906 URL: https://issues.apache.org/jira/browse/HIVE-23906 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera Currently Hive does not support anti join. A query needing an anti join is instead converted to a left outer join, with a null filter on the right-side join key added to get the desired result. This causes:
# Extra computation — the left outer join projects the redundant columns from the right side, and filtering is then needed to remove the redundant rows. An anti join avoids both, as it projects only the required columns and rows from the left-side table.
# Extra shuffle — with an anti join, duplicate records moved to the join node can be eliminated at the child node. This can reduce data movement significantly when the number of distinct rows (join keys) is significant.
# Extra memory usage — for a map-based anti join, a hash set is sufficient, as only the key is needed to check whether a record matches the join condition. For a left join, both the key and the non-key columns are needed, so a hash table is required.
For a query like {code:java} select wr_order_number FROM web_returns LEFT JOIN web_sales ON wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code} the number of distinct ws_order_number values in the web_sales table in a typical 10TB TPCDS setup is just 10% of the total records. So when this query is converted to an anti join, only 600 million rows are moved to the join node instead of 7 billion. In the current patch, just one conversion is done: the pattern project->filter->left-join is converted to project->anti-join. This takes care of sub-queries with a “not exists” clause; such queries are first converted to filter + left-join and then to anti join.
Queries with “not in” are not handled in the current patch. From the execution side, both merge join and map join with vectorized execution are supported for anti join.
[jira] [Created] (HIVE-23905) Remove duplicate code in vector map join execution for Anti join and Semi Join.
mahesh kumar behera created HIVE-23905: -- Summary: Remove duplicate code in vector map join execution for Anti join and Semi Join. Key: HIVE-23905 URL: https://issues.apache.org/jira/browse/HIVE-23905 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera [TestMapJoinOperator.java|https://github.com/apache/hive/pull/1147/files/ee4390223caf1816ba6c07c1245876dc3c99d1e9#diff-a96ed41dcf0566f31b90b5ac75fbf20b] should be updated to add test cases related to anti join.
[jira] [Created] (HIVE-23904) Update TestMapJoinOperator for adding anti join test cases.
mahesh kumar behera created HIVE-23904: -- Summary: Update TestMapJoinOperator for adding anti join test cases. Key: HIVE-23904 URL: https://issues.apache.org/jira/browse/HIVE-23904 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera In case of anti join, a bloom filter can be created on the left side as well ("IN (keylist right table)"). But the filter should then be "not-in" ("NOT IN (keylist right table)"), as we want to select the records from the left side that are not present on the right side. This may produce wrong results, because a bloom filter can return false positives; simply negating the filter is therefore not correct, and special handling is required for "NOT IN". [https://github.com/jmhodges/opposite_of_a_bloom_filter/]
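Why plain negation is unsound can be shown with a deliberately tiny Bloom filter; this is illustrative only, using a deterministic toy hash so the collision is reproducible.

```python
def _toy_hash(key: str, bits: int) -> int:
    """Deterministic toy hash so the collision below is reproducible."""
    return sum(key.encode()) % bits

class TinyBloom:
    """A one-hash, few-bit Bloom filter that makes collisions easy to see."""
    def __init__(self, bits=2):
        self.bits = bits
        self.mask = 0
    def add(self, key):
        self.mask |= 1 << _toy_hash(key, self.bits)
    def might_contain(self, key):
        return bool(self.mask & (1 << _toy_hash(key, self.bits)))

bf = TinyBloom(bits=2)
bf.add("a")  # sets bit 1
bf.add("b")  # sets bit 0 -> every probe now answers "maybe present"

# "zzz" was never added, yet the filter reports it as (possibly) present:
assert bf.might_contain("zzz")
# So a naive NOT IN filter ("not might_contain(key)") would wrongly drop
# left-side rows such as "zzz" that are in fact absent from the right side.
```

A Bloom filter only guarantees "definitely absent" answers, never "definitely present" ones, so NOT IN can only be used to accept rows early, not to reject them.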
[jira] [Created] (HIVE-23903) Support "not-in" for bloom filter
mahesh kumar behera created HIVE-23903: -- Summary: Support "not-in" for bloom filter Key: HIVE-23903 URL: https://issues.apache.org/jira/browse/HIVE-23903 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera Currently Hive does not support anti join. A query needing an anti join is instead converted to a left outer join, with a null filter on the right-side join key added to get the desired result. This causes:
# Extra computation — the left outer join projects the redundant columns from the right side, and filtering is then needed to remove the redundant rows. An anti join avoids both, as it projects only the required columns and rows from the left-side table.
# Extra shuffle — with an anti join, duplicate records moved to the join node can be eliminated at the child node. This can reduce data movement significantly when the number of distinct rows (join keys) is significant.
# Extra memory usage — for a map-based anti join, a hash set is sufficient, as only the key is needed to check whether a record matches the join condition. For a left join, both the key and the non-key columns are needed, so a hash table is required.
For a query like {code:java} select wr_order_number FROM web_returns LEFT JOIN web_sales ON wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code} the number of distinct ws_order_number values in the web_sales table in a typical 10TB TPCDS setup is just 10% of the total records. So when this query is converted to an anti join, only 600 million rows are moved to the join node instead of 7 billion. In the current patch, just one conversion is done: the pattern project->filter->left-join is converted to project->anti-join. This takes care of sub-queries with a “not exists” clause; such queries are first converted to filter + left-join and then to anti join. Queries with “not in” are not handled in the current patch.
>From execution side, both merge join and map join with vectorized execution >is supported for anti join. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HIVE-23716) Support Anti Join in Hive
mahesh kumar behera created HIVE-23716: -- Summary: Support Anti Join in Hive Key: HIVE-23716 URL: https://issues.apache.org/jira/browse/HIVE-23716 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera Currently Hive does not support anti join. A query needing an anti join is converted to a left outer join, and a null filter on the right-side join key is added to get the desired result. This causes:
# Extra computation — the left outer join projects redundant columns from the right side, and additional filtering is done to remove the redundant rows. This can be avoided with an anti join, which projects only the required columns and rows from the left-side table.
# Extra shuffle — with an anti join, duplicate records moved to the join node can be eliminated at the child node. This can reduce a significant amount of data movement if the number of distinct rows (join keys) is significant.
# Extra memory usage — for a map-based anti join, a hash set is sufficient, as only the key is required to check whether a record matches the join condition. A left join needs the key and the non-key columns as well, so a hash table is required.
For a query like
{code:java}
select wr_order_number FROM web_returns LEFT JOIN web_sales ON wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
the number of distinct ws_order_number values in the web_sales table in a typical 10TB TPCDS setup is just 10% of the total records. So when this query is converted to an anti join, only 600 million rows are moved to the join node instead of 7 billion. In the current patch, just one conversion is done: the pattern project->filter->left-join is converted to project->anti-join. This takes care of subqueries with a “not exists” clause; such queries are first converted to filter + left-join and then to anti join. Queries with “not in” are not handled in the current patch.
From the execution side, both merge join and map join with vectorized execution are supported for anti join.
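The memory point above (a hash set suffices for anti join, whereas the left-outer rewrite needs a hash table carrying right-side payload) can be sketched in plain Java. This is an illustrative stand-in, not Hive's actual join operators; class and method names are invented:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class AntiJoinDemo {
    // Hash-based anti join: the build side is a key-only Set, no payload columns.
    static List<Long> antiJoin(List<Long> leftKeys, List<Long> rightKeys) {
        Set<Long> right = new HashSet<>(rightKeys);       // key-only build side
        return leftKeys.stream()
                       .filter(k -> !right.contains(k))   // keep rows with no match
                       .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // wr_order_number values vs ws_order_number values, as in the example query
        List<Long> webReturns = List.of(1L, 2L, 3L, 4L);
        List<Long> webSales = List.of(2L, 4L);
        System.out.println(antiJoin(webReturns, webSales)); // [1, 3]
    }
}
```

The left-outer rewrite would instead build a `Map<Long, Row>` of the sales side and discard the matched rows afterwards, paying for payload storage the anti join never needs.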
[jira] [Created] (HIVE-22856) Hive LLAP external client not reading data from ArrowStreamReader fully
mahesh kumar behera created HIVE-22856: -- Summary: Hive LLAP external client not reading data from ArrowStreamReader fully Key: HIVE-22856 URL: https://issues.apache.org/jira/browse/HIVE-22856 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera LlapArrowBatchRecordReader returns false when ArrowStreamReader's loadNextBatch returns a column vector of length 0. But we should keep reading data until loadNextBatch itself returns false: some batches may return a zero-length column vector, and those should be ignored while waiting for the next batch.
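The fix described above amounts to a read loop that distinguishes "empty batch" from "end of stream". A self-contained sketch, with the Arrow reader stubbed as an iterator of batch row counts (the real call is ArrowStreamReader.loadNextBatch(), which returns false only at end of stream):

```java
import java.util.Iterator;
import java.util.List;

public class BatchReadDemo {
    // Stand-in for the loadNextBatch() loop: hasNext() plays the role of
    // "loadNextBatch() returned true"; next() yields that batch's row count.
    static long readAll(Iterator<Integer> batches) {
        long rows = 0;
        while (batches.hasNext()) {
            int rowCount = batches.next();
            if (rowCount == 0) {
                continue;   // empty batch: skip it and keep reading (the bug
            }               // was treating this as end-of-stream)
            rows += rowCount;
        }
        return rows;
    }

    public static void main(String[] args) {
        // An empty batch in the middle must not terminate the read.
        System.out.println(readAll(List.of(5, 0, 7).iterator())); // 12
    }
}
```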
[jira] [Created] (HIVE-22733) After disable operation log property in hive, still HS2 saving the operation log
mahesh kumar behera created HIVE-22733: -- Summary: After disable operation log property in hive, still HS2 saving the operation log Key: HIVE-22733 URL: https://issues.apache.org/jira/browse/HIVE-22733 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera There are a few issues in this area. 1. If logging is disabled using hive.server2.logging.operation.enabled, then operation logs for the queries should not be generated. But the registerLoggingContext method in LogUtils registers the logging context even if the operation log is disabled, which causes the logs to be added by the logger. The query context should be registered only if operation logging is enabled:
{code:java}
 public static void registerLoggingContext(Configuration conf) {
-  MDC.put(SESSIONID_LOG_KEY, HiveConf.getVar(conf, HiveConf.ConfVars.HIVESESSIONID));
-  MDC.put(QUERYID_LOG_KEY, HiveConf.getVar(conf, HiveConf.ConfVars.HIVEQUERYID));
   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_SERVER2_LOGGING_OPERATION_ENABLED)) {
+    MDC.put(SESSIONID_LOG_KEY, HiveConf.getVar(conf, HiveConf.ConfVars.HIVESESSIONID));
+    MDC.put(QUERYID_LOG_KEY, HiveConf.getVar(conf, HiveConf.ConfVars.HIVEQUERYID));
     MDC.put(OPERATIONLOG_LEVEL_KEY, HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_SERVER2_LOGGING_OPERATION_LEVEL));{code}
2. In case of a failed query, we close the operations, which deletes the logging context (appender and route) from the logger for that query. But if any log is added after that, the query logs are still written and a new operation log file is generated for the query. This looks like an issue with MDC clear: MDC clear is not removing the keys from the map. If remove is used instead of clear, it works fine.
[jira] [Created] (HIVE-22695) DecimalColumnVector setElement throws class cast exception if input is of type LongColumnVector
mahesh kumar behera created HIVE-22695: -- Summary: DecimalColumnVector setElement throws class cast exception if input is of type LongColumnVector Key: HIVE-22695 URL: https://issues.apache.org/jira/browse/HIVE-22695 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera Before casting the input to decimal type, the input type should be checked. For long and double types, the value should be extracted and a decimal value created from it.
[jira] [Created] (HIVE-22365) "MetaException: Couldn't acquire the DB log notification lock because we reached the maximum # of retries" during metadata scale tests
mahesh kumar behera created HIVE-22365: -- Summary: "MetaException: Couldn't acquire the DB log notification lock because we reached the maximum # of retries" during metadata scale tests Key: HIVE-22365 URL: https://issues.apache.org/jira/browse/HIVE-22365 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera The issue is caused by a leaked open transaction in ObjectStore::getPartition. If JDO throws an exception during convertToPart, the commit is not done:
{code:java}
openTransaction();
MTable table = this.getMTable(catName, dbName, tableName);
MPartition mpart = getMPartition(catName, dbName, tableName, part_vals);
Partition part = convertToPart(mpart);
commitTransaction();
{code}
Because of this, all subsequent transactions of this thread are not committed:
{code:java}
if ((openTrasactionCalls == 0) && currentTransaction.isActive()) {
  transactionStatus = TXN_STATUS.COMMITED;
  currentTransaction.commit();
}
{code}
This causes the select-for-update lock on NOTIFICATION_SEQUENCE to never be released, and all other threads fail to acquire this lock and time out. So the fix is to do the operation in a try-catch block and roll back the txn in case of failure.
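The leak mechanism above can be demonstrated with a self-contained sketch of the open-count bookkeeping (the class, fields, and methods here are simplified stand-ins for ObjectStore's, not its real code): once an exception skips a commit, the nesting counter never returns to zero and every later commit is silently skipped.

```java
import java.util.ArrayList;
import java.util.List;

class TxnStore {
    int openTransactionCalls = 0;          // nesting counter, as in ObjectStore
    List<String> committed = new ArrayList<>();

    void openTransaction() { openTransactionCalls++; }

    void commitTransaction(String work) {
        openTransactionCalls--;
        if (openTransactionCalls == 0) {
            committed.add(work);           // real commit happens only at depth 0
        }
    }

    // Buggy variant: an exception between open and commit leaks the counter.
    void getPartitionBuggy(boolean fail) {
        openTransaction();
        if (fail) throw new RuntimeException("convertToPart failed");
        commitTransaction("getPartition");
    }

    // Fixed variant: restore the counter on failure (stands in for rollback).
    void getPartitionFixed(boolean fail) {
        openTransaction();
        try {
            if (fail) throw new RuntimeException("convertToPart failed");
            commitTransaction("getPartition");
        } catch (RuntimeException e) {
            openTransactionCalls--;        // rollbackTransaction() equivalent
        }
    }
}

public class TxnLeakDemo {
    public static void main(String[] args) {
        TxnStore buggy = new TxnStore();
        try { buggy.getPartitionBuggy(true); } catch (RuntimeException ignored) { }
        buggy.getPartitionBuggy(false);    // counter is 2, commit skipped forever
        System.out.println("buggy commits=" + buggy.committed.size());

        TxnStore fixed = new TxnStore();
        fixed.getPartitionFixed(true);     // failure rolled back internally
        fixed.getPartitionFixed(false);    // later call commits normally
        System.out.println("fixed commits=" + fixed.committed.size());
    }
}
```

Prints `buggy commits=0` and `fixed commits=1`: with the leak, no later work on the thread ever commits, which is exactly why the NOTIFICATION_SEQUENCE row lock was never released.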
[jira] [Created] (HIVE-22319) Repl load fails to create partition if the dump is from old version
mahesh kumar behera created HIVE-22319: -- Summary: Repl load fails to create partition if the dump is from old version Key: HIVE-22319 URL: https://issues.apache.org/jira/browse/HIVE-22319 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera The engine field of column stats in partition descriptor needs to be initialized. Handling needs to be added for column stat events also.
[jira] [Created] (HIVE-22272) Hive embedded HS2 throws metastore exceptions from MetastoreStatsConnector thread
mahesh kumar behera created HIVE-22272: -- Summary: Hive embedded HS2 throws metastore exceptions from MetastoreStatsConnector thread Key: HIVE-22272 URL: https://issues.apache.org/jira/browse/HIVE-22272 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera The Hive config is not passed to MetastoreStatsConnector. This causes RuntimeStatsLoader to connect to the embedded HMS (even though HMS is configured to be remote) and throws metastore exceptions, as the metastore db will not have been created.
[jira] [Created] (HIVE-22234) Hive replication fails with table already exist error when replicating from old version of hive.
mahesh kumar behera created HIVE-22234: -- Summary: Hive replication fails with table already exist error when replicating from old version of hive. Key: HIVE-22234 URL: https://issues.apache.org/jira/browse/HIVE-22234 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera Hive replication from an old version where HIVE-22046 is not patched will not have the engine column set in the table column stats. This causes an "ERROR: null value in column "ENGINE" violates not-null constraint" error during create table while updating the column stats. As the column stats are updated after the create table txn is committed, the next retry by the HMS client throws a table already exists error. The ENGINE column needs to be set to a default value while importing the table if the column value is not set. Doing the column stats update and create table in the same txn can be handled in a separate Jira.
[jira] [Created] (HIVE-22197) Common Merge join throwing class cast exception
mahesh kumar behera created HIVE-22197: -- Summary: Common Merge join throwing class cast exception Key: HIVE-22197 URL: https://issues.apache.org/jira/browse/HIVE-22197 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 In DummyStoreOperator the row is cached to fix HIVE-5973. The row is copied and stored in writable format, but the object inspector is initialized to the default. So when the join operator fetches the data from the dummy store operator, it gets the OI as Long but the row as a LongWritable. This causes the class cast exception.
[jira] [Created] (HIVE-22092) Fetch failing with IllegalArgumentException: No ValidTxnList when refetch is done
mahesh kumar behera created HIVE-22092: -- Summary: Fetch failing with IllegalArgumentException: No ValidTxnList when refetch is done Key: HIVE-22092 URL: https://issues.apache.org/jira/browse/HIVE-22092 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera The fetch task is created during query compilation with the config of the driver, which has the valid txn list set. Thus the fetch task has the valid txn list set while fetching from ACID tables. But when the user does a refetch with the cursor set to the first position, the fetch task is reinitialized with the driver config (cached in the task config). By that time, the select query would have cleaned up the valid txn list from the config, so the fetch happens with the valid txn list as null. This causes an IllegalArgumentException.
[jira] [Created] (HIVE-21974) The list of table expression in the inclusion and exclusion list should be separated by '|' instead of comma.
mahesh kumar behera created HIVE-21974: -- Summary: The list of table expression in the inclusion and exclusion list should be separated by '|' instead of comma. Key: HIVE-21974 URL: https://issues.apache.org/jira/browse/HIVE-21974 Project: Hive Issue Type: Sub-task Components: repl Reporter: mahesh kumar behera Assignee: mahesh kumar behera
[jira] [Created] (HIVE-21958) The list of table expression in the inclusion and exclusion list should be separated by '|' instead of comma.
mahesh kumar behera created HIVE-21958: -- Summary: The list of table expression in the inclusion and exclusion list should be separated by '|' instead of comma. Key: HIVE-21958 URL: https://issues.apache.org/jira/browse/HIVE-21958 Project: Hive Issue Type: Sub-task Reporter: mahesh kumar behera Assignee: mahesh kumar behera Java regular expressions do not treat comma as an alternation operator. If the user wants multiple expressions in the include or exclude list, the expressions can be provided separated by the pipe ('|') character. The policy will look something like db_name.'(t1*)|(t3)'.'t100'
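The '|' vs ',' distinction can be checked directly with java.util.regex. The pattern below is an assumption about how a policy expression like '(t1*)|(t3)' would be written as a plain Java regex (using t1.* for the "t1 prefix" intent), not Hive's actual policy parser:

```java
import java.util.regex.Pattern;

public class PolicyRegexDemo {
    public static void main(String[] args) {
        // '|' is regex alternation: match tables starting with t1, or exactly t3.
        Pattern incl = Pattern.compile("(t1.*)|(t3)");
        System.out.println(incl.matcher("t100").matches()); // true
        System.out.println(incl.matcher("t3").matches());   // true
        System.out.println(incl.matcher("t42").matches());  // false

        // ',' is just a literal character, NOT "or": this requires the literal
        // text ",t3" to appear in the table name, so it matches nothing useful.
        Pattern comma = Pattern.compile("t1.*,t3");
        System.out.println(comma.matcher("t100").matches()); // false
    }
}
```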
[jira] [Created] (HIVE-21956) Add the list of table selected by dump in the dump folder.
mahesh kumar behera created HIVE-21956: -- Summary: Add the list of table selected by dump in the dump folder. Key: HIVE-21956 URL: https://issues.apache.org/jira/browse/HIVE-21956 Project: Hive Issue Type: Sub-task Reporter: mahesh kumar behera Assignee: mahesh kumar behera The list of tables selected by a dump should be kept in the dump folder as a _tables file. This will help users find out which tables were replicated, and the list can be used for Ranger and Atlas policy replication.
[jira] [Created] (HIVE-21926) CLONE - REPL - With table list - "TO" and "FROM" clause should not be allowed along with table filter list
mahesh kumar behera created HIVE-21926: -- Summary: CLONE - REPL - With table list - "TO" and "FROM" clause should not be allowed along with table filter list Key: HIVE-21926 URL: https://issues.apache.org/jira/browse/HIVE-21926 Project: Hive Issue Type: Sub-task Components: repl Reporter: mahesh kumar behera Assignee: mahesh kumar behera If some rename events are found to be dumped and replayed while the replace policy is being executed, the handling needs to take care of policy inclusion in both policies for each table name.
1. Create a list of tables to be bootstrapped.
2. During handling of alter table, if the alter type is rename:
 1. If the old table name is present in the list of tables to be bootstrapped, remove it.
 2. If the new table name matches the new policy, add it to the list of tables to be bootstrapped.
3. During handling of drop table:
 1. If the table is in the list of tables to be bootstrapped, remove it and ignore the event.
4. During other event handling:
 1. If the table is in the list of tables to be bootstrapped, ignore the event.
Rename handling during replace policy:
# Old name not matching old policy – the old table will not be there at the target cluster; it will not be returned by get-all-tables.
## Old name not matching new policy:
### New name not matching old policy:
#### New name not matching new policy:
* Ignore the event; nothing needs to be done.
#### New name matching new policy:
* The table will be returned by get-all-tables. The replace policy handler will bootstrap this table, as it matches the new policy and not the old policy.
* All future events will be ignored by the check added as part of replace policy handling.
* All events with the old table name will anyway be ignored, as the old name does not match the new policy.
### New name matching old policy:
#### New name not matching new policy:
* As the new name does not match the new policy, the table need not be replicated.
* As the old name does not match the new policy, the rename events will be ignored.
* So nothing needs to be done for this scenario.
#### New name matching new policy:
* As the new name matches both the old and the new policy, the replace handler will not bootstrap the table.
* Add the table to the list of tables to be bootstrapped.
* Ignore all the events with the new name.
* If there is a drop event for the table (with the new name), remove the table from the list of tables to be bootstrapped.
* In case of a rename event (double rename):
** If the new name satisfies the table pattern, add the new name to the list of tables to be bootstrapped and remove the old name from that list.
** If the new name does not satisfy the pattern, just remove the table name from the list of tables to be bootstrapped.
## Old name matching new policy – per the replace policy handler, which checks based on the old table, the table should be bootstrapped and the event ignored. But the rename handler should decide based on the new name. The old table name will not be returned by get-all-tables, so the replace handler will not do anything for the old table.
### New name not matching old policy:
#### New name not matching new policy:
* The old table is not there at the target and the new name does not match the new policy. Ignore the event.
* No need to add the table to the list of tables to be bootstrapped.
* All subsequent events will be ignored, as the new name does not match the new policy.
#### New name matching new policy:
* As the new name does not match the old policy but matches the new policy, the table will be bootstrapped by the replace policy handler. So the rename event need not add this table to the list of tables to be bootstrapped.
* All future events will be ignored by the replace policy handler.
* For a rename event (double rename):
** If there is a rename, the table (with the intermittent new name) will not be present, and thus the replace handler will not bootstrap the table.
** So if the new name (the latest one) matches the new policy, add it to the list of tables to be bootstrapped.
** If the new name (the latest one) does not match the new policy, just ignore the event, as the intermittent new name would not have been added to the list of tables to be bootstrapped.
### New name matching old policy:
#### New name not matching new policy:
* Dump the event. The table will be dropped by repl load at the target.
#### New name matching new policy:
* The replace handler will not bootstrap this table, as the new name matches both policies.
* As the old name does not match the old policy, the table will not be there at the target. The rename event should add the new
[jira] [Created] (HIVE-21886) REPL - With table list - Handle rename events during replace policy
mahesh kumar behera created HIVE-21886: -- Summary: REPL - With table list - Handle rename events during replace policy Key: HIVE-21886 URL: https://issues.apache.org/jira/browse/HIVE-21886 Project: Hive Issue Type: Sub-task Components: repl Reporter: mahesh kumar behera Assignee: mahesh kumar behera REPL DUMP fetches the events from the NOTIFICATION_LOG table based on a regular expression + inclusion/exclusion list. So in the case of a rename table event, the event will be ignored if the old table doesn't match the pattern, but the new table should be bootstrapped. REPL DUMP should have a mechanism to detect such tables and automatically bootstrap them with incremental replication. Also, if the renamed table is excluded from the replication policy, the old table needs to be dropped at the target as well. There are 4 scenarios that need to be handled:
# Both the new name and the old name satisfy the table name pattern filter.
## No need to do anything; the incremental event for rename takes care of the replication.
# Neither name satisfies the table name pattern filter.
## Both names are out of the scope of the policy, thus nothing needs to be done.
# The new name satisfies the pattern but the old name does not.
## The table will not be present at the target.
## The rename event handler for dump should detect this case and add the new table name to the list of tables for bootstrap.
## All events related to the table (new name) should be ignored.
## If there is a drop event for the table (with the new name), remove the table from the list of tables to be bootstrapped.
## In case of rename (double rename):
### If the new name satisfies the table pattern, add the new name to the list of tables to be bootstrapped and remove the old name from that list.
### If the new name does not satisfy the pattern, just remove the table name from the list of tables to be bootstrapped.
# The new name does not satisfy the pattern but the old name does.
## Change the rename event to a drop event.
[jira] [Created] (HIVE-21844) HMS schema Upgrade Script is failing with NPE
mahesh kumar behera created HIVE-21844: -- Summary: HMS schema Upgrade Script is failing with NPE Key: HIVE-21844 URL: https://issues.apache.org/jira/browse/HIVE-21844 Project: Hive Issue Type: Task Components: HiveServer2 Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 The schema upgrade tool is failing with an NPE while executing "SELECT 'Upgrading MetaStore schema from 1.2.0 to 2.0.0' AS ' '". The header row (metadata) comes back with rows having value null, causing a null pointer access in TableOutputFormat::getOutputString when row.values[i] is accessed. If some other value like "AS dummy" is given instead of "AS ' '", it works fine.
[jira] [Created] (HIVE-21788) Support replication from hadoop-2 (hive 3.0 and below) on-prem cluster to hadoop-3 (hive 4 and above) cloud cluster
mahesh kumar behera created HIVE-21788: -- Summary: Support replication from hadoop-2 (hive 3.0 and below) on-prem cluster to hadoop-3 (hive 4 and above) cloud cluster Key: HIVE-21788 URL: https://issues.apache.org/jira/browse/HIVE-21788 Project: Hive Issue Type: Task Components: HiveServer2, repl Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 In case of replication to the cloud, both dump and load are executed on the source cluster; this push-based replication is done to avoid computation on the target cloud cluster. If strict managed table is not set to true on the source cluster, the tables will be non-ACID. So during replication to a cluster with strict managed tables, migration logic, the same as in the upgrade tool, has to be applied to the replicated data. This migration logic is implemented only in Hive 4.0, so a Hive 4.0 instance must be started at the source cluster. If the source cluster has a Hadoop-2 installation, Hive 4 has to be built with Hadoop-2, which requires changes in the pom files and the shim files.
1. Change the pom.xml files to accept a profile for hadoop-2. If the hadoop-2 profile is set, the hadoop version should be set accordingly to hadoop-2.
2. In shims, create a new file for hadoop-2. Based on the profile, the respective file will be included in the build.
3. Change the artifactId hadoop-hdfs-client to hadoop-client, as in hadoop-2 the jars are stored under the hadoop-client folder.
[jira] [Created] (HIVE-21775) Handling partition level stat replication
mahesh kumar behera created HIVE-21775: -- Summary: Handling partition level stat replication Key: HIVE-21775 URL: https://issues.apache.org/jira/browse/HIVE-21775 Project: Hive Issue Type: Sub-task Components: HiveServer2, repl Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 Statistics for a table are maintained across all its partitions: the table-level basic stats combine the data for all partitions. When only a few partitions are replicated, the replicated stats for the table may not be correct. In the case of partition column stats, the aggregate stats from the partition stats table will not be correct either. So statistics replication cannot be supported for partition-level replication. TODO: Need to check how to handle it.
[jira] [Created] (HIVE-21774) Support partition level filtering for events with multiple partitions
mahesh kumar behera created HIVE-21774: -- Summary: Support partition level filtering for events with multiple partitions Key: HIVE-21774 URL: https://issues.apache.org/jira/browse/HIVE-21774 Project: Hive Issue Type: Sub-task Components: HiveServer2, repl Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 Some events in Hive can span multiple partitions, tables, or even databases; events related to transactions can span multiple databases. When a transaction does a write operation, it is added to the write notification log table. During the dump of a commit transaction event, all entries present in the write notification log table for that transaction are read and added to the commit transaction message. If a partition filter is supplied for the dump, only those partitions which are part of the policy should be added to the commit txn message.
* All events which are not partition level will be added to the list of events to be dumped.
* Pass the policy's filter condition to the commit transaction message handler (for events which are not partition level).
* During the dump of a commit transaction event, extract the entries added to the write notification log table and compare them with the filter condition.
* If an entry from the write notification log satisfies the filter condition, add it to the commit transaction message.
* If the filter condition is null, add all entries from the write notification log table to the commit transaction message.
* For events which do not have partition-level info, like open txn, abort txn, etc., just dump the events without any filtering. So some events not related to any satisfying partition may get replayed.
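The filtering rule above (null filter means dump everything; otherwise keep only entries whose partition satisfies the condition) can be sketched as a plain predicate filter. The types and names below are illustrative stand-ins, not Hive's actual write notification log classes:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class CommitTxnFilterDemo {
    // Stand-in for one row of the write notification log table.
    static class WriteEntry {
        final String table;
        final String partition;
        WriteEntry(String table, String partition) {
            this.table = table;
            this.partition = partition;
        }
    }

    static List<WriteEntry> filterForDump(List<WriteEntry> entries,
                                          Predicate<String> partitionFilter) {
        if (partitionFilter == null) {
            return entries;                           // no filter: dump everything
        }
        return entries.stream()
                      .filter(e -> partitionFilter.test(e.partition))
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<WriteEntry> log = List.of(
            new WriteEntry("t1", "dt=2019-01-01"),
            new WriteEntry("t1", "dt=2019-02-01"));
        // Policy covers only January partitions: one entry survives.
        System.out.println(filterForDump(log, p -> p.startsWith("dt=2019-01")).size());
        // Null filter: everything goes into the commit transaction message.
        System.out.println(filterForDump(log, null).size());
    }
}
```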
[jira] [Created] (HIVE-21773) Supporting external table replication with partition filter.
mahesh kumar behera created HIVE-21773: -- Summary: Supporting external table replication with partition filter. Key: HIVE-21773 URL: https://issues.apache.org/jira/browse/HIVE-21773 Project: Hive Issue Type: Sub-task Components: HiveServer2, repl Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 Hive external table replication is done differently from managed table replication. For external tables, a list is created of the locations of the tables and partitions to be replicated. If a partition location is within the table location, the partition location is not added to the list; for partitions with a location outside the table, it is added. In case of an incremental dump, the data-related events are ignored and just the metadata-related events are dumped; the list of locations is prepared and used for replication. During load, the events are replayed and then the distcp tasks are created, one for each location present in the list. For partition-level replication, not all partitions will be present in the dump, so even if the partition locations are within the table location, each partition location must be added to the list.
* If a where condition is present in the REPL DUMP command, add a location for each satisfying partition even though the partition location is within the table location.
* If a table is not mentioned in the where clause, follow the older behavior.
* If a table is mentioned with a key that does not match any of the partition columns, fail the repl dump.
* If the table is mentioned with the key, add a location for each partition even if all the partitions satisfy the filter condition. This avoids copying partitions which are added using alter after the dump.
[jira] [Created] (HIVE-21772) Support dynamic addition and deletion of partitions in the policy
mahesh kumar behera created HIVE-21772: -- Summary: Support dynamic addition and deletion of partitions in the policy Key: HIVE-21772 URL: https://issues.apache.org/jira/browse/HIVE-21772 Project: Hive Issue Type: Sub-task Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 If the user modifies the filter condition in the policy, the participating partitions of the policy can change. In such scenarios, the user needs to provide the old filter condition along with the REPL DUMP command.
* The old filter will be passed as a string in the ‘with’ clause of the REPL DUMP command. The AST needs to be created from the string to be used for filtering.
* Convert the string to a list of ASTs, one for each table, and make a list of the partitions satisfying the old filter condition.
* The list of partitions satisfying the new filter condition will be compared with the old list.
* If a partition is not present in the old list but is present in the new one, it will be added to the list of partitions to be bootstrapped.
* If a partition is present in the old list but not in the new one, it will be added to the list of partitions to be deleted.
* During the load operation, after all the events are replayed, the bootstrap and delete lists will be read and the corresponding actions executed at the target.
* There is a possibility that a partition to be deleted has already been deleted by some replayed event; in that case the delete will be ignored.
* Similarly, if some partition from the bootstrap list is already present, the bootstrap will be ignored.
* As a partition cannot be present in both the bootstrap and delete lists, the lists can be executed in parallel.
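The old-vs-new comparison above is a plain set difference: partitions in the new list but not the old are bootstrapped, and partitions in the old list but not the new are deleted at the target. A minimal sketch with illustrative partition names:

```java
import java.util.Set;
import java.util.TreeSet;

public class PolicyDiffDemo {
    public static void main(String[] args) {
        Set<String> oldParts = Set.of("p=1", "p=2", "p=3"); // matched old filter
        Set<String> newParts = Set.of("p=2", "p=3", "p=4"); // matches new filter

        Set<String> toBootstrap = new TreeSet<>(newParts);
        toBootstrap.removeAll(oldParts);          // in new, not in old
        Set<String> toDelete = new TreeSet<>(oldParts);
        toDelete.removeAll(newParts);             // in old, not in new

        System.out.println("bootstrap=" + toBootstrap); // bootstrap=[p=4]
        System.out.println("delete=" + toDelete);       // delete=[p=1]
    }
}
```

Since the two result sets are disjoint by construction, the bootstrap and delete actions can safely run in parallel, as the description notes.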
[jira] [Created] (HIVE-21771) Support partition filter (where clause) in REPL dump command
mahesh kumar behera created HIVE-21771: -- Summary: Support partition filter (where clause) in REPL dump command Key: HIVE-21771 URL: https://issues.apache.org/jira/browse/HIVE-21771 Project: Hive Issue Type: Sub-task Components: HiveServer2, repl Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 *Bootstrap for managed table* User should be allowed to execute REPL DUMP with where clause. The where clause should support filtering out partition from dump. Format of the where clause should be similar to *"REPL DUMP dbname from 10 where t0 where key < 10, t1* where key = 3, [t2*,t3] where key > 3".* For initial version, very basic filter condition will be supported and later the complexity will be increased as and when required. * From the AST generated for the where clause, extract the table information. * Generate AST for each table. * List the partition for each table using the AST generated for each table using the same metastore API used by select query. * During bootstrap load use the partition list to dump the partitions. * During incremental dump, use the list to filter out the event. In case of bootstrap load, all the tables of the database will be scanned and * If table is not partitioned, then it will be dumped. * If key provided in the filter condition for the table is not a partition column, then dump will fail. * If table is not mentioned in the where clause, then all partitions of the table will be dumped. * All the partitioned of the table satisfying the where clause will be dumped. *Incremental for managed table* In case of Incremental Dump, the events from the notification log will be scanned and once the partition spec is extracted from the event, the partition spec will be filtered against the condition. * If table is not partitioned then the event will be added to the dump. * If key mentioned is not a partition column, then dump will fail. 
* If the table is not mentioned in the filter, the event will be added to the dump. * If the event spans multiple partitions, the event will be added to the dump. (Filtering out redundant partitions from the message will be done as part of a separate task.) * If the partition spec matches the filter, the event will be added to the dump. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
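The per-table filtering above reduces to evaluating a simple comparison against each partition spec. A toy sketch under assumed shapes (integer-valued partition columns, only the basic operators the first version scopes; not the actual Hive classes):

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionFilter {
    // Evaluate a single "column op constant" predicate against a partition spec.
    // Only the simple operators the issue scopes for the first version.
    static boolean matches(Map<String, Integer> partSpec,
                           String column, String op, int constant) {
        Integer value = partSpec.get(column);
        if (value == null) {
            // Key is not a partition column: the issue says the dump should fail.
            throw new IllegalArgumentException(column + " is not a partition column");
        }
        switch (op) {
            case "<":  return value < constant;
            case ">":  return value > constant;
            case "=":  return value == constant;
            default:   throw new IllegalArgumentException("unsupported op " + op);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> p = new HashMap<>();
        p.put("key", 3);
        System.out.println(matches(p, "key", "<", 10)); // true: 3 < 10
        System.out.println(matches(p, "key", ">", 3));  // false: 3 > 3 fails
    }
}
```

In the real feature the comparison would run against the parsed AST and string-typed partition values; this sketch only shows the match/fail/not-a-partition-column decision the bullets describe.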
[jira] [Created] (HIVE-21770) Support extraction of replication spec from notification event.
mahesh kumar behera created HIVE-21770: -- Summary: Support extraction of replication spec from notification event. Key: HIVE-21770 URL: https://issues.apache.org/jira/browse/HIVE-21770 Project: Hive Issue Type: Sub-task Components: HiveServer2, repl Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 The notification event structure currently does not contain the partition spec. For events that can span multiple databases and tables, the database and table info cannot be obtained from the event structure. To know which partition an event was added for, the event message has to be deserialized and the partition information obtained from it. * Each event handler has to expose a static API. * The API should take the event as input and return the list of db names, table names and partition specs from it. * If the database name, table name or partition name is present in the event structure, return it directly; if all of this info is present, there is no need to deserialize the message. This will also be useful later if these fields are added to the event structure. * Otherwise, deserialize the message, build the list of names, and return it through a partition info class object. * If the table is not partitioned, or it is a table level event, set the partition info to null. The same applies for table info in case of db level events.
[jira] [Created] (HIVE-21769) Support Partition level filtering for hive replication command
mahesh kumar behera created HIVE-21769: -- Summary: Support Partition level filtering for hive replication command Key: HIVE-21769 URL: https://issues.apache.org/jira/browse/HIVE-21769 Project: Hive Issue Type: Task Reporter: mahesh kumar behera Assignee: mahesh kumar behera # The user should be able to dump and load events satisfying a filter based on the partition specification. # The partitions included in each dump are not constant and may vary between dumps. # The user should be able to modify the policy in between to include/exclude partitions. # Only simple filter operators like >, <, >=, <=, ==, and, or against constants will be supported. # Configuration – a time interval to filter out partitions if the partition specification represents time (using the ‘with’ clause in the dump command). -- Will not be supported in the first version.
[jira] [Created] (HIVE-21766) Select * returns no rows in hive bootstrap from a static or dynamic partitioned managed table with Timestamp type as partition column from on prem to WASB even though count ( * ) matches
mahesh kumar behera created HIVE-21766: -- Summary: Select * returns no rows in hive bootstrap from a static or dynamic partitioned managed table with Timestamp type as partition column from on prem to WASB even though count ( * ) matches Key: HIVE-21766 URL: https://issues.apache.org/jira/browse/HIVE-21766 Project: Hive Issue Type: Bug Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 *Cause:* REPL LOAD replicates the txn state (write ids of tables) to the target HMS (backend RDBMS). But in this case it was still connected to the source HMS, because the configs passed in the WITH clause were not stored in HiveTxnManager. We pass the config object to the ReplTxnTask objects, but HiveTxnManager was created by the Driver using the session config object. *Fix:* We need to pass the config to HiveTxnManager too, by creating a txn manager for repl txn operations with the config passed by the user.
[jira] [Created] (HIVE-21731) Hive import fails, post upgrade of source 3.0 cluster, to a target 4.0 cluster with strict managed table set to true
mahesh kumar behera created HIVE-21731: -- Summary: Hive import fails, post upgrade of source 3.0 cluster, to a target 4.0 cluster with strict managed table set to true Key: HIVE-21731 URL: https://issues.apache.org/jira/browse/HIVE-21731 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera The scenario is: # A replication policy is set up with a Hive 3.0 source cluster (strict managed table set to false) and a Hive 4.0 target cluster with strict managed table set to true. # The user upgrades the 3.0 source cluster to a 4.0 cluster using the upgrade tool. # The upgrade converts all managed tables to ACID tables. # In the next repl dump, the user sets hive.repl.dump.include.acid.tables and hive.repl.bootstrap.acid.tables to true, triggering bootstrap of the newly converted ACID tables. # As the old tables are non-txn tables, the dump does not filter the events even though bootstrap of acid tables is set to true. This causes the repl load to fail, as the write id is not set in the table object. # If we ignore the event replay, the bootstrap fails with a dump directory mismatch error. The fix should be: # Ignore dumping the alter table event if bootstrap of acid tables is set to true and the alter is converting a non-acid table to an acid table. # In case of bootstrap during incremental load, ignore the dump directory property set in the table object.
[jira] [Created] (HIVE-21722) REPL::END event log is not included in hiveStatement.getQueryLog output.
mahesh kumar behera created HIVE-21722: -- Summary: REPL::END event log is not included in hiveStatement.getQueryLog output. Key: HIVE-21722 URL: https://issues.apache.org/jira/browse/HIVE-21722 Project: Hive Issue Type: Bug Components: HiveServer2, repl Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 getQueryLog only reads logs from the Background thread scope. If parallel execution is set to true, a new thread is created for execution, and the logs added by the new thread are not added to the parent Background thread scope. In the replication scope, replStateLogTask is started in parallel mode, causing its logs to be skipped from the getQueryLog scope. There is one more issue: the conf is not passed while creating the replStateLogTask at bootstrap load end. The same issue exists with event load during incremental load; the incremental load end log task, however, is created with the proper config.
[jira] [Created] (HIVE-21717) Rename is failing for directory in move task
mahesh kumar behera created HIVE-21717: -- Summary: Rename is failing for directory in move task Key: HIVE-21717 URL: https://issues.apache.org/jira/browse/HIVE-21717 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 Rename fails with a "destination directory not empty" error when a directory is moved directly from the staging directory to the table location, as rename cannot overwrite a non-empty destination directory.
[jira] [Created] (HIVE-21712) Replication scenarios should be tested with hive.strict.managed.tables set to true
mahesh kumar behera created HIVE-21712: -- Summary: Replication scenarios should be tested with hive.strict.managed.tables set to true Key: HIVE-21712 URL: https://issues.apache.org/jira/browse/HIVE-21712 Project: Hive Issue Type: Bug Components: Hive, repl Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 In the replication test suites, in some cases the tables are created with the transactional property set to non-acid, and thus the intended tests are missed. By setting the default value of hive.strict.managed.tables to true in the replication related test suites, the tables will be created as ACID tables by default.
[jira] [Created] (HIVE-21700) hive incremental load going OOM while adding load task to the leaf nodes of the DAG
mahesh kumar behera created HIVE-21700: -- Summary: hive incremental load going OOM while adding load task to the leaf nodes of the DAG Key: HIVE-21700 URL: https://issues.apache.org/jira/browse/HIVE-21700 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 While listing the child nodes to check for leaf nodes, we need to filter out tasks which are already added to the children list. If a task is added multiple times to the children list, the list may grow exponentially.
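The fix described above amounts to tracking already-visited tasks while walking the DAG, so a child shared by many parents is queued only once. A minimal sketch with a toy Task class (the real Hive Task type is much richer):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LeafCollector {
    // A toy task node standing in for Hive's Task class.
    static class Task {
        final String name;
        final List<Task> children = new ArrayList<>();
        Task(String name) { this.name = name; }
    }

    // Collect leaf tasks, visiting each node at most once so shared
    // children do not blow up the traversal (the OOM described above).
    static List<Task> leaves(Task root) {
        List<Task> result = new ArrayList<>();
        Set<Task> visited = new HashSet<>();
        Deque<Task> stack = new ArrayDeque<>();
        stack.push(root);
        visited.add(root);
        while (!stack.isEmpty()) {
            Task t = stack.pop();
            if (t.children.isEmpty()) {
                result.add(t);
            } else {
                for (Task c : t.children) {
                    if (visited.add(c)) { // skip tasks already queued
                        stack.push(c);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Task a = new Task("a"), b = new Task("b"), leaf = new Task("leaf");
        a.children.add(b);
        a.children.add(leaf);
        b.children.add(leaf); // "leaf" is shared: queued twice without the visited set
        System.out.println(leaves(a).size()); // 1
    }
}
```

Without the `visited` set, a DAG of n levels where every node shares its children would enqueue children a multiplicative number of times, which is the exponential growth the issue describes.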
[jira] [Created] (HIVE-21694) Hive driver waiting time is fixed for task getting executed in parallel.
mahesh kumar behera created HIVE-21694: -- Summary: Hive driver waiting time is fixed for task getting executed in parallel. Key: HIVE-21694 URL: https://issues.apache.org/jira/browse/HIVE-21694 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 During command execution, the Hive driver executes a task in a separate thread if the task is set to run in parallel. After starting the task, the driver checks whether the task has finished execution. If not, it waits for 2 seconds before waking up again to check the task status. For tasks whose execution time is in milliseconds, this wait time can induce substantial overhead. So instead of a fixed wait time, an exponentially backed-off sleep time can be used to reduce the sleep overhead: the sleep time can start at 100 ms and increase up to 2 seconds, doubling on each iteration.
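The proposed backoff can be sketched as follows (assumed constant and method names, not the actual Driver code): start polling at 100 ms and double the interval each time, capped at the old fixed 2 second wait.

```java
public class BackoffSleep {
    static final long INITIAL_MS = 100;  // first poll interval
    static final long MAX_MS = 2000;     // old fixed wait becomes the cap

    // Next polling interval: double the current one, capped at MAX_MS.
    static long nextSleep(long currentMs) {
        return Math.min(currentMs * 2, MAX_MS);
    }

    public static void main(String[] args) {
        long sleepMs = INITIAL_MS;
        // Print the first few intervals the driver would wait between
        // status checks: 100, 200, 400, 800, 1600, 2000.
        for (int i = 0; i < 6; i++) {
            System.out.println(sleepMs);
            sleepMs = nextSleep(sleepMs);
        }
    }
}
```

A sub-second task is now detected within a couple hundred milliseconds instead of after a full 2 second sleep, while long tasks still settle into the same 2 second polling rate as before.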
[jira] [Created] (HIVE-21566) Support locking during ACID table replication
mahesh kumar behera created HIVE-21566: -- Summary: Support locking during ACID table replication Key: HIVE-21566 URL: https://issues.apache.org/jira/browse/HIVE-21566 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera During the load of an ACID table we need to take a lock.
[jira] [Created] (HIVE-21450) Buffer Reader is not closed during executeInitSql
mahesh kumar behera created HIVE-21450: -- Summary: Buffer Reader is not closed during executeInitSql Key: HIVE-21450 URL: https://issues.apache.org/jira/browse/HIVE-21450 Project: Hive Issue Type: Bug Components: JDBC Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 The BufferedReader should be opened within a try-with-resources block so that it is closed after execution.
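The standard fix is try-with-resources, which closes the reader even when execution throws mid-read. A minimal sketch of the pattern (hypothetical helper; the actual executeInitSql reads the init script file):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class InitSqlReader {
    // Read every line of an init script; the BufferedReader is closed
    // automatically when the try block exits, normally or exceptionally.
    static String readAll(Reader source) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } // br.close() runs here in all cases
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(readAll(new StringReader("set x=1;\nset y=2;")));
    }
}
```

Opening the reader outside the try (or with no finally/close at all), as the bug describes, leaks the underlying file handle whenever an exception escapes the read loop.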
[jira] [Created] (HIVE-21446) Hive Server going OOM during hive external table replications
mahesh kumar behera created HIVE-21446: -- Summary: Hive Server going OOM during hive external table replications Key: HIVE-21446 URL: https://issues.apache.org/jira/browse/HIVE-21446 Project: Hive Issue Type: Bug Components: repl Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 The file system objects opened using proxy users are not closed.
[jira] [Created] (HIVE-21325) Hive external table replication failed with Permission denied issue.
mahesh kumar behera created HIVE-21325: -- Summary: Hive external table replication failed with Permission denied issue. Key: HIVE-21325 URL: https://issues.apache.org/jira/browse/HIVE-21325 Project: Hive Issue Type: Bug Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 During external table replication, the file copy is done in parallel with the metadata replication. If the file copy task creates a directory with doAs set to true, the directory is created with permissions set to the user running the repl command. In that case, the metadata task may fail while creating the table, as the hive user might not have access to the created directory. The fix should be: # While creating the directory, if SQL based authentication is enabled, disable storage based authentication for the hive user. # Currently the created directory has the login user's access; it should retain the source cluster's owner, group and permissions. # For external table replication, don't create the directory during create table and add partition.
[jira] [Created] (HIVE-21314) Hive Replication not retaining the owner in the replicated table
mahesh kumar behera created HIVE-21314: -- Summary: Hive Replication not retaining the owner in the replicated table Key: HIVE-21314 URL: https://issues.apache.org/jira/browse/HIVE-21314 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera Hive replication is not retaining the owner in the replicated table: the owner of the target table is set to the user executing the load command. The user information should be read from the dump metadata and used while creating the table on the target cluster.
[jira] [Created] (HIVE-21260) Hive 3 (onprem) -> 4(onprem): Hive replication failed due to postgres sql execution issue
mahesh kumar behera created HIVE-21260: -- Summary: Hive 3 (onprem) -> 4(onprem): Hive replication failed due to postgres sql execution issue Key: HIVE-21260 URL: https://issues.apache.org/jira/browse/HIVE-21260 Project: Hive Issue Type: Bug Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: mahesh kumar behera Fix For: 4.0.0 Missing quotes in the SQL string cause a SQL execution error on Postgres. {code:java} metastore.RetryingHMSHandler (RetryingHMSHandler.java:invokeInternal(201)) - MetaException(message:Unable to update transaction database org.postgresql.util.PSQLException: ERROR: relation "database_params" does not exist Position: 25 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2284) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2003) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:200) at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:424) at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:321) at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:284) at com.zaxxer.hikari.pool.ProxyStatement.executeQuery(ProxyStatement.java:108) at com.zaxxer.hikari.pool.HikariProxyStatement.executeQuery(HikariProxyStatement.java) at org.apache.hadoop.hive.metastore.txn.TxnHandler.updateReplId(TxnHandler.java:907) at org.apache.hadoop.hive.metastore.txn.TxnHandler.commitTxn(TxnHandler.java:1023) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.commit_txn(HiveMetaStore.java:7703) at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108) at
com.sun.proxy.$Proxy39.commit_txn(Unknown Source) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$commit_txn.getResult(ThriftHiveMetastore.java:18730) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$commit_txn.getResult(ThriftHiveMetastore.java:18714) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:636) at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:631) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:631) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ){code}
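The error above stems from Postgres identifier folding: unquoted identifiers fold to lower case, so a statement naming the upper-case metastore table DATABASE_PARAMS without quotes resolves to database_params, which does not exist. A hedged illustration of the difference (hypothetical helper, not the actual TxnHandler code):

```java
public class QuoteIdent {
    // Postgres folds unquoted identifiers to lower case, so the
    // metastore's upper-case table names must be double-quoted.
    static String selectFrom(String table, boolean quoted) {
        String ident = quoted ? "\"" + table + "\"" : table;
        return "SELECT \"PARAM_VALUE\" FROM " + ident;
    }

    public static void main(String[] args) {
        // Unquoted: Postgres looks up database_params -> "relation does not exist"
        System.out.println(selectFrom("DATABASE_PARAMS", false));
        // Quoted: Postgres looks up DATABASE_PARAMS exactly as written
        System.out.println(selectFrom("DATABASE_PARAMS", true));
    }
}
```

Other backends (MySQL, Oracle) are more forgiving about case, which is why the missing quotes only surface as a failure on Postgres.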
[jira] [Created] (HIVE-21213) Acid table bootstrap replication needs to handle directory created by compaction with txn id
mahesh kumar behera created HIVE-21213: -- Summary: Acid table bootstrap replication needs to handle directory created by compaction with txn id Key: HIVE-21213 URL: https://issues.apache.org/jira/browse/HIVE-21213 Project: Hive Issue Type: Sub-task Components: Hive, HiveServer2, repl Reporter: mahesh kumar behera Assignee: mahesh kumar behera The current implementation of compaction uses the txn id in the directory name. This is used to isolate queries from reading the directory until compaction has finished. In case of replication, the directory cannot be copied as-is, because the txn list at the target may differ from the source. So conversion logic is required to create a new directory with a valid txn at the target and dump the data into the newly created directory.
[jira] [Created] (HIVE-21197) Hive Replication can add duplicate data during migration from 3.0 to 4
mahesh kumar behera created HIVE-21197: -- Summary: Hive Replication can add duplicate data during migration from 3.0 to 4 Key: HIVE-21197 URL: https://issues.apache.org/jira/browse/HIVE-21197 Project: Hive Issue Type: Task Components: repl Reporter: mahesh kumar behera Assignee: mahesh kumar behera During the bootstrap phase it may happen that files copied to the target were created by events which are not part of the bootstrap. This is because bootstrap first gets the last event id and then the file list, so if some event happens during this window, bootstrap will also include the files created by that event. The same files will then be copied again during the first incremental replication just after the bootstrap. In the normal scenario the duplicate copy does not cause any issue, as Hive allows the use of the target database only after the first incremental. But in case of migration, the files at source and target are copied to different locations (based on the write id at the target), and thus this may lead to duplicate data at the target. This can be avoided by having a check at load time for duplicate files. The check needs to be done only for the first incremental, and the search can be done in the bootstrap directory (with write id 1). If the file is already present, just skip the copy.
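The proposed load-time duplicate check can be sketched as follows (hypothetical helper using local paths for illustration; the real code deals with HDFS files under the write-id directory):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DedupCopy {
    // Copy only if the file is not already present in the target
    // directory (e.g. the bootstrap directory with write id 1);
    // returns true when a copy actually happened.
    static boolean copyIfAbsent(Path source, Path targetDir) throws IOException {
        Path target = targetDir.resolve(source.getFileName());
        if (Files.exists(target)) {
            return false; // already copied by bootstrap: skip the duplicate
        }
        Files.createDirectories(targetDir);
        Files.copy(source, target);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("data", ".orc");
        Path dir = Files.createTempDirectory("bootstrap_writeid_1");
        System.out.println(copyIfAbsent(src, dir)); // first copy succeeds
        System.out.println(copyIfAbsent(src, dir)); // duplicate is skipped
    }
}
```

Since the issue scopes the check to the first incremental only, steady-state replication pays no extra existence lookups.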
[jira] [Created] (HIVE-21063) Support statistics in cachedStore for transactional table
mahesh kumar behera created HIVE-21063: -- Summary: Support statistics in cachedStore for transactional table Key: HIVE-21063 URL: https://issues.apache.org/jira/browse/HIVE-21063 Project: Hive Issue Type: Task Reporter: mahesh kumar behera Currently, statistics for transactional tables are not stored in the cached store due to consistency issues. We need to add validation of valid write ids and generation of aggregate stats based on valid partitions.
[jira] [Created] (HIVE-21055) Replication to a target cluster with hive.strict.managed.tables enabled executing copy in serial mode
mahesh kumar behera created HIVE-21055: -- Summary: Replication to a target cluster with hive.strict.managed.tables enabled executing copy in serial mode Key: HIVE-21055 URL: https://issues.apache.org/jira/browse/HIVE-21055 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera For the repl load command, the user can specify the execution mode as part of the "with" clause. But the config for executing tasks in parallel or serial is not read from the command specific config; it is read from the hive server config. So even if the user specifies to run the tasks in parallel during the repl load command, the tasks are executed serially.
[jira] [Created] (HIVE-21023) Add test for replication to a target with hive.strict.managed.tables enabled
mahesh kumar behera created HIVE-21023: -- Summary: Add test for replication to a target with hive.strict.managed.tables enabled Key: HIVE-21023 URL: https://issues.apache.org/jira/browse/HIVE-21023 Project: Hive Issue Type: Bug Reporter: mahesh kumar behera Assignee: mahesh kumar behera The tests added are timing out in the ptest run. We need to exclude these test cases from batching and run them separately.
[jira] [Created] (HIVE-20966) Support incremental replication to a target cluster with hive.strict.managed.tables enabled.
mahesh kumar behera created HIVE-20966: -- Summary: Support incremental replication to a target cluster with hive.strict.managed.tables enabled. Key: HIVE-20966 URL: https://issues.apache.org/jira/browse/HIVE-20966 Project: Hive Issue Type: New Feature Components: repl Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: Sankar Hariappan *Requirements:* - Support Hive replication with Hive2 as master and Hive3 as slave where hive.strict.managed.tables is enabled. - The non-ACID managed tables from Hive2 should be converted to appropriate ACID or MM tables or to an external table based on Hive3 table type rules.
[jira] [Created] (HIVE-20967) CLONE - REPL DUMP to dump the default warehouse directory of source.
mahesh kumar behera created HIVE-20967: -- Summary: CLONE - REPL DUMP to dump the default warehouse directory of source. Key: HIVE-20967 URL: https://issues.apache.org/jira/browse/HIVE-20967 Project: Hive Issue Type: Sub-task Components: repl Affects Versions: 4.0.0 Reporter: mahesh kumar behera Assignee: Sankar Hariappan The default warehouse directory of the source is needed by target to detect if DB or table location is set by user or assigned by Hive. Using this information, REPL LOAD will decide to preserve the path or move data to default managed table's warehouse directory.