[jira] [Created] (HIVE-26394) Query based compaction fails for table with more than 6 columns

2022-07-14 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-26394:
--

 Summary: Query based compaction fails for table with more than 6 
columns
 Key: HIVE-26394
 URL: https://issues.apache.org/jira/browse/HIVE-26394
 Project: Hive
  Issue Type: Bug
  Components: Hive, HiveServer2
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Query-based compaction creates a temp external table whose location points to the 
location of the table being compacted, so this external table contains ACID-format 
files. When a query is run on this table, the table type is decided by reading the 
files present at the table location. As the location contains ACID-compatible files, 
the table is assumed to be an ACID table. This causes an issue while generating the 
SARG columns, as the column number does not match the schema.
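A minimal, illustrative sketch (not the actual Hive code) of the kind of file-layout 
probe that makes the temp external table look transactional: directories named 
base_*/delta_* at the table location are the ACID layout convention, so a location 
that already contains them gets classified as ACID.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AcidLayoutProbe {
  // Illustrative only: infer an "ACID-like" layout from the directory names at the table location.
  static boolean looksLikeAcidLayout(Configuration conf, Path tableLocation) throws Exception {
    FileSystem fs = tableLocation.getFileSystem(conf);
    for (FileStatus st : fs.listStatus(tableLocation)) {
      String name = st.getPath().getName();
      if (st.isDirectory() && (name.startsWith("base_") || name.startsWith("delta_"))) {
        return true;   // files written by ACID writers are present
      }
    }
    return false;
  }
}
{code}

Because the temp external table points at such a location, it passes this kind of 
check and is then read with the ACID row schema (operation, originalTransaction, 
bucket, rowId, currentTransaction, row), which is consistent with the 
ArrayIndexOutOfBoundsException: 6 in the trace below.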

 
{code:java}
Error doing query based minor compaction
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to run INSERT into 
table delta_cara_pn_tmp_compactor_clean_1656061070392_result select 
`operation`, `originalTransaction`, `bucket`, `rowId`, `currentTransaction`, 
`row` from delta_clean_1656061070392 where `originalTransaction` not in 
(749,750,766,768,779,783,796,799,818,1145,1149,1150,1158,1159,1160,1165,1166,1169,1173,1175,1176,1871,9631)
at 
org.apache.hadoop.hive.ql.DriverUtils.runOnDriver(DriverUtils.java:73)
at 
org.apache.hadoop.hive.ql.txn.compactor.QueryCompactor.runCompactionQueries(QueryCompactor.java:138)
at 
org.apache.hadoop.hive.ql.txn.compactor.MinorQueryCompactor.runCompaction(MinorQueryCompactor.java:70)
at 
org.apache.hadoop.hive.ql.txn.compactor.Worker.findNextCompactionAndExecute(Worker.java:498)
at 
org.apache.hadoop.hive.ql.txn.compactor.Worker.lambda$run$0(Worker.java:120)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: (responseCode = 2, errorMessage = FAILED: Execution Error, return 
code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, 
vertexName=Map 1, vertexId=vertex_1656061159324__1_00, diagnostics=[Task 
failed, taskId=task_1656061159324__1_00_00, diagnostics=[TaskAttempt 0 
failed, info=[Error: Error while running task ( failure ) : 
attempt_1656061159324__1_00_00_0:java.lang.RuntimeException: 
java.lang.RuntimeException: java.io.IOException: 
java.lang.ArrayIndexOutOfBoundsException: 6
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:277)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at 
org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: java.io.IOException: 
java.lang.ArrayIndexOutOfBoundsException: 6
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.(TezGroupedSplitsInputFormat.java:145)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
at 
org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:164)
at 
org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
at 
org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:706)
{code}

[jira] [Created] (HIVE-26382) Stats generation fails during CTAS for external partitioned table.

2022-07-11 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-26382:
--

 Summary: Stats generation fails during CTAS for external 
partitioned table.
 Key: HIVE-26382
 URL: https://issues.apache.org/jira/browse/HIVE-26382
 Project: Hive
  Issue Type: Bug
  Components: Hive, HiveServer2
Affects Versions: 4.0.0-alpha-1
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


As part of HIVE-25990, a manifest file is generated to list the files to be moved; 
the move task then moves the files by referring to this manifest. In the 
partitioned-table flow, the move is not done, which prevents dynamic partition 
creation because the target path is empty. As the stats task needs the partition 
information, this causes the stats task to fail.
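The resulting metastore error ("The IN list is empty!") comes from the direct-SQL 
update path being handed an empty partition list. A minimal JDBC-style sketch of 
that guard (names are illustrative, not the actual metastore code):

{code:java}
import java.util.List;
import java.util.stream.Collectors;

public class InListGuard {
  // Illustrative: an "X IN (?, ?, ...)" clause can only be built from a non-empty id list.
  static String buildInClause(String column, List<Long> ids) {
    if (ids == null || ids.isEmpty()) {
      // With no partitions created (because the move was skipped), the stats update ends up here.
      throw new IllegalArgumentException("The IN list is empty!");
    }
    String placeholders = ids.stream().map(id -> "?").collect(Collectors.joining(","));
    return column + " IN (" + placeholders + ")";
  }
}
{code}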

 
{code:java}
class="metastore.RetryingHMSHandler" level="ERROR" thread="pool-10-thread-144"] 
MetaException(message:Unable to update Column stats for  ext_par due to: The IN 
list is empty!)
 
org.apache.hadoop.hive.metastore.DirectSqlUpdateStat.updatePartitionColumnStatistics(DirectSqlUpdateStat.java:634)
 
org.apache.hadoop.hive.metastore.MetaStoreDirectSql.updatePartitionColumnStatisticsBatch(MetaStoreDirectSql.java:2803)
 
org.apache.hadoop.hive.metastore.ObjectStore.updatePartitionColumnStatisticsInBatch(ObjectStore.java:10001)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43
 java.lang.reflect.Method.invoke(Method.java:498)
 org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
com.sun.proxy.$Proxy33.updatePartitionColumnStatisticsInBatch(Unknown Source)
 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updatePartitionColStatsForOneBatch(HiveMetaStore.java:7124)
 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updatePartitionColStatsInBatch(HiveMetaStore.java:7109)
 {code}





[jira] [Created] (HIVE-26222) Native GeoSpatial Support in Hive

2022-05-11 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-26222:
--

 Summary: Native GeoSpatial Support in Hive
 Key: HIVE-26222
 URL: https://issues.apache.org/jira/browse/HIVE-26222
 Project: Hive
  Issue Type: Task
  Components: Hive, HiveServer2
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


This is an epic Jira to support GeoSpatial datatypes natively in Hive. This will 
cater to applications that query large volumes of spatial data. The support will be 
added in a phased manner. To start with, we are planning to make use of the 
framework developed by ESRI 
([https://github.com/Esri/spatial-framework-for-hadoop]). That project is not very 
active and there is no release published to Maven Central, so it is not easy to pull 
the jars directly through a pom dependency. Also, the UDFs are based on an older 
version of Hive. So we have decided to make a copy of this repo and maintain it 
inside Hive. This will make it easier to do any improvements and manage 
dependencies. As of now, data loading is done only on a binary data type; we need to 
enhance this to make it more user friendly. In the next phase, a native 
Geometry/Geography datatype will be supported, so a user can directly create a 
geometry type and operate on it. Apart from these, we can start adding support for 
different indices like quad tree and R-tree, ORC/Parquet/Iceberg support, etc.





[jira] [Created] (HIVE-26105) Show columns shows extra values if column comments contains specific Chinese character

2022-03-31 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-26105:
--

 Summary: Show columns shows extra values if column comments 
contains specific Chinese character 
 Key: HIVE-26105
 URL: https://issues.apache.org/jira/browse/HIVE-26105
 Project: Hive
  Issue Type: Bug
  Components: Hive, HiveServer2
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The issue happens because the encoded value of one of the Chinese characters 
contains the byte value of '\r' (CR). Because of this, the Hadoop line reader (used 
by the fetch task in Hive) treats whatever follows that byte as a new value, and an 
extra junk value gets displayed. The character in question is 0x540D (名): its low 
byte is 0x0D, i.e. 13, which the Hadoop line reader interprets as CR ('\r'), so an 
extra junk value appears in the output. SHOW COLUMNS does not need the comments, so 
while writing to the file only the column names should be included.

[https://github.com/apache/hadoop/blob/0fbd96a2449ec49f840d93e1c7d290c5218ef4ea/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L238]
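The byte-level collision can be reproduced in isolation with plain JDK calls; the 
snippet below only demonstrates the byte values involved (it is not Hive's fetch 
path). The UTF-16 encoding of U+540D contains the CR byte 0x0D, while its UTF-8 
encoding does not:

{code:java}
import java.nio.charset.StandardCharsets;

public class CarriageReturnCollision {
  public static void main(String[] args) {
    char c = '\u540D';                                        // 名
    System.out.printf("code point = 0x%04X, low byte = %d%n", (int) c, (int) c & 0xFF); // 13 == '\r'
    byte[] utf16 = String.valueOf(c).getBytes(StandardCharsets.UTF_16LE); // 0D 54    -> contains CR
    byte[] utf8  = String.valueOf(c).getBytes(StandardCharsets.UTF_8);    // E5 90 8D -> no CR
    for (byte b : utf16) System.out.printf("%02X ", b & 0xFF);
    System.out.println();
    for (byte b : utf8) System.out.printf("%02X ", b & 0xFF);
    System.out.println();
  }
}
{code}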

 
{code:java}
create table tbl_test  (fld0 string COMMENT  '期 ' , fld string COMMENT '期末日期', 
fld1 string COMMENT '班次名称', fld2  string COMMENT '排班人数');

show columns from tbl_test;
++
| field  |
++
| fld    |
| fld0   |
| fld1   |
| �      |
| fld2   |
++
5 rows selected (171.809 seconds)
 {code}





[jira] [Created] (HIVE-26098) Duplicate path/Jar in hive.aux.jars.path or hive.reloadable.aux.jars.path causing IllegalArgumentException

2022-03-31 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-26098:
--

 Summary: Duplicate path/Jar in hive.aux.jars.path or 
hive.reloadable.aux.jars.path causing IllegalArgumentException
 Key: HIVE-26098
 URL: https://issues.apache.org/jira/browse/HIVE-26098
 Project: Hive
  Issue Type: Bug
  Components: Hive, HiveServer2
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


hive.aux.jars.path and hive.reloadable.aux.jars.path are used for providing 
auxiliary jars needed during query processing. These jars are copied to the Tez temp 
path so that Tez jobs have access to them while processing the job. There is a 
duplicate check to avoid copying the same jar multiple times, but it assumes the jar 
is on the local file system. In reality the jar path can point anywhere, so the 
duplicate check fails when the source path is not local.
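A minimal sketch of the direction a fix could take, assuming the check simply 
resolves the FileSystem from the jar's own path instead of the local FS (the helper 
name is illustrative); the "Wrong FS" trace below comes from using the local FS for 
an hdfs:// path:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AuxJarCheck {
  // Illustrative: look the jar up on whatever file system its path actually points to.
  static FileStatus statusOf(Configuration conf, String jarPath) throws Exception {
    Path src = new Path(jarPath);             // may be hdfs://, s3a://, file://, ...
    FileSystem fs = src.getFileSystem(conf);  // pick the matching FS instead of assuming the local one
    return fs.getFileStatus(src);
  }
}
{code}
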
{code:java}
ERROR : Failed to execute tez graph.
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://localhost:53877/tmp/test_jar/identity_udf.jar, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781) 
~[hadoop-common-3.1.0.jar:?]
    at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86) 
~[hadoop-common-3.1.0.jar:?]
    at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:636)
 ~[hadoop-common-3.1.0.jar:?]
    at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:930)
 ~[hadoop-common-3.1.0.jar:?]
    at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:631)
 ~[hadoop-common-3.1.0.jar:?]
    at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454) 
~[hadoop-common-3.1.0.jar:?]
    at 
org.apache.hadoop.hive.ql.exec.tez.DagUtils.checkPreExisting(DagUtils.java:1392)
 ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeResource(DagUtils.java:1411)
 ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hadoop.hive.ql.exec.tez.DagUtils.addTempResources(DagUtils.java:1295)
 ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeTempFilesFromConf(DagUtils.java:1177)
 ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hadoop.hive.ql.exec.tez.TezSessionState.ensureLocalResources(TezSessionState.java:636)
 ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hadoop.hive.ql.exec.tez.TezSessionState.openInternal(TezSessionState.java:283)
 ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolSession.openInternal(TezSessionPoolSession.java:124)
 ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:241)
 ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hadoop.hive.ql.exec.tez.TezTask.ensureSessionHasResources(TezTask.java:448)
 ~[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:215) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:361) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:334) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:245) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:106) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:348) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:204) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:153) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:148) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185) 
[hive-exec-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:233)
 [hive-service-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hive.service.cli.operation.SQLOperation.access$500(SQLOperation.java:88)
 [hive-service-4.0.0-alpha-1.jar:4.0.0-alpha-1]
    at 
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:336)
 [hive-service-4.0.0-alpha-1.jar:4.0.0-alpha-1]
{code}

[jira] [Created] (HIVE-26017) Insert with partition value containing colon and space is creating partition having wrong value

2022-03-09 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-26017:
--

 Summary: Insert with partition value containing colon and space is 
creating partition having wrong value
 Key: HIVE-26017
 URL: https://issues.apache.org/jira/browse/HIVE-26017
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The path used for generating the dynamic partition value is obtained from the URI. 
This causes the escaped (serialized) value to be used for partition name generation, 
so wrong names are generated. The plain path value should be used, not the URI.
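The difference can be illustrated with plain java.net.URI (this is only a sketch of 
the escaping effect, not Hive's partition-name code): the escaped URI form 
percent-encodes the space, so a partition name built from it no longer matches the 
intended value.

{code:java}
import java.net.URI;
import java.net.URISyntaxException;

public class PartitionNameFromPath {
  public static void main(String[] args) throws URISyntaxException {
    String dir = "/warehouse/tbl/ts=2022-03-09 10:30:00";   // partition value with space and colons
    URI uri = new URI("hdfs", "namenode-host", dir, null);  // host name is illustrative
    System.out.println(uri.getRawPath()); // /warehouse/tbl/ts=2022-03-09%2010:30:00 (escaped form)
    System.out.println(uri.getPath());    // /warehouse/tbl/ts=2022-03-09 10:30:00   (plain path)
  }
}
{code}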





[jira] [Created] (HIVE-25877) Load table from concurrent thread causes FileNotFoundException

2022-01-19 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25877:
--

 Summary: Load table from concurrent thread causes 
FileNotFoundException
 Key: HIVE-25877
 URL: https://issues.apache.org/jira/browse/HIVE-25877
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


As part of the direct insert optimisation (the same issue exists for MM tables even 
without the direct insert optimisation), the files produced by Tez jobs are moved to 
the table directory for ACID tables and then duplicate removal is done. Each session 
scans through the table and cleans up the files belonging to that specific session, 
but the iterator is created over all the files. So a FileNotFoundException is thrown 
when multiple sessions act on the same table and the first session cleans up its 
data while it is being read by the second session.
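A minimal sketch of a listing that tolerates concurrent cleanup, under the 
assumption that a directory or file vanishing mid-iteration should be treated as 
already cleaned rather than as an error (a pattern sketch, not the actual fix):

{code:java}
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class TolerantListing {
  static List<Path> listExisting(FileSystem fs, Path dir) throws Exception {
    List<Path> result = new ArrayList<>();
    try {
      RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
      while (it.hasNext()) {
        result.add(it.next().getPath());
      }
    } catch (FileNotFoundException fnfe) {
      // A concurrent session already cleaned this path up; treat it as empty instead of failing.
    }
    return result;
  }
}
{code}
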
{code:java}
Caused by: java.io.FileNotFoundException: File 
hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/_tmp.delta_981_981_
 does not exist.
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidatesRecursive(Utilities.java:4447)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidates(Utilities.java:4413)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2816)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] {code}
 
{code:java}
Caused by: java.io.FileNotFoundException: File 
hdfs://mbehera-1.mbehera.root.hwx.site:8020/warehouse/tablespace/managed/hive/tbl4/.hive-staging_hive_2022-01-19_05-18-38_933_1683918321120508074-54
 does not exist.
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1275)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1249)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1194)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1190)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 ~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1208)
 ~[hadoop-hdfs-client-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2144) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2332) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2309) 
~[hadoop-common-3.1.1.7.2.14.0-117.jar:?]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidatesRecursive(Utilities.java:4447)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getDirectInsertDirectoryCandidates(Utilities.java:4413)
 ~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT]
        at 
org.apache.hadoop.hive.ql.exec.Utilities.getFullDPSpecs(Utilities.java:2971) 
~[hive-exec-3.1.3000.7.2.14.0-117.jar:3.1.3000.7.2.14.0-SNAPSHOT] {code}




[jira] [Created] (HIVE-25868) AcidHouseKeeperService fails to purgeCompactionHistory if the entries in the COMPLETED_COMPACTIONS table exceed the backend DB limit

2022-01-16 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25868:
--

 Summary: AcidHouseKeeperService fails to purgeCompactionHistory if 
the entries in the COMPLETED_COMPACTIONS table exceed the backend DB limit
 Key: HIVE-25868
 URL: https://issues.apache.org/jira/browse/HIVE-25868
 Project: Hive
  Issue Type: Bug
  Components: Hive, Metastore, Standalone Metastore
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


To purge the entries, a prepared statement is created. If the number of entries in 
the prepared statement goes beyond the limit of the backend DB (for Postgres it is 
around 32k), the operation fails.
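A minimal JDBC sketch of one way around this, assuming the purge boils down to a 
DELETE with an IN list over COMPLETED_COMPACTIONS (the chunking helper itself is 
illustrative):

{code:java}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class ChunkedPurge {
  // Delete in fixed-size chunks so a single prepared statement never exceeds the
  // backend's bind-parameter limit (around 32k for Postgres).
  static void purge(Connection conn, List<Long> ids, int maxParams) throws Exception {
    for (int from = 0; from < ids.size(); from += maxParams) {
      List<Long> chunk = ids.subList(from, Math.min(from + maxParams, ids.size()));
      StringBuilder sql = new StringBuilder("DELETE FROM COMPLETED_COMPACTIONS WHERE CC_ID IN (");
      for (int i = 0; i < chunk.size(); i++) {
        sql.append(i == 0 ? "?" : ",?");
      }
      sql.append(")");
      try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
        for (int i = 0; i < chunk.size(); i++) {
          ps.setLong(i + 1, chunk.get(i));
        }
        ps.executeUpdate();
      }
    }
  }
}
{code}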





[jira] [Created] (HIVE-25864) Hive query optimisation creates wrong plan for predicate pushdown with windowing function

2022-01-12 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25864:
--

 Summary: Hive query optimisation creates wrong plan for predicate 
pushdown with windowing function 
 Key: HIVE-25864
 URL: https://issues.apache.org/jira/browse/HIVE-25864
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


In case of a query with a windowing function, the deterministic predicates are 
pushed down below the window function. Before pushing down, the predicate is 
converted to refer to the project operator values. But the same conversion is done 
again while creating the project, causing wrong plan generation.





[jira] [Created] (HIVE-25808) Analyse table does not fail for non existing partitions

2021-12-14 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25808:
--

 Summary: Analyse table does not fail for non existing partitions
 Key: HIVE-25808
 URL: https://issues.apache.org/jira/browse/HIVE-25808
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera


If all the partition column values are given in the analyze command and the 
partition does not exist, the query fails. But if not all partition column values 
are given, it does not fail.

analyze table tbl partition *(fld1 = 2, fld2 = 3)* COMPUTE STATISTICS FOR COLUMNS – 
this will fail with a SemanticException if the partition corresponding to fld1 = 2, 
fld2 = 3 does not exist. But analyze table tbl partition *(fld1 = 2)* COMPUTE 
STATISTICS FOR COLUMNS will not fail and will compute stats for the whole table.

 





[jira] [Created] (HIVE-25778) Hive DB creation is failing when MANAGEDLOCATION is specified with existing location

2021-12-06 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25778:
--

 Summary: Hive DB creation is failing when MANAGEDLOCATION is 
specified with existing location
 Key: HIVE-25778
 URL: https://issues.apache.org/jira/browse/HIVE-25778
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2, Metastore
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


As part of HIVE-23387, a check was added to restrict the user from creating a 
database with a managed table location if the location is already present. This was 
not the behaviour earlier. As this causes a backward compatibility issue, the check 
needs to be removed.

 
{code:java}
if (madeManagedDir) {
  LOG.info("Created database path in managed directory " + dbMgdPath);
} else {
  throw new MetaException(
  "Unable to create database managed directory " + dbMgdPath + ", failed to 
create database " + db.getName());
}  {code}
 





[jira] [Created] (HIVE-25638) Select returns the deleted records in Hive ACID table

2021-10-24 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25638:
--

 Summary: Select returns the deleted records in Hive ACID table
 Key: HIVE-25638
 URL: https://issues.apache.org/jira/browse/HIVE-25638
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Hive stores the stripe stats in the ORC files. During select, these stats are used 
to create the SARG, which reduces the records read from the delete-delta files. 
Currently, when the number of stripes is more than one, the generated SARG is 
incorrect because it uses the first stripe index for both the min and the max of the 
key interval. The max of the key interval should be obtained from the last stripe 
index.
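A simplified sketch of the interval construction described above; the real key is 
the ROW__ID (originalTransaction, bucket, rowId) struct, reduced to a single long 
here purely for illustration:

{code:java}
import java.util.List;

public class KeyIntervalFromStripes {
  // Stand-in for per-stripe key statistics.
  record StripeKeyStats(long minKey, long maxKey) {}

  // The key interval for the whole file must span the FIRST stripe's min and the LAST stripe's max.
  static long[] keyInterval(List<StripeKeyStats> stripes) {
    long min = stripes.get(0).minKey();                    // lower bound from the first stripe
    long max = stripes.get(stripes.size() - 1).maxKey();   // upper bound from the last stripe (the bug used index 0 here)
    return new long[] { min, max };
  }
}
{code}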





[jira] [Created] (HIVE-25540) Enable batch updation of column stats only for MySql and Postgres

2021-09-20 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25540:
--

 Summary: Enable batch updation of column stats only for MySql and 
Postgres 
 Key: HIVE-25540
 URL: https://issues.apache.org/jira/browse/HIVE-25540
 Project: Hive
  Issue Type: Sub-task
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The batch update of partition column stats using direct SQL is tested only for 
MySQL and Postgres.





[jira] [Created] (HIVE-25527) LLAP Scheduler task exits with fatal error if the executor node is down.

2021-09-15 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25527:
--

 Summary: LLAP Scheduler task exits with fatal error if the 
executor node is down.
 Key: HIVE-25527
 URL: https://issues.apache.org/jira/browse/HIVE-25527
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


In case the executor host has gone down, activeInstances will be updated with 
null. So we need to check for empty/null values before accessing it.





[jira] [Created] (HIVE-25438) Update partition column stats fails with invalid syntax error for MySql

2021-08-08 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25438:
--

 Summary: Update partition column stats fails with invalid syntax 
error for MySql
 Key: HIVE-25438
 URL: https://issues.apache.org/jira/browse/HIVE-25438
 Project: Hive
  Issue Type: Sub-task
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The quotes used in the generated SQL are not supported by MySQL if ANSI_QUOTES is not set.





[jira] [Created] (HIVE-25432) Support Join reordering for null safe equality operator.

2021-08-05 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25432:
--

 Summary: Support Join reordering for null safe equality operator.
 Key: HIVE-25432
 URL: https://issues.apache.org/jira/browse/HIVE-25432
 Project: Hive
  Issue Type: Sub-task
  Components: Hive, HiveServer2
Reporter: mahesh kumar behera


Support Join reordering for null safe equality operator.





[jira] [Created] (HIVE-25431) Enable CBO for null safe equality operator.

2021-08-05 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25431:
--

 Summary: Enable CBO for null safe equality operator.
 Key: HIVE-25431
 URL: https://issues.apache.org/jira/browse/HIVE-25431
 Project: Hive
  Issue Type: Bug
  Components: Hive, HiveServer2
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


CBO is disabled for the null safe equality (<=>) operator. This causes suboptimal 
join execution for some queries. As null safe equality is supported by joins, CBO 
can be enabled for it. There will still be issues with join reordering, as Hive does 
not support join reordering for the null safe equality operator, but with CBO 
enabled the join plan will be better.





[jira] [Created] (HIVE-25417) Null bit vector is not handled while getting the stats for Postgres backend

2021-08-02 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25417:
--

 Summary: Null bit vector is not handled while getting the stats 
for Postgres backend
 Key: HIVE-25417
 URL: https://issues.apache.org/jira/browse/HIVE-25417
 Project: Hive
  Issue Type: Sub-task
  Components: HiveServer2
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


While adding stats with a null bit vector, a special string "HL" is stored because 
Postgres does not support null values for byte columns. But while getting the stats, 
the conversion back to null is not done. This causes a failure during 
deserialisation of the bit vector field if the existing stats are used for a merge.
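A minimal sketch of the sentinel handling described above (method names are 
illustrative, not the actual metastore code). The missing read-side conversion is 
what hands the two bytes "HL" to the HyperLogLog deserializer and triggers the 
failure shown below:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BitVectorSentinel {
  // Postgres rejects null byte-array stat values, so a sentinel is stored instead.
  private static final byte[] NULL_SENTINEL = "HL".getBytes(StandardCharsets.UTF_8);

  static byte[] toStored(byte[] bitVector) {
    return bitVector == null ? NULL_SENTINEL : bitVector;
  }

  // The read path has to map the sentinel back to null before deserialisation.
  static byte[] fromStored(byte[] stored) {
    return (stored == null || Arrays.equals(stored, NULL_SENTINEL)) ? null : stored;
  }
}
{code}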

 
{code:java}
 The input stream is not a HyperLogLog stream.  7276-1 instead of 727676 or 
7077^Mat 
org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.checkMagicString(HyperLogLogUtils.java:349)^M
 at 
org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:139)^M
   at 
org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:213)^M
   at 
org.apache.hadoop.hive.common.ndv.hll.HyperLogLogUtils.deserializeHLL(HyperLogLogUtils.java:227)^M
   at 
org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimatorFactory.getNumDistinctValueEstimator(NumDistinctValueEstimatorFactory.java:53)^M
  at 
org.apache.hadoop.hive.metastore.columnstats.cache.LongColumnStatsDataInspector.updateNdvEstimator(LongColumnStatsDataInspector.java:124)^M
  at 
org.apache.hadoop.hive.metastore.columnstats.cache.LongColumnStatsDataInspector.getNdvEstimator(LongColumnStatsDataInspector.java:107)^M
 at 
org.apache.hadoop.hive.metastore.columnstats.merge.LongColumnStatsMerger.merge(LongColumnStatsMerger.java:36)^M
  at 
org.apache.hadoop.hive.metastore.utils.MetaStoreUtils.mergeColStats(MetaStoreUtils.java:1174)^M
  at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.updateTableColumnStatsWithMerge(HiveMetaStore.java:8934)^M
 at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.set_aggr_stats_for(HiveMetaStore.java:8800)^M
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)^M
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)^M 
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)^M
  at java.lang.reflect.Method.invoke(Method.java:498)^M   at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:160)^M
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:121)^M
at com.sun.proxy.$Proxy35.set_aggr_stats_for(Unknown Source)^M  at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$set_aggr_stats_for.getResult(ThriftHiveMetastore.java:20489)^M
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$set_aggr_stats_for.getResult(ThriftHiveMetastore.java:20473)^M
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)^M at 
org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)^M   at 
org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:643)^M
   at 
org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:638)^M
   at java.security.AccessController.doPrivileged(Native Method)^M at 
javax.security.auth.Subject.doAs(Subject.java:422)^M at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)^M
   at 
org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:638)^M
 at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)^M
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)^M
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)^M
at java.lang.Thread.run(Thread.java:748) {code}





[jira] [Created] (HIVE-25373) Modify buildColumnStatsDesc to send configured number of stats for updation

2021-07-22 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25373:
--

 Summary: Modify buildColumnStatsDesc to send configured number of 
stats for updation
 Key: HIVE-25373
 URL: https://issues.apache.org/jira/browse/HIVE-25373
 Project: Hive
  Issue Type: Sub-task
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The number of stats sent for update should be controlled to avoid a Thrift error in 
case the size exceeds the limit.





[jira] [Created] (HIVE-25342) Optimize set_aggr_stats_for for mergeColStats path.

2021-07-18 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25342:
--

 Summary: Optimize set_aggr_stats_for for mergeColStats path. 
 Key: HIVE-25342
 URL: https://issues.apache.org/jira/browse/HIVE-25342
 Project: Hive
  Issue Type: Sub-task
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The optimisation used in the normal path to use direct SQL can also be used for the 
mergeColStats path. The stats to be updated can be accumulated in a temp list, and 
that list can be used to update the stats in a batch.
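A minimal sketch of that accumulation pattern, assuming the batch is flushed through 
one direct-SQL statement per chunk (all names are illustrative):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchedStatsUpdater<T> {
  private final List<T> pending = new ArrayList<>();
  private final int batchSize;
  private final Consumer<List<T>> flusher;   // e.g. one direct-SQL update per batch

  BatchedStatsUpdater(int batchSize, Consumer<List<T>> flusher) {
    this.batchSize = batchSize;
    this.flusher = flusher;
  }

  void add(T colStats) {
    pending.add(colStats);
    if (pending.size() >= batchSize) {
      flush();
    }
  }

  void flush() {
    if (!pending.isEmpty()) {
      flusher.accept(new ArrayList<>(pending));   // one round trip for the whole batch
      pending.clear();
    }
  }
}
{code}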





[jira] [Created] (HIVE-25251) Reduce overhead of adding partitions during batch loading of partitions.

2021-06-15 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25251:
--

 Summary: Reduce overhead of adding partitions during batch loading 
of partitions.
 Key: HIVE-25251
 URL: https://issues.apache.org/jira/browse/HIVE-25251
 Project: Hive
  Issue Type: Sub-task
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The add-partitions call to HMS executes the DataNucleus calls serially to add the 
partitions to the backend DB. This can be further optimised by batching those SQL 
statements.





[jira] [Created] (HIVE-25225) Update column stat throws NPE if direct sql is disabled

2021-06-09 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25225:
--

 Summary: Update column stat throws NPE if direct sql is disabled
 Key: HIVE-25225
 URL: https://issues.apache.org/jira/browse/HIVE-25225
 Project: Hive
  Issue Type: Sub-task
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


In case direct SQL is disabled, the MetaStoreDirectSql object is not initialised, 
and that causes the NPE.





[jira] [Created] (HIVE-25205) Reduce overhead of adding write notification log during batch loading of partition.

2021-06-06 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25205:
--

 Summary: Reduce overhead of adding write notification log during 
batch loading of partition.
 Key: HIVE-25205
 URL: https://issues.apache.org/jira/browse/HIVE-25205
 Project: Hive
  Issue Type: Sub-task
  Components: Hive, HiveServer2
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


During batch loading of partitions, the write notification logs are added for each 
partition individually. This delays execution because a call to HMS is made for each 
partition. This can be optimised by adding a new API in HMS that accepts a batch of 
partitions, so the batch can be added together to the backend database. Once we have 
a batch of notification logs on the HMS side, the code can be optimised to add the 
logs with a single call to the backend RDBMS.





[jira] [Created] (HIVE-25204) Reduce overhead of adding notification log for update partition column statistics

2021-06-06 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25204:
--

 Summary: Reduce overhead of adding notification log for update 
partition column statistics
 Key: HIVE-25204
 URL: https://issues.apache.org/jira/browse/HIVE-25204
 Project: Hive
  Issue Type: Sub-task
  Components: Hive, HiveServer2
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The notification logs for partition column statistics can be optimised by adding 
them in a batch. In the current implementation they are added one by one, causing 
multiple SQL executions in the backend RDBMS. These SQL executions can be batched to 
reduce the execution time.





[jira] [Created] (HIVE-25181) Analyse and optimise execution time for batch loading of partitions.

2021-05-31 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25181:
--

 Summary: Analyse and optimise execution time for batch loading of 
partitions.
 Key: HIVE-25181
 URL: https://issues.apache.org/jira/browse/HIVE-25181
 Project: Hive
  Issue Type: Task
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


When partitions are loaded in batches of more than 10k, the execution time runs into 
hours. This may be an issue for ETL-type workloads. This task is to track the issues 
and fix them.





[jira] [Created] (HIVE-25142) Rehashing in map join fast hash table causing corruption for large keys

2021-05-19 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25142:
--

 Summary: Rehashing in map join fast hash table  causing corruption 
for large keys
 Key: HIVE-25142
 URL: https://issues.apache.org/jira/browse/HIVE-25142
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


In map join, the hash table is created using the keys. To support rehashing, the 
keys are stored in a write buffer, and the hash table contains the offset of each 
key along with its hash code. When rehashing is done, the offset is extracted from 
the hash table and the hash code is generated again. For large keys of size greater 
than 255, the key length is also stored along with the key. In the fast hash table 
implementation, the way the key is extracted does not account for this, so the wrong 
key is extracted and a wrong hash code is generated. This corrupts the hash table.
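A simplified sketch of the large-key read-back, under an assumed buffer layout (a 
length prefix written in front of keys longer than 255 bytes); the actual 
write-buffer format in Hive differs, this only illustrates why the prefix must be 
skipped during rehashing:

{code:java}
import java.nio.ByteBuffer;

public class KeyBufferSketch {
  // Assumed layout: a slot length of 255 marks a "large" key whose real length
  // is stored as a 4-byte prefix in front of the key bytes.
  static byte[] readKey(ByteBuffer writeBuffer, int offset, int slotLength) {
    int position = offset;
    int length = slotLength;
    if (slotLength == 255) {
      length = writeBuffer.getInt(position);   // read the real length...
      position += Integer.BYTES;               // ...and skip past the prefix before reading the key
    }
    byte[] key = new byte[length];
    writeBuffer.position(position);
    writeBuffer.get(key);
    return key;   // rehashing must recompute the hash from exactly these bytes
  }
}
{code}

If the prefix is not skipped (or the length is misread), rehashing computes the hash 
over the wrong bytes, which is the corruption described above.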





[jira] [Created] (HIVE-25042) Add support for map data type in Common merge join and SMB Join

2021-04-20 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-25042:
--

 Summary: Add support for map data type in Common merge join and 
SMB Join
 Key: HIVE-25042
 URL: https://issues.apache.org/jira/browse/HIVE-25042
 Project: Hive
  Issue Type: Sub-task
  Components: Hive, HiveServer2
Reporter: mahesh kumar behera


Merge join results depend on the underlying sorter used by the mapper task, as we 
need to judge the direction after each key comparison. So the comparison done during 
the join has to match the way the records are sorted by the mapper. As per the 
sorter used by the mapper task (PipelinedSorter), hash maps with the same key-value 
pairs in a different order are not equal, so the merge join behaves the same way, 
but map join treats them as equal. We have to modify the pipelined sorter code to 
handle the map datatype, and then add support for map types in the join code.





[jira] [Created] (HIVE-24996) Conversion of PIG script with multiple store causing the merging of multiple sql statements

2021-04-09 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24996:
--

 Summary: Conversion of PIG script with multiple store causing the 
merging of multiple sql statements
 Key: HIVE-24996
 URL: https://issues.apache.org/jira/browse/HIVE-24996
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The SQL writer is not reset after a SQL statement is converted. This causes the next 
SQL statements to be merged with the previous one.





[jira] [Created] (HIVE-24995) Add support for complex type operator in Join with non equality condition

2021-04-09 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24995:
--

 Summary: Add support for complex type operator in Join with non 
equality condition 
 Key: HIVE-24995
 URL: https://issues.apache.org/jira/browse/HIVE-24995
 Project: Hive
  Issue Type: Sub-task
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


This subtask is specifically to support non-equality comparisons like greater than, 
less than, etc. as join conditions.





[jira] [Created] (HIVE-24989) Support vectorisation of join with key columns of complex types

2021-04-07 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24989:
--

 Summary: Support vectorisation of join with key columns of complex 
types
 Key: HIVE-24989
 URL: https://issues.apache.org/jira/browse/HIVE-24989
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


Hive fails to execute joins on array type columns as the comparison functions 
are not able to handle array type columns.   





[jira] [Created] (HIVE-24988) Add support for complex types columns for Dynamic Partition pruning Optimisation

2021-04-07 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24988:
--

 Summary: Add support for complex types columns for Dynamic 
Partition pruning Optimisation
 Key: HIVE-24988
 URL: https://issues.apache.org/jira/browse/HIVE-24988
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


Hive fails to execute joins on array type columns as the comparison functions 
are not able to handle array type columns.   





[jira] [Created] (HIVE-24977) Query compilation failing with NPE during reduce sink deduplication

2021-04-05 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24977:
--

 Summary: Query compilation failing with NPE during reduce sink 
deduplication
 Key: HIVE-24977
 URL: https://issues.apache.org/jira/browse/HIVE-24977
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


During reduce sink deduplication, if some columns from the RS cannot be backtracked 
to a terminal operator, null is returned. The null check is present in some cases 
but missing in others.

 





[jira] [Created] (HIVE-24883) Add support for array type columns in Hive Joins

2021-03-13 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24883:
--

 Summary: Add support for array type columns in Hive Joins
 Key: HIVE-24883
 URL: https://issues.apache.org/jira/browse/HIVE-24883
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


Hive fails to execute joins on array type columns as the comparison functions 
are not able to handle array type columns.   





[jira] [Created] (HIVE-24589) Drop catalog failing with deadlock error for Oracle backend dbms.

2021-01-05 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24589:
--

 Summary: Drop catalog failing with deadlock error for Oracle 
backend dbms.
 Key: HIVE-24589
 URL: https://issues.apache.org/jira/browse/HIVE-24589
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


When we do a drop catalog, we delete the catalog from the CTLGS table. The DBS table 
has a foreign key reference on CTLGS for CTLG_NAME. This causes the DBS table to be 
locked exclusively and leads to deadlocks. This can be avoided by creating an index 
on CTLG_NAME in the DBS table.
{code:java}
CREATE INDEX CTLG_NAME_DBS ON DBS(CTLG_NAME); {code}
{code:java}
 Oracle Database maximizes the concurrency control of parent keys in relation 
to dependent foreign keys. Locking behaviour depends on whether foreign key 
columns are indexed. If foreign keys are not indexed, then the child table will 
probably be locked more frequently, deadlocks will occur, and concurrency will 
be decreased. For this reason foreign keys should almost always be indexed. The 
only exception is when the matching unique or primary key is never updated or 
deleted.{code}
 





[jira] [Created] (HIVE-24580) Add support for combiner in hash mode group aggregation (Support for distinct)

2021-01-03 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24580:
--

 Summary: Add support for combiner in hash mode group aggregation 
(Support for distinct)
 Key: HIVE-24580
 URL: https://issues.apache.org/jira/browse/HIVE-24580
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


In map-side group aggregation, a partial grouped aggregation is calculated to reduce 
the data written to disk by the map task. In case of hash aggregation, where the 
input data is not sorted, a hash table is used (with sorting also performed before 
flushing). If the hash table size increases beyond a configurable limit, the data is 
flushed to disk and a new hash table is generated. If the reduction achieved by the 
hash table is less than the minimum hash aggregation reduction calculated at compile 
time, the map-side aggregation is converted to streaming mode. So if the first few 
batches of records do not result in significant reduction, the mode is switched to 
streaming mode. This may impact performance if the subsequent batches of records 
have fewer distinct values.

To improve performance in both hash and streaming mode, a combiner can be added to 
the map task after the keys are sorted. This will make sure that the aggregation is 
done where possible and reduce the data written to disk.
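A minimal sketch of the switch-to-streaming check described above; the ratio and its 
threshold correspond to the hive.map.aggr.hash.min.reduction style of configuration, 
but the method itself is illustrative:

{code:java}
public class HashAggSwitchSketch {
  // Decide whether map-side hash aggregation should fall back to streaming mode.
  static boolean switchToStreaming(long rowsIn, long distinctKeysInHashTable, float minReductionRatio) {
    float observedRatio = rowsIn == 0 ? 1.0f : (float) distinctKeysInHashTable / rowsIn;
    return observedRatio > minReductionRatio;   // too little reduction observed -> stream rows out
  }
}
{code}

A combiner after the sort gives a second chance to aggregate even when this check 
has already flipped the operator to streaming mode.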





[jira] [Created] (HIVE-24503) Optimize vector row serde to avoid type check at run time

2020-12-08 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24503:
--

 Summary: Optimize vector row serde to avoid type check at run time 
 Key: HIVE-24503
 URL: https://issues.apache.org/jira/browse/HIVE-24503
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Serialization/deserialization of vectorized batches in VectorSerializeRow and 
VectorDeserializeRow does a type check for each column of each row. This becomes 
very costly when there are billions of rows to read/write. This can be optimized if 
the type check is done at init time and specific reader/writer classes are created.
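A minimal sketch of that specialisation, assuming a per-column writer is resolved 
once at init so the per-row loop never switches on the type again (interface and 
type names are illustrative):

{code:java}
import java.util.List;

public class ColumnWriterSpecialization {
  interface ColumnWriter { void write(Object value); }

  static ColumnWriter[] buildWriters(List<String> columnTypes) {
    ColumnWriter[] writers = new ColumnWriter[columnTypes.size()];
    for (int i = 0; i < columnTypes.size(); i++) {
      switch (columnTypes.get(i)) {               // type check happens once per column, at init
        case "bigint": writers[i] = v -> writeLong((Long) v); break;
        case "string": writers[i] = v -> writeBytes(((String) v).getBytes()); break;
        default: throw new IllegalArgumentException("unhandled type " + columnTypes.get(i));
      }
    }
    return writers;   // the per-row loop just calls writers[i].write(value)
  }

  static void writeLong(long v) { /* serialize a long */ }
  static void writeBytes(byte[] v) { /* serialize bytes */ }
}
{code}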





[jira] [Created] (HIVE-24471) Add support for combiner in hash mode group aggregation

2020-12-02 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24471:
--

 Summary: Add support for combiner in hash mode group aggregation 
 Key: HIVE-24471
 URL: https://issues.apache.org/jira/browse/HIVE-24471
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


In map-side group aggregation, a partial grouped aggregation is calculated to reduce 
the data written to disk by the map task. In case of hash aggregation, where the 
input data is not sorted, a hash table is used. If the hash table size increases 
beyond a configurable limit, the data is flushed to disk and a new hash table is 
generated. If the reduction achieved by the hash table is less than the minimum hash 
aggregation reduction calculated at compile time, the map-side aggregation is 
converted to streaming mode. So if the first few batches of records do not result in 
significant reduction, the mode is switched to streaming mode. This may impact 
performance if the subsequent batches of records have fewer distinct values. To 
mitigate this situation, a combiner can be added to the map task after the keys are 
sorted. This will make sure that the aggregation is done where possible and reduce 
the data written to disk.





[jira] [Created] (HIVE-24378) Leading and trailing spaces are not removed before decimal conversion

2020-11-12 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24378:
--

 Summary: Leading and trailing spaces are not removed before 
decimal conversion
 Key: HIVE-24378
 URL: https://issues.apache.org/jira/browse/HIVE-24378
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The decimal conversion does not take care of removing the extra spaces in some 
scenarios; because of this, the numbers are getting converted to null.
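Hive's conversion path is its own code, but the effect of unstripped padding can be 
illustrated with plain java.math:

{code:java}
import java.math.BigDecimal;

public class DecimalParseSketch {
  public static void main(String[] args) {
    String raw = "  123.45  ";
    try {
      new BigDecimal(raw);                            // NumberFormatException: spaces are not accepted
    } catch (NumberFormatException e) {
      System.out.println("unparsed -> surfaces as NULL");
    }
    System.out.println(new BigDecimal(raw.trim()));   // 123.45 once the padding is stripped
  }
}
{code}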





[jira] [Created] (HIVE-24373) Wrong predicate is pushed down for view with constant value projection.

2020-11-11 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24373:
--

 Summary: Wrong predicate is pushed down for view with constant 
value projection.
 Key: HIVE-24373
 URL: https://issues.apache.org/jira/browse/HIVE-24373
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


For the query below, the predicate pushed down for one of the table scans is not 
correct.

 
{code:java}
set hive.explain.user=false;
set hive.cbo.enable=false;
set hive.optimize.ppd=true;

DROP TABLE arc;

CREATE table arc(`dt_from` string, `dt_to` string);
CREATE table loc1(`dt_from` string, `dt_to` string);

CREATE
 VIEW view AS
 SELECT
'' as DT_FROM,
uuid() as DT_TO
 FROM
   loc1
 UNION ALL
 SELECT
dt_from as DT_FROM,
uuid() as DT_TO
 FROM
   arc;

EXPLAIN
SELECT
  dt_from, dt_to
FROM
  view
WHERE
  '2020'  between dt_from and dt_to;


{code}
 

For table loc1, DT_FROM is projected as '', so the predicate "predicate: '2020' 
BETWEEN '' AND _col1 (type: boolean)" is correct. But for table arc the column 
itself is projected, so the predicate should be "predicate: '2020' BETWEEN _col0 
(type: boolean) AND _col1 (type: boolean)".

This happens because the predicates are stored in a map keyed by expression; here 
the expression is "_col0". When the predicate is pushed down through the union, the 
same predicate object is reused for creating the filter expression. Later, when 
constant replacement is done, the first filter overwrites the second one.

So we should create a clone (as done at other places) before using the cached 
predicate for the filter. This way the overwrite can be avoided.

 





[jira] [Created] (HIVE-24362) AST tree processing is suboptimal for tree with large number of nodes

2020-11-09 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24362:
--

 Summary: AST tree processing is suboptimal for tree with large 
number of nodes
 Key: HIVE-24362
 URL: https://issues.apache.org/jira/browse/HIVE-24362
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


In Hive, the children of an AST node are stored as a list of objects. During 
processing of a node's children, this list of objects is converted to a list of 
Nodes. This can cause long compilation times if the number of children is large. The 
converted list of children can be cached in the AST node to avoid this 
re-computation.
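A minimal sketch of that caching, assuming the converted child list can be built 
lazily and reused across traversals (class and field names are illustrative):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AstNodeSketch {
  private final List<Object> children = new ArrayList<>();
  private List<AstNodeSketch> childNodes;   // cached typed view, built at most once

  List<AstNodeSketch> getChildNodes() {
    if (childNodes == null) {
      List<AstNodeSketch> converted = new ArrayList<>(children.size());
      for (Object child : children) {
        converted.add((AstNodeSketch) child);   // conversion done once instead of per traversal
      }
      childNodes = Collections.unmodifiableList(converted);
    }
    return childNodes;
  }
}
{code}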





[jira] [Created] (HIVE-24284) NPE when parsing druid logs using Hive

2020-10-18 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24284:
--

 Summary: NPE when parsing druid logs using Hive
 Key: HIVE-24284
 URL: https://issues.apache.org/jira/browse/HIVE-24284
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The current syslog parser always expects a valid proc id. But as per RFC 3164 and 
RFC 5424, the proc id can be skipped, so Hive should handle it by using NILVALUE/an 
empty string in case the proc id is null.
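A minimal sketch of the guard, assuming the parser holds the raw proc-id bytes 
before building a String (the NPE in the trace below comes from constructing a 
String from null bytes); "-" is the RFC 5424 NILVALUE:

{code:java}
import java.nio.charset.StandardCharsets;

public class ProcIdGuard {
  static final String NILVALUE = "-";   // RFC 5424 placeholder when no proc id is present

  static String procId(byte[] rawProcId) {
    return rawProcId == null ? NILVALUE : new String(rawProcId, StandardCharsets.UTF_8);
  }
}
{code}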

 
{code:java}
Caused by: java.lang.NullPointerException: null
at java.lang.String.(String.java:566)
at 
org.apache.hadoop.hive.ql.log.syslog.SyslogParser.createEvent(SyslogParser.java:361)
at 
org.apache.hadoop.hive.ql.log.syslog.SyslogParser.readEvent(SyslogParser.java:326)
at 
org.apache.hadoop.hive.ql.log.syslog.SyslogSerDe.deserialize(SyslogSerDe.java:95)
 {code}





[jira] [Created] (HIVE-24198) Map side SMB join producing wrong result

2020-09-24 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24198:
--

 Summary: Map side SMB join producing wrong result
 Key: HIVE-24198
 URL: https://issues.apache.org/jira/browse/HIVE-24198
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


{code:java}
CREATE TABLE tbl1_n5(key int, value string) CLUSTERED BY (key) SORTED BY (key) INTO 2 BUCKETS;
CREATE TABLE tbl2_n4(key int, value string) CLUSTERED BY (key) SORTED BY (key) INTO 2 BUCKETS;

set hive.auto.convert.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.to.mapjoin=false;
set hive.auto.convert.join.noconditionaltask.size=1;

set hive.optimize.semijoin.conversion = false;

insert into tbl2_n4 values (2, 'val_2'), (0, 'val_0'), (0, 'val_0'), (0, 'val_0'), (4, 'val_4'), (5, 'val_5'), (5, 'val_5'), (5, 'val_5'), (8, 'val_8'), (9, 'val_9');

insert into tbl1_n5 values (2, 'val_2'), (0, 'val_0'), (0, 'val_0'), (0, 'val_0'), (4, 'val_4'), (5, 'val_5'), (5, 'val_5'), (5, 'val_5'), (8, 'val_8'), (9, 'val_9');

Select * from (select b.key as key, count(*) as value from tbl1_n5 b where key < 6 group by b.key) subq1 join (select a.key as key, a.value as value from tbl2_n4 a where key < 6) subq2 on subq1.key = subq2.key;
{code}

The above select is producing 0,0,0,2,4,5,5,5,5,5,5 instead of 0,0,0,2,4,5,5,5.





[jira] [Created] (HIVE-24013) Move anti join conversion after join reordering rule

2020-08-06 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-24013:
--

 Summary: Move anti join conversion after join reordering rule
 Key: HIVE-24013
 URL: https://issues.apache.org/jira/browse/HIVE-24013
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The current anti join conversion does not check for null filters on the right side 
of the join if they appear within OR conditions; only filters separated by AND 
conditions are supported. For example, queries like "select t1.fld from tbl1 t1 left 
join tbl2 t2 on t1.fld = t2.fld where t2.fld is null or t2.fld1 is null" are not 
converted to an anti join.





[jira] [Created] (HIVE-23992) Support null filter within or clause for Anti Join

2020-08-04 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23992:
--

 Summary: Support null filter within or clause for Anti Join
 Key: HIVE-23992
 URL: https://issues.apache.org/jira/browse/HIVE-23992
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The current anti join conversion does not support a join condition that is always 
true. Queries like select * from tbl t1 where not exists (select 1 from t2) are not 
converted to an anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23991) Support isAlwaysTrue for Anti Join

2020-08-04 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23991:
--

 Summary: Support isAlwaysTrue for Anti Join
 Key: HIVE-23991
 URL: https://issues.apache.org/jira/browse/HIVE-23991
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The current anti join conversion does not support direct conversion of 
not-exists to anti join. The not-exists sub query is first converted to a left 
outer join and then converted to anti join. This may cause some of the 
optimization rules to be skipped.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23981) Use task counter enum to get the approximate counter value

2020-08-03 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23981:
--

 Summary: Use task counter enum to get the approximate counter value
 Key: HIVE-23981
 URL: https://issues.apache.org/jira/browse/HIVE-23981
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera


There are cases where the compiler misestimates the key count, which results in a 
number of hash table resizes at runtime.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java#L128]

In such cases, it would be good to get "approximate_input_records" (TEZ-4207) 
counter from upstream to compute the key count more accurately at runtime.
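
A rough sketch of how such a runtime counter could drive the initial hash table 
capacity; the class, the load factor and the hard-coded counter value are purely 
illustrative:
{code:java}
// Hypothetical sketch: size the map-join hash table from an approximate input
// record count obtained at runtime, to avoid repeated resizes during loading.
public class HashTableSizing {
  static int initialCapacity(long approximateInputRecords, float loadFactor) {
    // Number of slots needed so the records fit without a resize.
    long needed = (long) Math.ceil(approximateInputRecords / loadFactor);
    int cap = 1;
    while (cap < needed && cap < (1 << 30)) {
      cap <<= 1;   // grow to the next power of two, capped at 2^30
    }
    return cap;
  }

  public static void main(String[] args) {
    // e.g. a TEZ-4207 style "approximate_input_records" value read from the counter
    long approxRecords = 1_500_000L;
    System.out.println(initialCapacity(approxRecords, 0.75f));   // prints 2097152
  }
}
{code}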

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23933) Add getRowCountInt support for anti join in calcite.

2020-07-26 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23933:
--

 Summary: Add getRowCountInt support for anti join in Calcite.
 Key: HIVE-23933
 URL: https://issues.apache.org/jira/browse/HIVE-23933
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The current anti join conversion does not support direct conversion of 
not-exists to anti join. The not-exists sub query is first converted to a left 
outer join and then converted to anti join. This may cause some of the 
optimization rules to be skipped.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23928) Support conversion of not-exists to Anti join directly

2020-07-24 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23928:
--

 Summary: Support conversion of not-exists to Anti join directly
 Key: HIVE-23928
 URL: https://issues.apache.org/jira/browse/HIVE-23928
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Support HiveJoinProjectTransposeRule for Anti Join

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23921) Support HiveJoinProjectTransposeRule for Anti Join

2020-07-23 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23921:
--

 Summary: Support HiveJoinProjectTransposeRule for Anti Join
 Key: HIVE-23921
 URL: https://issues.apache.org/jira/browse/HIVE-23921
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


 If we have a PK-FK join that is only appending columns to the FK side, it 
basically means it is not filtering anything (everything is matching). If that 
is the case, then the ANTIJOIN result would be empty. We could detect this at 
planning time and trigger the rewrite.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23920) Need to handle HiveJoinConstraintsRule for Anti Join

2020-07-23 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23920:
--

 Summary: Need to handle HiveJoinConstraintsRule for Anti Join
 Key: HIVE-23920
 URL: https://issues.apache.org/jira/browse/HIVE-23920
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Currently in Hive we create a different operator for each kind of join. In 
Calcite, newer releases seem to base everything on a single Join class. So 
classes like HiveAntiJoin and HiveSemiJoin can be merged into one.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23919) Merge all kind of Join operator variants (Semi, Anti, Normal) into one.

2020-07-23 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23919:
--

 Summary: Merge all kind of Join operator variants (Semi, Anti, 
Normal) into one. 
 Key: HIVE-23919
 URL: https://issues.apache.org/jira/browse/HIVE-23919
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


For Anti Join, we emit the records if the join condition is not satisfied. In 
case of the PK-FK rule, we have to explore whether this can be exploited to speed 
up Anti Join processing.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23907) Hash table type should be considered for calculating the Map join table size

2020-07-23 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23907:
--

 Summary: Hash table type should be considered for calculating the 
Map join table size
 Key: HIVE-23907
 URL: https://issues.apache.org/jira/browse/HIVE-23907
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


For Anti Join, we emit the records if the join condition is not satisfied. In 
case of the PK-FK rule, we have to explore whether this can be exploited to speed 
up Anti Join processing.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23906) Analyze and implement PK-FK based optimization for Anti join

2020-07-23 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23906:
--

 Summary: Analyze and implement PK-FK based optimization for Anti 
join
 Key: HIVE-23906
 URL: https://issues.apache.org/jira/browse/HIVE-23906
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Currently hive does not support Anti join. A query requiring an anti join is 
converted to a left outer join, and a null filter on the right side join key is 
added to get the desired result. This causes:
 # Extra computation — The left outer join projects the redundant columns from 
the right side. Along with that, filtering is done to remove the redundant rows. 
This can be avoided with anti join, as anti join projects only the required 
columns and rows from the left side table.
 # Extra shuffle — With anti join, the duplicate records moved to the join node 
can be dropped at the child node. This can reduce a significant amount of data 
movement if the number of distinct rows (join keys) is significant.
 # Extra memory usage — For a map based anti join, a hash set is sufficient, as 
just the key is required to check whether a record matches the join condition. 
For a left join, we need the key and the non-key columns as well, and thus a 
hash table is required.

For a query like
{code:java}
 select wr_order_number FROM web_returns LEFT JOIN web_sales ON 
wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
the number of distinct ws_order_number values in the web_sales table in a typical 
10TB TPCDS setup is just 10% of the total records. So when we convert this query 
to anti join, only 600 million rows are moved to the join node instead of 7 billion.

In the current patch, just one conversion is done. The pattern 
project->filter->left-join is converted to project->anti-join. This takes care of 
sub queries with a “not exists” clause; such queries are first converted to 
filter + left-join and then to anti join. Queries with “not in” are not handled 
in the current patch.

From the execution side, both merge join and map join with vectorized execution 
are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23905) Remove duplicate code in vector map join execution for Anti join and Semi Join.

2020-07-22 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23905:
--

 Summary: Remove duplicate code in vector map join execution for 
Anti join and Semi Join.
 Key: HIVE-23905
 URL: https://issues.apache.org/jira/browse/HIVE-23905
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


[TestMapJoinOperator.java|https://github.com/apache/hive/pull/1147/files/ee4390223caf1816ba6c07c1245876dc3c99d1e9#diff-a96ed41dcf0566f31b90b5ac75fbf20b]
 should be updated to add test cases related to anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23904) Update TestMapJoinOperator for adding anti join test cases.

2020-07-22 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23904:
--

 Summary: Update TestMapJoinOperator for adding anti join test 
cases.
 Key: HIVE-23904
 URL: https://issues.apache.org/jira/browse/HIVE-23904
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


In case of anti join, a bloom filter can be created on the left side also ("IN 
(keylist right table)"). But the filter should be "not-in" ("NOT IN (keylist 
right table)"), as we want to select the records from the left side which are not 
present in the right side. This may cause wrong results, as a bloom filter can 
have false positives; simply negating the check is not correct, and special 
handling is required for "NOT IN".

[https://github.com/jmhodges/opposite_of_a_bloom_filter/]
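
To illustrate why a plain negation is unsafe, here is a toy sketch (not Hive's 
bloom filter implementation): a bloom filter can only answer "definitely not 
present" or "maybe present", so an anti join may emit a row early only on a 
"definitely not present" answer and must fall back to the exact join check otherwise.
{code:java}
// Toy Bloom filter to illustrate the "NOT IN" pitfall described above.
import java.util.BitSet;

public class NotInBloomExample {
  static class ToyBloom {
    private final BitSet bits = new BitSet(64);
    void add(long key) {
      bits.set((int) (Math.abs(key * 31) % 64));
      bits.set((int) (Math.abs(key * 131) % 64));
    }
    /** true  => key is POSSIBLY present (may be a false positive)
     *  false => key is DEFINITELY absent */
    boolean mightContain(long key) {
      return bits.get((int) (Math.abs(key * 31) % 64))
          && bits.get((int) (Math.abs(key * 131) % 64));
    }
  }

  public static void main(String[] args) {
    ToyBloom rightKeys = new ToyBloom();
    for (long k = 0; k < 5; k++) rightKeys.add(k);   // keys on the right side

    long leftKey = 1000;                              // not on the right side
    // Safe use for anti join: only rows where mightContain() is false can be
    // emitted early; "maybe present" rows still need the exact join check,
    // because a false positive here would otherwise drop a valid row.
    if (!rightKeys.mightContain(leftKey)) {
      System.out.println(leftKey + " is definitely not on the right side -> emit");
    } else {
      System.out.println(leftKey + " might be on the right side -> defer to exact check");
    }
  }
}
{code}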



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23903) Support "not-in" for bloom filter

2020-07-22 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23903:
--

 Summary: Support "not-in" for bloom filter
 Key: HIVE-23903
 URL: https://issues.apache.org/jira/browse/HIVE-23903
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Currently hive does not support Anti join. A query requiring an anti join is 
converted to a left outer join, and a null filter on the right side join key is 
added to get the desired result. This causes:
 # Extra computation — The left outer join projects the redundant columns from 
the right side. Along with that, filtering is done to remove the redundant rows. 
This can be avoided with anti join, as anti join projects only the required 
columns and rows from the left side table.
 # Extra shuffle — With anti join, the duplicate records moved to the join node 
can be dropped at the child node. This can reduce a significant amount of data 
movement if the number of distinct rows (join keys) is significant.
 # Extra memory usage — For a map based anti join, a hash set is sufficient, as 
just the key is required to check whether a record matches the join condition. 
For a left join, we need the key and the non-key columns as well, and thus a 
hash table is required.

For a query like
{code:java}
 select wr_order_number FROM web_returns LEFT JOIN web_sales ON 
wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
the number of distinct ws_order_number values in the web_sales table in a typical 
10TB TPCDS setup is just 10% of the total records. So when we convert this query 
to anti join, only 600 million rows are moved to the join node instead of 7 billion.

In the current patch, just one conversion is done. The pattern 
project->filter->left-join is converted to project->anti-join. This takes care of 
sub queries with a “not exists” clause; such queries are first converted to 
filter + left-join and then to anti join. Queries with “not in” are not handled 
in the current patch.

From the execution side, both merge join and map join with vectorized execution 
are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23716) Support Anti Join in Hive

2020-06-17 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-23716:
--

 Summary: Support Anti Join in Hive 
 Key: HIVE-23716
 URL: https://issues.apache.org/jira/browse/HIVE-23716
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Currently hive does not support Anti join. A query requiring an anti join is 
converted to a left outer join, and a null filter on the right side join key is 
added to get the desired result. This causes:
 # Extra computation — The left outer join projects the redundant columns from 
the right side. Along with that, filtering is done to remove the redundant rows. 
This can be avoided with anti join, as anti join projects only the required 
columns and rows from the left side table.
 # Extra shuffle — With anti join, the duplicate records moved to the join node 
can be dropped at the child node. This can reduce a significant amount of data 
movement if the number of distinct rows (join keys) is significant.
 # Extra memory usage — For a map based anti join, a hash set is sufficient, as 
just the key is required to check whether a record matches the join condition. 
For a left join, we need the key and the non-key columns as well, and thus a 
hash table is required.

For a query like
{code:java}
 select wr_order_number FROM web_returns LEFT JOIN web_sales ON 
wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
the number of distinct ws_order_number values in the web_sales table in a typical 
10TB TPCDS setup is just 10% of the total records. So when we convert this query 
to anti join, only 600 million rows are moved to the join node instead of 7 billion.

In the current patch, just one conversion is done. The pattern 
project->filter->left-join is converted to project->anti-join. This takes care of 
sub queries with a “not exists” clause; such queries are first converted to 
filter + left-join and then to anti join. Queries with “not in” are not handled 
in the current patch.

From the execution side, both merge join and map join with vectorized execution 
are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22856) Hive LLAP external client not reading data from ArrowStreamReader fully

2020-02-07 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-22856:
--

 Summary: Hive LLAP external client not reading data from 
ArrowStreamReader fully
 Key: HIVE-22856
 URL: https://issues.apache.org/jira/browse/HIVE-22856
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


LlapArrowBatchRecordReader returns false when the ArrowStreamReader's 
loadNextBatch returns a column vector with length 0. But we should keep reading 
data until loadNextBatch itself returns false. Some batches may return a column 
vector of length 0; we should ignore them and wait for the next batch.
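
A hedged sketch of that reading loop using the Arrow stream reader API; the helper 
method and its use are illustrative, not the LlapArrowBatchRecordReader code:
{code:java}
// Sketch only: drain an ArrowStreamReader fully, skipping empty batches.
import java.io.InputStream;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

public class DrainArrowStream {
  static long countRows(InputStream in) throws Exception {
    long rows = 0;
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
      VectorSchemaRoot root = reader.getVectorSchemaRoot();
      // End of stream is signalled by loadNextBatch() returning false,
      // not by a batch that happens to carry zero rows.
      while (reader.loadNextBatch()) {
        if (root.getRowCount() == 0) {
          continue;            // empty batch: skip it and wait for the next one
        }
        rows += root.getRowCount();
      }
    }
    return rows;
  }
}
{code}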



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22733) After disable operation log property in hive, still HS2 saving the operation log

2020-01-15 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-22733:
--

 Summary: After disable operation log property in hive, still HS2 
saving the operation log
 Key: HIVE-22733
 URL: https://issues.apache.org/jira/browse/HIVE-22733
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


There are a few issues in this area.
 1. If logging is disabled using hive.server2.logging.operation.enabled, then 
operation logs for the queries should not be generated. But the 
registerLoggingContext method in LogUtils registers the logging context even 
if the operation log is disabled. This causes the logs to be added by the logger. 
The registration of the query context should be done only if operation logging is 
enabled.
{code:java}
 public static void registerLoggingContext(Configuration conf) {
-  MDC.put(SESSIONID_LOG_KEY, HiveConf.getVar(conf, HiveConf.ConfVars.HIVESESSIONID));
-  MDC.put(QUERYID_LOG_KEY, HiveConf.getVar(conf, HiveConf.ConfVars.HIVEQUERYID));
   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_SERVER2_LOGGING_OPERATION_ENABLED)) {
+    MDC.put(SESSIONID_LOG_KEY, HiveConf.getVar(conf, HiveConf.ConfVars.HIVESESSIONID));
+    MDC.put(QUERYID_LOG_KEY, HiveConf.getVar(conf, HiveConf.ConfVars.HIVEQUERYID));
     MDC.put(OPERATIONLOG_LEVEL_KEY, HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_SERVER2_LOGGING_OPERATION_LEVEL));{code}
 

2. In case of a failed query, we close the operations, and that deletes the 
logging context (appender and route) from the logger for that query. But if any 
log is added after that, the query logs still get added and a new operation log 
file gets generated for the query. This looks like an issue with MDC clear: MDC 
clear is not removing the keys from the map. If remove is used instead of clear, 
it works fine.
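
A minimal sketch of the corresponding unregister path using explicit MDC.remove() 
calls; the key name constants here are illustrative stand-ins for the ones in LogUtils:
{code:java}
// Sketch: unregister the logging context key-by-key (slf4j MDC API) instead of
// relying on MDC.clear(), which is reported above not to drop the keys.
import org.slf4j.MDC;

public class LoggingContextSketch {
  // Key names are illustrative; the real constants live in LogUtils.
  static final String SESSIONID_LOG_KEY = "sessionId";
  static final String QUERYID_LOG_KEY = "queryId";
  static final String OPERATIONLOG_LEVEL_KEY = "operationLogLevel";

  static void unregisterLoggingContext() {
    // Remove each query-scoped key explicitly so that no stale query id can
    // leak into log lines emitted after the operation is closed.
    MDC.remove(SESSIONID_LOG_KEY);
    MDC.remove(QUERYID_LOG_KEY);
    MDC.remove(OPERATIONLOG_LEVEL_KEY);
  }
}
{code}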



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22695) DecimalColumnVector setElement throws class cast exception if input is of type LongColumnVector

2020-01-06 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-22695:
--

 Summary: DecimalColumnVector setElement throws class cast 
exception if input is of type LongColumnVector
 Key: HIVE-22695
 URL: https://issues.apache.org/jira/browse/HIVE-22695
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Before casting the input to the decimal type, the input type should be checked. 
For long and double types, the value should be extracted and a decimal value 
created from it.
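
A rough sketch of the kind of type check being described, using the standard Hive 
vector classes; the method shape is illustrative, not the actual 
DecimalColumnVector.setElement signature:
{code:java}
// Hedged sketch: inspect the input vector's concrete type before converting its
// value into a decimal, instead of blindly casting to DecimalColumnVector.
import java.math.BigDecimal;
import org.apache.hadoop.hive.common.type.HiveDecimal;
import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.DecimalColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;

public class DecimalSetElementSketch {
  static void setElement(DecimalColumnVector out, int outIdx, ColumnVector in, int inIdx) {
    if (in instanceof LongColumnVector) {
      long v = ((LongColumnVector) in).vector[inIdx];
      out.vector[outIdx].set(HiveDecimal.create(v));
    } else if (in instanceof DoubleColumnVector) {
      double v = ((DoubleColumnVector) in).vector[inIdx];
      out.vector[outIdx].set(HiveDecimal.create(BigDecimal.valueOf(v)));
    } else if (in instanceof DecimalColumnVector) {
      out.vector[outIdx].set(((DecimalColumnVector) in).vector[inIdx].getHiveDecimal());
    } else {
      throw new IllegalArgumentException("Unsupported input vector type: " + in.getClass());
    }
  }
}
{code}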



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22365) "MetaException: Couldn't acquire the DB log notification lock because we reached the maximum # of retries" during metadata scale tests

2019-10-17 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-22365:
--

 Summary: "MetaException: Couldn't acquire the DB log notification 
lock because we reached the maximum # of retries" during metadata scale tests
 Key: HIVE-22365
 URL: https://issues.apache.org/jira/browse/HIVE-22365
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The issue is a leaked open transaction in the ObjectStore::getPartition function. 
Here, if JDO throws some exception during convertToPart, then commit is not done:
openTransaction();
MTable table = this.getMTable(catName, dbName, tableName);
MPartition mpart = getMPartition(catName, dbName, tableName, part_vals);
Partition part = convertToPart(mpart);
commitTransaction(); 
 

Because of this, all subsequent transactions of this thread are not committed.
{code:java}
if ((openTrasactionCalls == 0) && currentTransaction.isActive()) {
  transactionStatus = TXN_STATUS.COMMITED;
  currentTransaction.commit();
} {code}
This causes the select-for-update lock on NOTIFICATION_SEQUENCE to never be 
released, and all other threads fail to get this lock and time out.

So the fix is to do the operation in a try-catch block and roll back the txn in 
case of failure.
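
A sketch of that commit-or-rollback shape; the method names follow the snippet 
above and the signature details are illustrative only:
{code:java}
// Sketch: ensure the transaction opened in getPartition is always closed,
// rolling back when convertToPart (or anything else) throws.
Partition getPartitionSafely(String catName, String dbName, String tableName,
                             List<String> part_vals) throws MetaException {
  boolean committed = false;
  try {
    openTransaction();
    MTable table = getMTable(catName, dbName, tableName);
    MPartition mpart = getMPartition(catName, dbName, tableName, part_vals);
    Partition part = convertToPart(mpart);
    committed = commitTransaction();
    return part;
  } finally {
    if (!committed) {
      rollbackTransaction();   // releases the leaked open transaction, so the
                               // NOTIFICATION_SEQUENCE lock is not held forever
    }
  }
}
{code}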

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22319) Repl load fails to create partition if the dump is from old version

2019-10-10 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-22319:
--

 Summary: Repl load fails to create partition if the dump is from 
old version
 Key: HIVE-22319
 URL: https://issues.apache.org/jira/browse/HIVE-22319
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The engine field of column  stats in partition descriptor needs to be 
initialized. Handling needs to be added for column stat events also.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22272) Hive embedded HS2 throws metastore exceptions from MetastoreStatsConnector thread

2019-09-30 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-22272:
--

 Summary: Hive embedded HS2 throws metastore exceptions from 
MetastoreStatsConnector thread
 Key: HIVE-22272
 URL: https://issues.apache.org/jira/browse/HIVE-22272
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The Hive config is not passed to MetastoreStatsConnector. This causes 
RuntimeStatsLoader to connect to the embedded HMS (even though HMS is configured 
to be remote) and leads to metastore exceptions, as the metastore db will not 
have been created. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22234) Hive replication fails with table already exist error when replicating from old version of hive.

2019-09-24 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-22234:
--

 Summary: Hive replication fails with table already exist error 
when replicating from old version of hive.
 Key: HIVE-22234
 URL: https://issues.apache.org/jira/browse/HIVE-22234
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Hive replication from an old version where HIVE-22046 is not patched will not have 
the engine column set in the table column stats. This causes an "ERROR: null value 
in column "ENGINE" violates not-null constraint" error during create table while 
updating the column stats. As the column stats are updated after the create 
table txn is committed, the next retry by the HMS client throws a table already 
exists error. We need to update the ENGINE column to the default value while 
importing the table if the column value is not set. Doing the column stat update 
and create table in the same txn can be handled as part of a separate Jira.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22197) Common Merge join throwing class cast exception

2019-09-11 Thread mahesh kumar behera (Jira)
mahesh kumar behera created HIVE-22197:
--

 Summary: Common Merge join throwing class cast exception 
 Key: HIVE-22197
 URL: https://issues.apache.org/jira/browse/HIVE-22197
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


In DummyStoreOperator the row is cached to fix HIVE-5973. The row is copied and 
stored in the writable format, but the object inspector is initialized to the 
default. So when the join operator fetches the data from the dummy store operator, 
it gets the OI as Long and the row as LongWritable. This causes the class cast 
exception.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (HIVE-22092) Fetch failing with IllegalArgumentException: No ValidTxnList when refetch is done

2019-08-08 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-22092:
--

 Summary: Fetch failing with IllegalArgumentException: No 
ValidTxnList when refetch is done
 Key: HIVE-22092
 URL: https://issues.apache.org/jira/browse/HIVE-22092
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The fetch task is created during query compilation with the config of the driver. 
That config will have the valid txn list set, so the fetch task will have the 
valid txn list set while doing fetch for ACID tables. But when the user does a 
refetch with the cursor set to the first position, it reinitializes the fetch task 
with the driver config (cached in the task config). By that time, the select query 
would have cleaned up the valid txn list from the config, and the fetch will 
happen with the valid txn list as null. This causes the illegal argument exception.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (HIVE-21974) The list of table expression in the inclusion and exclusion list should be separated by '|' instead of comma.

2019-07-09 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21974:
--

 Summary: The list of table expression in the inclusion and 
exclusion list should be separated by '|' instead of comma.
 Key: HIVE-21974
 URL: https://issues.apache.org/jira/browse/HIVE-21974
 Project: Hive
  Issue Type: Sub-task
  Components: repl
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


REPL DUMP fetches the events from the NOTIFICATION_LOG table based on a regular 
expression + inclusion/exclusion list. So, in case of a rename table event, the 
event will be ignored if the old table doesn't match the pattern, but the new 
table should still be bootstrapped. REPL DUMP should have a mechanism to detect 
such tables and automatically bootstrap them with incremental replication. Also, 
if the renamed table is excluded from the replication policy, then the old table 
needs to be dropped at the target as well. 

There are 4 scenarios that needs to be handled.
 # Both new name and old name satisfies the table name pattern filter.
 ## No need to do anything. The incremental event for rename should take care 
of the replication.
 # Both the names do not satisfy the table name pattern filter.
 ## Both the names are not in the scope of the policy and thus nothing needs to 
be done.
 # New name satisfies the pattern but the old name does not.
 ## The table will not be present at the target.
 ## Rename event handler for dump should detect this case and add the new table 
name to the list of table for bootstrap.
 ## All the events related to the table (new name) should be ignored.
 ## If there is a drop event for the table (with new name), then remove the 
table from the list of tables to be bootstrapped.
 ## In case of rename (double rename)
 ### If the new name satisfies the table pattern, then add the new name to the 
list of tables to be bootstrapped and remove the old name from the list of 
tables to be bootstrapped.
 ### If the new name does not satisfy, then just remove the table name from 
the list of tables to be bootstrapped.
 # New name does not satisfy the pattern but the old name does.
 ## Change the rename event to a drop event.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21958) The list of table expression in the inclusion and exclusion list should be separated by '|' instead of comma.

2019-07-04 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21958:
--

 Summary: The list of table expression in the inclusion and 
exclusion list should be separated by '|' instead of comma.
 Key: HIVE-21958
 URL: https://issues.apache.org/jira/browse/HIVE-21958
 Project: Hive
  Issue Type: Sub-task
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Java regular expressions do not support comma as a separator. If the user wants 
multiple expressions to be present in the include or exclude list, the expressions 
can be provided separated by the pipe ('|') character. The policy will look 
something like db_name.'(t1*)|(t3)'.'t100'
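
A tiny illustration of why '|' works as the separator: the combined include 
expression remains a single valid java.util.regex pattern (the table names and 
the t1.* translation are made up for this example):
{code:java}
// Illustration only: '|' keeps the include list a single valid regex alternation,
// whereas a comma would just be a literal character inside the pattern.
import java.util.regex.Pattern;

public class ReplPolicyPatternExample {
  public static void main(String[] args) {
    // Roughly corresponds to an include list like '(t1.*)|(t3)'.
    Pattern include = Pattern.compile("(t1.*)|(t3)");
    System.out.println(include.matcher("t1_sales").matches()); // true
    System.out.println(include.matcher("t3").matches());       // true
    System.out.println(include.matcher("t2").matches());       // false
  }
}
{code}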



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21956) Add the list of table selected by dump in the dump folder.

2019-07-04 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21956:
--

 Summary: Add the list of table selected by dump in the dump folder.
 Key: HIVE-21956
 URL: https://issues.apache.org/jira/browse/HIVE-21956
 Project: Hive
  Issue Type: Sub-task
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The list of tables selected by a dump should be kept in the dump folder as a 
_tables file. This will help the user find out which tables were replicated, and 
the list can be used by the user for Ranger and Atlas policy replication.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21926) CLONE - REPL - With table list - "TO" and "FROM" clause should not be allowed along with table filter list

2019-06-26 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21926:
--

 Summary: CLONE - REPL - With table list - "TO" and "FROM" clause 
should not be allowed along with table filter list
 Key: HIVE-21926
 URL: https://issues.apache.org/jira/browse/HIVE-21926
 Project: Hive
  Issue Type: Sub-task
  Components: repl
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


If some rename events need to be dumped and replayed while a replace policy is 
being executed, the handling needs to take care of policy inclusion in both the 
old and the new policy for each table name.

 1. Create a list of tables to be bootstrapped. 

 2. During handling of alter table, if the alter type is rename: 

     1. If the old table name is present in the list of tables to be 
bootstrapped, remove it.

     2. If the new table name matches the new policy, add it to the list of 
tables to be bootstrapped.

 3. During handling of drop table:

     1. If the table is in the list of tables to be bootstrapped, then remove 
it and ignore the event.

 4. During other event handling: 

     1. If the table is in the list of tables to be bootstrapped, then 
ignore the event.
 

Rename handling during replace policy
 # Old name not matching old policy – The old table will not be there at the 
target cluster. The table will not be returned by get-all-table.
 ## Old name is not matching new policy
 ### New name not matching old policy
  New name not matching new policy
 * Ignore the event, no need to do anything.
  New name matching new policy
 * The table will be returned by get-all-table. Replace policy handler will 
bootstrap this table as its matching new policy and not matching old policy.
 * All the future events will be ignored as part of check added by replace 
policy handling.
 * All the event with old table name will anyways be ignored as the old 
name is not matching the new policy.
 ### New name matching old policy
  New name not matching new policy
 * As the new name is not matching the new policy, the table need not be 
replicated.
 * As the old name is not matching the new policy, the rename events will 
be ignored.
 * So nothing to be done for this scenario.
  New name matching new policy
 * As the new name is matching both old and new policy, replace handler 
will not bootstrap the table.
 * Add the table to the list of tables to be bootstrapped.
 * Ignore all the events with new name.
 * If there is a drop event for the table (with new name), then remove the 
table from the list of tables to be bootstrapped.
 * In case of rename event (double rename)
 ** If the new name satisfies the table pattern, then add the new name to 
the list of tables to be bootstrapped and remove the old name from the list of 
tables to be bootstrapped.
 ** If the new name does not satisfy, then just remove the table name 
from the list of tables to be bootstrapped.
 ## Old name is matching new policy – As per the replace policy handler, which 
checks based on the old table, the table should be bootstrapped and the event 
should be ignored. But the rename handler should decide based on the new name. 
The old table name will not be returned by get-all-table, so the replace handler 
will not do anything for the old table.
 ### New name not matching old policy
  New name not matching new policy
 * As the old table is not there at target and new name is not matching new 
policy. Ignore the event.
 * No need to add the table to the list of tables to be bootstrapped.
 * All the subsequent events will be ignored as the new name is not 
matching the new policy.
  New name matching new policy
 * As the new name is not matching old policy but matching new policy, the 
table will be bootstrapped by replace policy handler. So rename event need not 
add this table to list of table to be bootstrapped.
 * All the future events will be ignored by replace policy handler.
 * For rename event (double rename)
 ** If there is a rename, the table (with intermittent new name) will not 
be present and thus replace handler will not bootstrap the table.
 ** So if the new name (the latest one) is matching the new policy, then 
add it to the list of table to be bootstrapped.
 ** And if the new name (the latest one) is not matching the new policy, 
then just ignore the event, as the intermittent new name would not have been 
added to the list of tables to be bootstrapped.
 ### New name matching old policy
  New name not matching new policy
 * Dump the event. The table will be dropped by repl load at the target.
  New name matching new policy
 * Replace handler will not bootstrap this table as the new name is 
matching both policies.
 * As old name is not matching the old policy, the table will not be there 
at target. The rename event should add the new 

[jira] [Created] (HIVE-21886) REPL - With table list - Handle rename events during replace policy

2019-06-18 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21886:
--

 Summary: REPL - With table list - Handle rename events during 
replace policy
 Key: HIVE-21886
 URL: https://issues.apache.org/jira/browse/HIVE-21886
 Project: Hive
  Issue Type: Sub-task
  Components: repl
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


REPL DUMP fetches the events from the NOTIFICATION_LOG table based on a regular 
expression + inclusion/exclusion list. So, in case of a rename table event, the 
event will be ignored if the old table doesn't match the pattern, but the new 
table should still be bootstrapped. REPL DUMP should have a mechanism to detect 
such tables and automatically bootstrap them with incremental replication. Also, 
if the renamed table is excluded from the replication policy, then the old table 
needs to be dropped at the target as well. 

There are 4 scenarios that needs to be handled.
 # Both new name and old name satisfies the table name pattern filter.
 ## No need to do anything. The incremental event for rename should take care 
of the replication.
 # Both the names do not satisfy the table name pattern filter.
 ## Both the names are not in the scope of the policy and thus nothing needs to 
be done.
 # New name satisfies the pattern but the old name does not.
 ## The table will not be present at the target.
 ## Rename event handler for dump should detect this case and add the new table 
name to the list of table for bootstrap.
 ## All the events related to the table (new name) should be ignored.
 ## If there is a drop event for the table (with new name), then remove the 
table from the list of tables to be bootstrapped.
 ## In case of rename (double rename)
 ### If the new name satisfies the table pattern, then add the new name to the 
list of tables to be bootstrapped and remove the old name from the list of 
tables to be bootstrapped.
 ### If the new name does not satisfy, then just remove the table name from 
the list of tables to be bootstrapped.
 # New name does not satisfy the pattern but the old name does.
 ## Change the rename event to a drop event.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21844) HMS schema Upgrade Script is failing with NPE

2019-06-06 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21844:
--

 Summary: HMS schema Upgrade Script is failing with NPE
 Key: HIVE-21844
 URL: https://issues.apache.org/jira/browse/HIVE-21844
 Project: Hive
  Issue Type: Task
  Components: HiveServer2
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


The schema upgrade tool is failing with an NPE while executing "SELECT 'Upgrading 
MetaStore schema from 1.2.0 to 2.0.0' AS ' '". The header row (metadata) is 
coming back with rows having value null. This causes a null pointer access in 
TableOutputFormat::getOutputString when row.values[i] is accessed. Instead of 
" AS ' ' ", if some other value like "AS dummy" is given, it works fine.

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21788) Support replication from hadoop-2 (hive 3.0 and below) on-prem cluster to hadoop-3 (hive 4 and above) cloud cluster

2019-05-23 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21788:
--

 Summary: Support replication from hadoop-2 (hive 3.0 and below) 
on-prem cluster to hadoop-3 (hive 4 and above) cloud cluster
 Key: HIVE-21788
 URL: https://issues.apache.org/jira/browse/HIVE-21788
 Project: Hive
  Issue Type: Task
  Components: HiveServer2, repl
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


In case of replication to cloud, both dump and load are executed in the source 
cluster. This push-based replication is done to avoid computation at the target 
cloud cluster. If strict managed table is not set to true in the source cluster, 
the tables will be non-ACID. So during replication to a cluster with strict 
managed table, the same migration logic as the upgrade tool has to be applied on 
the replicated data. This migration logic is implemented only in hive 4.0, so a 
hive 4.0 instance has to be started at the source cluster. If the source cluster 
has a hadoop-2 installation, hive 4 has to be built with hadoop-2, and necessary 
changes are required in the pom files and the shim files.

1. Change the pom.xml files to accept a profile for hadoop-2. If the hadoop-2 
profile is set, the hadoop version should be set accordingly to hadoop-2.

2. In shims, create a new file for hadoop-2. Based on the profile, the respective 
file will be included in the build.

3. Change artifactId hadoop-hdfs-client to hadoop-client, as in hadoop-2 the 
jars are stored under the hadoop-client folder.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21775) Handling partition level stat replication

2019-05-21 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21775:
--

 Summary: Handling partition level stat replication
 Key: HIVE-21775
 URL: https://issues.apache.org/jira/browse/HIVE-21775
 Project: Hive
  Issue Type: Sub-task
  Components: HiveServer2, repl
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


Statistics for a table are maintained for all the partitions. The table-level 
basic stats present in the table have the combined data for all the partitions. 
When only a few partitions are replicated, the replicated stats for the table 
may not be correct. In case of partition column stats, the aggregate stats from 
the partition stats table will not be correct. So statistics replication cannot 
be supported in case of partition level replication.

TODO: Need to check how to handle it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21774) Support partition level filtering for events with multiple partitions

2019-05-21 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21774:
--

 Summary: Support partition level filtering for events with 
multiple partitions
 Key: HIVE-21774
 URL: https://issues.apache.org/jira/browse/HIVE-21774
 Project: Hive
  Issue Type: Sub-task
  Components: HiveServer2, repl
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


Some of the events in hive can span multiple partitions, tables or even 
databases. Events related to transactions can span multiple databases. 
When a transaction does some write operation, it is added to the write 
notification log table. During dump of a commit transaction event, all the entries 
present in the write notification log table for that transaction are read and 
added to the commit transaction message. In case a partition filter is supplied 
for the dump, only those partitions which are part of the policy should be 
added to the commit txn message.
 * All the events which are not partition level will be added to the list of 
events to be dumped.
 * Pass the filter condition for the policy to the commit transaction message 
handler (events which are not partition level).
 * During dump of a commit transaction event, extract the events added in the 
write notification log table and compare them with the filter condition.
 * If an event from the write notification log satisfies the filter condition, 
then add it to the commit transaction message.
 * If the filter condition is null, then add all the events from the write 
notification log table to the commit transaction message.
 * For events which do not have partition level info, like open txn, abort txn 
etc., just dump the events without any filtering. So it may happen that some 
events which are not related to any satisfying partition get replayed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21773) Supporting external table replication with partition filter.

2019-05-21 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21773:
--

 Summary: Supporting external table replication with partition 
filter.
 Key: HIVE-21773
 URL: https://issues.apache.org/jira/browse/HIVE-21773
 Project: Hive
  Issue Type: Sub-task
  Components: HiveServer2, repl
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


Hive external table replication is done differently than managed table 
replication. In case of external table, list is created for the locations of 
the table and partitions to be replicated. If the partition location is within 
the table location, then partition location is not added to the list. For 
partitions with location outside table, partition location is added to the 
list. In case of incremental dump, the data related events are ignored and just 
the metadata related events are dumped. The list of location is prepared and 
that is used for replication. During load, the events are replayed and then the 
distcp tasks are created, one for each location present in the list.

For partition level replication, not all partitions will be present in the dump. 
So even if the partition locations are within the table location, each 
partition location will be added to the list.
 * If where condition is present in the REPL DUMP command then add location for 
each satisfying partition even though the partition location is within table 
location.
 * If table is not mentioned in the where clause then follow the older behavior.
 * If table is mentioned with a key but the key does not match any of the 
partitioned column then fail repl dump.
 * If the table is mentioned with the key and even if all the partitions are 
satisfying the filter condition, add location for each partition. This is to 
avoid copying partitions which are added using alter after the dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21772) Support dynamic addition and deletion of partitions in the policy

2019-05-21 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21772:
--

 Summary: Support dynamic addition and deletion of partitions in 
the policy
 Key: HIVE-21772
 URL: https://issues.apache.org/jira/browse/HIVE-21772
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


If the user modifies the filter condition in the policy, then the participating 
partitions of the policy can change. In such scenarios, the user needs to 
provide the old filter condition along with the REPL DUMP command.
 * The old filter will be passed as a string along with ‘with’ clause of the 
REPL dump command. Need to create the AST from the string to be used for 
filtering.
 * Convert the string to list of AST, one for each table and make a list of the 
partitions satisfying the old filter condition.
 * List of partition satisfying the new filter condition will be compared with 
the old list.
 * If the partition is not present in old but is present in new, then the 
partition will be added to the list of partitions to be bootstrapped.
 * If the partition is present in old, but not present in new then the 
partition will be added to the list of partitions to be deleted.
 * During load operation, after all the events are replayed, the list of 
bootstrap and list of deleted will be read and corresponding action will be 
executed at target.
 * There is a possibility that the partition to be deleted has already been 
deleted by some replayed event; in that case the delete will be ignored.
 * Similarly if some partition from the list of bootstrap is already present, 
then bootstrap will be ignored.
 * As a partition cannot be present in both the bootstrap and delete lists, the 
lists can be processed in parallel.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21771) Support partition filter (where clause) in REPL dump command

2019-05-21 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21771:
--

 Summary: Support partition filter (where clause) in REPL dump 
command
 Key: HIVE-21771
 URL: https://issues.apache.org/jira/browse/HIVE-21771
 Project: Hive
  Issue Type: Sub-task
  Components: HiveServer2, repl
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


*Bootstrap for managed table*

The user should be allowed to execute REPL DUMP with a where clause. The where 
clause should support filtering out partitions from the dump. The format of the 
where clause should be similar to *"REPL DUMP dbname from 10 where t0 where key < 10, t1* 
where key = 3, [t2*,t3] where key > 3"*. For the initial version, only very basic 
filter conditions will be supported, and later the complexity will be increased 
as and when required.
 * From the AST generated for the where clause, extract the table information.
 * Generate AST for each table.
 * List the partitions for each table using the AST generated for that table, 
with the same metastore API used by select queries.
 * During bootstrap load use the partition list to dump the partitions.
 * During incremental dump, use the list to filter out the event.

In case of bootstrap load, all the tables of the database will be scanned and
 * If table is not partitioned, then it will be dumped.
 * If key provided in the filter condition for the table is not a partition 
column, then dump will fail.
 * If table is not mentioned in the where clause, then all partitions of the 
table will be dumped.
 * All the partitions of the table satisfying the where clause will be dumped.

*Incremental for managed table*

In case of Incremental Dump, the events from the notification log will be 
scanned and once the partition spec is extracted from the event, the partition 
spec will be filtered against the condition. 
 * If table is not partitioned then the event will be added to the dump.
 * If key mentioned is not a partition column, then dump will fail.
 * If the table is not mentioned in the filter then event will be added to the 
dump.
 * If the event is multi-partitioned, then the event will be added to the dump. 
(Filtering out redundant partitions from the message will be done as part of a 
separate task.)
 * If the partition spec matches the filter, then the event will be added to 
the dump.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21770) Support extraction of replication spec from notification event.

2019-05-21 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21770:
--

 Summary: Support extraction of replication spec from notification 
event. 
 Key: HIVE-21770
 URL: https://issues.apache.org/jira/browse/HIVE-21770
 Project: Hive
  Issue Type: Sub-task
  Components: HiveServer2, repl
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


The notification event structure currently does not have the partition spec. 
For events which can span multiple databases and tables, the database and 
table info cannot be obtained from the event structure. To know which partition 
an event was added for, the event message has to be deserialized and the 
partition information obtained from it. 
 * Each event handler has to expose a static API.
 * The API should take the event as input and return the list of db name, table 
name and partition spec from it.
 * If the database name, table name or partition name is present in the event 
structure, then return it. If all this info is present, then there is no need to 
deserialize the message. Later, if this info is added to the event structure, 
it will be useful. 
 * Otherwise, deserialize the message and create the list of names, returning 
them through a partition info class object.
 * If the table is not partitioned or it is a table level event, then set the 
partition info as null. Same for the table info in case of db level events.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21769) Support Partition level filtering for hive replication command

2019-05-21 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21769:
--

 Summary: Support Partition level filtering for hive replication 
command
 Key: HIVE-21769
 URL: https://issues.apache.org/jira/browse/HIVE-21769
 Project: Hive
  Issue Type: Task
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


 # User should be able to dump and load events satisfying a filter based on 
partition specification.
 # The partitions included in each dump are not constant and may vary between 
dumps.
 # User should be able to modify the policy in between to include/exclude 
partitions.
 # Only simple filter operators like >, <, >=, <=, ==, and, or against 
constants will be supported.
 # Configuration – a time interval to filter out partitions if the partition 
specification represents time (using the ‘with’ clause in the dump command). -- 
Will not be supported in the first version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21766) Select * returns no rows in hive bootstrap from a static or dynamic partitioned managed table with Timestamp type as partition column from on prem to WASB even though cou

2019-05-21 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21766:
--

 Summary: Select * returns no rows in hive bootstrap from a static 
or dynamic partitioned managed table with Timestamp type as partition column 
from on prem to WASB even though count ( * ) matches
 Key: HIVE-21766
 URL: https://issues.apache.org/jira/browse/HIVE-21766
 Project: Hive
  Issue Type: Bug
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


*Cause:*
REPL LOAD replicates txn state (writeIds of tables) to the target HMS (backend 
RDBMS). But, in this case, it is still connected to the source HMS because the 
configs passed in the WITH clause were not stored in HiveTxnManager. We pass the 
config object to the ReplTxnTask objects, but HiveTxnManager was created by the 
Driver using the session config object.

*Fix:*
We need to pass it to HiveTxnManager too, by creating a txn manager for repl txn 
operations with the config passed by the user.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21731) Hive import fails, post upgrade of source 3.0 cluster, to a target 4.0 cluster with strict managed table set to true

2019-05-15 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21731:
--

 Summary: Hive import fails, post upgrade of source 3.0 cluster, to 
a target 4.0 cluster with strict managed table set to true
 Key: HIVE-21731
 URL: https://issues.apache.org/jira/browse/HIVE-21731
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The scenario is:
 # A replication policy is set with a hive 3.0 source cluster (strict managed 
table set to false) and a hive 4.0 target cluster with strict managed table set 
to true.
 # The user upgrades the 3.0 source cluster to a 4.0 cluster using the upgrade tool.
 # The upgrade converts all managed tables to ACID tables.
 # In the next repl dump, the user sets hive.repl.dump.include.acid.tables and 
hive.repl.bootstrap.acid.tables to true, triggering bootstrap of the newly 
converted ACID tables.
 # As the old tables are non-txn tables, the dump is not filtering the events even 
though bootstrap acid table is set to true. This causes the repl load to 
fail, as the write id is not set in the table object.
 # If we ignore the event replay, the bootstrap fails with a dump directory 
mismatch error.

The fix should be:
 # Ignore dumping the alter table event if bootstrap acid table is set to true and 
the alter is converting a non-acid table to an acid table.
 # In case of bootstrap during incremental load, ignore the dump directory 
property set in the table object.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21722) REPL::END event log is not included in hiveStatement.getQueryLog output.

2019-05-13 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21722:
--

 Summary: REPL::END event log is not included in 
hiveStatement.getQueryLog output.
 Key: HIVE-21722
 URL: https://issues.apache.org/jira/browse/HIVE-21722
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2, repl
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


getQueryLog only reads logs from the Background thread scope. If parallel execution 
is set to true, a new thread is created for execution, and all the logs added by 
the new thread are not added to the parent Background thread scope. In 
replication, replStateLogTask is started in parallel mode, causing the 
logs to be skipped from the getQueryLog scope. 

There is one more issue, where the conf is not passed while creating 
replStateLogTask during bootstrap load end. The same issue is there with event 
load during incremental load. The incremental load end log task is created with 
the proper config. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21717) Rename is failing for directory in move task

2019-05-10 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21717:
--

 Summary: Rename is failing for directory in move task 
 Key: HIVE-21717
 URL: https://issues.apache.org/jira/browse/HIVE-21717
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


Rename fails with "destination directory not empty" in case a directory is moved 
directly to the table location from the staging directory, as rename cannot 
overwrite a non-empty destination directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21712) Replication scenarios should be tested with hive.strict.managed.tables set to true

2019-05-08 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21712:
--

 Summary: Replication scenarios should be tested with 
hive.strict.managed.tables set to true
 Key: HIVE-21712
 URL: https://issues.apache.org/jira/browse/HIVE-21712
 Project: Hive
  Issue Type: Bug
  Components: Hive, repl
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


In replication test suites, in some cases the tables are created with 
transactional property set to non-acid and thus the intended tests are missing. 
By setting the default value of hive.strict.managed.tables to true in 
replication related test suites, the tables will be created as ACID tables by 
default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21700) hive incremental load going OOM while adding load task to the leaf nodes of the DAG

2019-05-07 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21700:
--

 Summary: hive incremental load going OOM while adding load task to 
the leaf nodes of the DAG
 Key: HIVE-21700
 URL: https://issues.apache.org/jira/browse/HIVE-21700
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


While listing the child nodes to check for leaf nodes, we need to filter out 
tasks which are already added to the children list. If a task is added multiple 
times to the children list, it may cause the list to grow exponentially. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21694) Hive driver waiting time is fixed for task getting executed in parallel.

2019-05-05 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21694:
--

 Summary: Hive driver waiting time is fixed for task getting 
executed in parallel.
 Key: HIVE-21694
 URL: https://issues.apache.org/jira/browse/HIVE-21694
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


During command execution, the Hive driver runs a task in a separate thread if 
the task is marked for parallel execution. After starting the task, the driver 
checks whether the task has finished; if not, it waits 2 seconds before waking 
up again to check the task status. For tasks whose execution time is in 
milliseconds, this wait can add substantial overhead. Instead of a fixed wait, 
an exponentially backed-off sleep can be used to reduce the overhead: the 
sleep can start at 100 ms and double on each iteration, up to 2 seconds.
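
A minimal sketch of the proposed polling loop (the isDone supplier stands in 
for the driver's task-status check):

{code:java}
import java.util.function.BooleanSupplier;

public class TaskPoller {
  // Poll for task completion with exponential back-off: start at 100 ms and
  // double the sleep on each iteration, capped at 2000 ms.
  static void waitForCompletion(BooleanSupplier isDone) throws InterruptedException {
    long sleepMs = 100;
    final long maxSleepMs = 2000;
    while (!isDone.getAsBoolean()) {
      Thread.sleep(sleepMs);
      sleepMs = Math.min(sleepMs * 2, maxSleepMs);
    }
  }
}
{code}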



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21566) Support locking during ACID table replication

2019-04-03 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21566:
--

 Summary: Support locking during ACID table replication  
 Key: HIVE-21566
 URL: https://issues.apache.org/jira/browse/HIVE-21566
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


During load of an ACID table we need to take a lock.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21450) Buffer Reader is not closed during executeInitSql

2019-03-14 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21450:
--

 Summary: Buffer Reader is not closed during executeInitSql
 Key: HIVE-21450
 URL: https://issues.apache.org/jira/browse/HIVE-21450
 Project: Hive
  Issue Type: Bug
  Components: JDBC
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


The buffered reader should be opened within a try-with-resources block so that 
it is closed after execution.
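
A minimal sketch of the fix, assuming the init SQL file is read line by line 
(file name handling and statement execution are illustrative):

{code:java}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class InitSqlRunner {
  // Open the init SQL file with try-with-resources so the reader is closed
  // even if statement execution throws.
  static void executeInitSql(String initFile) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader(initFile))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println("executing: " + line); // stand-in for statement execution
      }
    }
  }
}
{code}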



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21446) Hive Server going OOM during hive external table replications

2019-03-13 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21446:
--

 Summary: Hive Server going OOM during hive external table 
replications
 Key: HIVE-21446
 URL: https://issues.apache.org/jira/browse/HIVE-21446
 Project: Hive
  Issue Type: Bug
  Components: repl
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


The FileSystem objects opened using proxy users are never closed, which leads 
to the OOM in HiveServer2 during external table replication.
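
A hedged sketch of the leak and the cleanup, using standard Hadoop APIs (paths 
and user names are illustrative): FileSystem instances created inside a 
proxy-user UGI are cached per UGI and must be released explicitly.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;
import java.security.PrivilegedExceptionAction;

public class ProxyFsCleanup {
  static void copyAsProxyUser(String proxyUser, Configuration conf) throws Exception {
    UserGroupInformation ugi =
        UserGroupInformation.createProxyUser(proxyUser, UserGroupInformation.getLoginUser());
    try {
      ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
        FileSystem fs = FileSystem.get(conf);   // cached per-UGI instance
        fs.exists(new Path("/tmp/repl"));       // illustrative file operation
        return null;
      });
    } finally {
      // Without this, each replication cycle leaks a cached FileSystem object.
      FileSystem.closeAllForUGI(ugi);
    }
  }
}
{code}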



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21325) Hive external table replication failed with Permission denied issue.

2019-02-26 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21325:
--

 Summary: Hive external table replication failed with Permission 
denied issue.
 Key: HIVE-21325
 URL: https://issues.apache.org/jira/browse/HIVE-21325
 Project: Hive
  Issue Type: Bug
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


During external table replication, the file copy is done in parallel with the 
metadata replication. If the file copy task creates the directory with doAs 
set to true, the directory is created with the permissions of the user running 
the REPL command. In that case the metadata task may fail while creating the 
table, as the hive user might not have access to the created directory.

The fix should be:
 # While creating the directory, if SQL-based authorization is enabled, then 
disable storage-based authorization for the hive user.
 # Currently the created directory carries the login user's access; it should 
retain the source cluster's owner, group and permission (see the sketch below).
 # For external table replication, don't create the directory during create 
table and add partition.
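
A hedged sketch of point 2, using standard Hadoop APIs (paths are 
illustrative): copy the owner, group and permission bits from the source 
directory's FileStatus onto the directory created at the target.

{code:java}
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;

public class PreservePermissions {
  // Create dstDir on the target cluster and mirror owner/group/permission
  // from the corresponding source directory instead of the login user's defaults.
  static void mkdirLikeSource(FileSystem srcFs, Path srcDir,
                              FileSystem dstFs, Path dstDir) throws IOException {
    FileStatus src = srcFs.getFileStatus(srcDir);
    dstFs.mkdirs(dstDir);
    dstFs.setPermission(dstDir, src.getPermission());
    dstFs.setOwner(dstDir, src.getOwner(), src.getGroup());
  }
}
{code}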

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21314) Hive Replication not retaining the owner in the replicated table

2019-02-24 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21314:
--

 Summary: Hive Replication not retaining the owner in the 
replicated table
 Key: HIVE-21314
 URL: https://issues.apache.org/jira/browse/HIVE-21314
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


Hive replication is not retaining the owner of the replicated table. The 
owner of the target table is set to the user executing the load command. 
Instead, the owner information should be read from the dump metadata and used 
while creating the table on the target cluster.
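
A minimal sketch of the intended behavior, using the metastore Table API (how 
the dumped table definition is obtained is elided here):

{code:java}
import org.apache.hadoop.hive.metastore.api.Table;

public class OwnerPreservingLoad {
  // Carry the owner from the dumped (source) table definition onto the table
  // being created at the target, instead of defaulting to the loading user.
  static void applyOwner(Table dumpedTable, Table targetTable) {
    targetTable.setOwner(dumpedTable.getOwner());
  }
}
{code}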



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21260) Hive 3 (onprem) -> 4(onprem): Hive replication failed due to postgres sql execution issue

2019-02-12 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21260:
--

 Summary: Hive 3 (onprem) -> 4(onprem): Hive replication failed due 
to postgres sql execution issue
 Key: HIVE-21260
 URL: https://issues.apache.org/jira/browse/HIVE-21260
 Project: Hive
  Issue Type: Bug
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0


Missing quotes around identifiers in the SQL string cause an SQL execution 
error on Postgres.

 
{code:java}
metastore.RetryingHMSHandler (RetryingHMSHandler.java:invokeInternal(201)) - 
MetaException(message:Unable to update transaction database 
org.postgresql.util.PSQLException: ERROR: relat
ion "database_params" does not exist
Position: 25
at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2284)
at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2003)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:200)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:424)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:321)
at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:284)
at com.zaxxer.hikari.pool.ProxyStatement.executeQuery(ProxyStatement.java:108)
at 
com.zaxxer.hikari.pool.HikariProxyStatement.executeQuery(HikariProxyStatement.java)
at 
org.apache.hadoop.hive.metastore.txn.TxnHandler.updateReplId(TxnHandler.java:907)
at 
org.apache.hadoop.hive.metastore.txn.TxnHandler.commitTxn(TxnHandler.java:1023)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.commit_txn(HiveMetaStore.java:7703)
at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108)
at com.sun.proxy.$Proxy39.commit_txn(Unknown Source)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$commit_txn.getResult(ThriftHiveMetastore.java:18730)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$commit_txn.getResult(ThriftHiveMetastore.java:18714)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:636)
at 
org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:631)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at 
org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:631)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
){code}
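
A hedged illustration of the failure mode (the exact statement in TxnHandler 
differs; column names are shown for illustration): Postgres folds unquoted 
identifiers to lowercase, so metastore tables created with uppercase names 
must be referenced with quoted identifiers.

{code:java}
public class QuotedIdentifiers {
  public static void main(String[] args) {
    // Unquoted: Postgres folds this to database_params, which does not exist.
    String broken = "select PARAM_VALUE from DATABASE_PARAMS where PARAM_KEY = ?";
    // Quoted: identifiers keep their case and match the metastore schema.
    String fixed  = "select \"PARAM_VALUE\" from \"DATABASE_PARAMS\" where \"PARAM_KEY\" = ?";
    System.out.println(broken);
    System.out.println(fixed);
  }
}
{code}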
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21213) Acid table bootstrap replication needs to handle directory created by compaction with txn id

2019-02-04 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21213:
--

 Summary: Acid table bootstrap replication needs to handle 
directory created by compaction with txn id
 Key: HIVE-21213
 URL: https://issues.apache.org/jira/browse/HIVE-21213
 Project: Hive
  Issue Type: Sub-task
  Components: Hive, HiveServer2, repl
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The current implementation of compaction uses the txn id in the directory 
name. This isolates queries from reading the directory until the compaction 
has finished. In case of replication, the directory cannot be copied as-is 
because the txn list at the target may differ from the source. So conversion 
logic is required to create a new directory with a valid txn at the target and 
dump the data into the newly created directory.
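
A hedged sketch of the conversion, assuming compacted directories carry a 
visibility suffix of the form base_<writeId>_v<txnId> (the naming details and 
the way the target txn is allocated are illustrative):

{code:java}
public class CompactedDirRename {
  // Rebuild a compacted directory name so its visibility txn id refers to a
  // transaction that is valid on the target cluster.
  static String renameForTarget(String srcDirName, long targetTxnId) {
    int idx = srcDirName.lastIndexOf("_v");
    if (idx < 0) {
      return srcDirName;                 // not a compacted dir; copy as-is
    }
    return srcDirName.substring(0, idx) + "_v" + String.format("%07d", targetTxnId);
  }

  public static void main(String[] args) {
    System.out.println(renameForTarget("base_0000005_v0000012", 42)); // base_0000005_v0000042
  }
}
{code}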



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21197) Hive Replication can add duplicate data during migration from 3.0 to 4

2019-01-31 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21197:
--

 Summary: Hive Replication can add duplicate data during migration 
from 3.0 to 4
 Key: HIVE-21197
 URL: https://issues.apache.org/jira/browse/HIVE-21197
 Project: Hive
  Issue Type: Task
  Components: repl
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


During the bootstrap phase it may happen that files copied to the target were 
created by events which are not part of the bootstrap. This is because 
bootstrap first gets the last event id and then the file list, so any event 
that happens in between contributes files that bootstrap also includes. The 
same files are then copied again during the first incremental replication just 
after the bootstrap. In the normal scenario the duplicate copy does not cause 
any issue, as Hive allows use of the target database only after the first 
incremental. But in the migration case, the files at source and target are 
copied to different locations (based on the write id at the target), so this 
may lead to duplicate data at the target. This can be avoided by having a 
check at load time for duplicate files. The check only needs to be done for 
the first incremental, and the search can be limited to the bootstrap 
directory (with write id 1): if the file is already present, just skip the copy.
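
A hedged sketch of the duplicate check, using the Hadoop FileSystem API 
(directory layout and write-id naming are illustrative):

{code:java}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;

public class DuplicateFileCheck {
  // During the first incremental load after bootstrap, skip copying a file if
  // the bootstrap load (write id 1) already placed it on the target.
  static boolean shouldCopy(FileSystem targetFs, Path bootstrapWriteIdDir,
                            String fileName) throws IOException {
    return !targetFs.exists(new Path(bootstrapWriteIdDir, fileName));
  }
}
{code}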



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21063) Support statistics in cachedStore for transactional table

2018-12-20 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21063:
--

 Summary: Support statistics in cachedStore for transactional table
 Key: HIVE-21063
 URL: https://issues.apache.org/jira/browse/HIVE-21063
 Project: Hive
  Issue Type: Task
Reporter: mahesh kumar behera


Currently, statistics for transactional tables are not stored in the cached 
store due to consistency issues. Validation against valid write ids and 
generation of aggregate stats based on valid partitions need to be added. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21055) Replication to a target cluster with hive.strict.managed.tables enabled executing copy in serial mode

2018-12-18 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21055:
--

 Summary: Replication to a target cluster with 
hive.strict.managed.tables enabled executing copy in serial mode
 Key: HIVE-21055
 URL: https://issues.apache.org/jira/browse/HIVE-21055
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


For the REPL LOAD command, the user can specify the execution mode as part of 
the WITH clause. But the config for executing tasks in parallel or serially is 
not read from the command-specific config; it is read from the HiveServer2 
config. So even if the user asks for the tasks to run in parallel during REPL 
LOAD, the tasks are executed serially.
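
A minimal sketch of the intended lookup order (names are illustrative; the 
real code goes through HiveConf): prefer the override supplied in the REPL 
LOAD ... WITH clause, and only then fall back to the server config.

{code:java}
import java.util.Map;

public class ReplLoadConfig {
  // Resolve hive.exec.parallel from the command-level WITH-clause overrides
  // first, falling back to the server-level configuration.
  static boolean isParallel(Map<String, String> withClauseOverrides,
                            Map<String, String> serverConf) {
    String value = withClauseOverrides.getOrDefault("hive.exec.parallel",
        serverConf.getOrDefault("hive.exec.parallel", "false"));
    return Boolean.parseBoolean(value);
  }
}
{code}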



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21023) Add test for replication to a target with hive.strict.managed.tables enabled

2018-12-10 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-21023:
--

 Summary: Add test for replication to a target with 
hive.strict.managed.tables enabled
 Key: HIVE-21023
 URL: https://issues.apache.org/jira/browse/HIVE-21023
 Project: Hive
  Issue Type: Bug
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera


The tests added are timing out in the ptest run. These test cases need to be 
excluded from batching and run separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20966) Support incremental replication to a target cluster with hive.strict.managed.tables enabled.

2018-11-25 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-20966:
--

 Summary: Support incremental replication to a target cluster with 
hive.strict.managed.tables enabled.
 Key: HIVE-20966
 URL: https://issues.apache.org/jira/browse/HIVE-20966
 Project: Hive
  Issue Type: New Feature
  Components: repl
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: Sankar Hariappan


*Requirements:*
 - Support Hive replication with Hive2 as master and Hive3 as slave where 
hive.strict.managed.tables is enabled.
 - The non-ACID managed tables from Hive2 should be converted to appropriate 
ACID or MM tables or to an external table based on Hive3 table type rules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-20967) CLONE - REPL DUMP to dump the default warehouse directory of source.

2018-11-25 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-20967:
--

 Summary: CLONE - REPL DUMP to dump the default warehouse directory 
of source.
 Key: HIVE-20967
 URL: https://issues.apache.org/jira/browse/HIVE-20967
 Project: Hive
  Issue Type: Sub-task
  Components: repl
Affects Versions: 4.0.0
Reporter: mahesh kumar behera
Assignee: Sankar Hariappan


The default warehouse directory of the source is needed by the target to 
detect whether a DB or table location was set by the user or assigned by Hive. 
Using this information, REPL LOAD will decide whether to preserve the path or 
move the data to the default managed-table warehouse directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

