[jira] [Created] (HIVE-18608) ORC should allow selectively disabling dictionary-encoding on specified columns

2018-02-01 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-18608:
---

 Summary: ORC should allow selectively disabling 
dictionary-encoding on specified columns
 Key: HIVE-18608
 URL: https://issues.apache.org/jira/browse/HIVE-18608
 Project: Hive
  Issue Type: New Feature
  Components: ORC
Affects Versions: 3.0.0, 2.4.0, 2.2.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


Just as ORC allows the choice of columns to enable bloom-filters on, it would 
be nice to have a way to specify which columns {{DICTIONARY_V2}} encoding 
should be disabled on.

Currently, the choice of dictionary-encoding depends on the results of sampling 
the first row-stride within a stripe. If the user knows that a column's 
cardinality is bound to prevent an effective dictionary, she might choose to 
simply disable it on just that column, and avoid the cost of sampling in the 
first row-stride.
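
For instance (mirroring the existing bloom-filter syntax; the {{orc.column.encoding.direct}} property name below is a hypothetical 
illustration, not a committed name):

{code:sql}
-- Sketch: enable bloom-filters on `foo`, and disable dictionary-encoding
-- on `bar` only, leaving the remaining columns to the sampling heuristic.
CREATE TABLE my_table (foo STRING, bar STRING)
STORED AS ORC
TBLPROPERTIES (
  "orc.bloom.filter.columns" = "foo",
  "orc.column.encoding.direct" = "bar"
);
{code}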



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-18374) Update committer-list

2018-01-04 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-18374:
---

 Summary: Update committer-list
 Key: HIVE-18374
 URL: https://issues.apache.org/jira/browse/HIVE-18374
 Project: Hive
  Issue Type: Task
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Trivial


I'm afraid I need to make a trivial change to my organization affiliation:

{code:xml}
mithun
Mithun Radhakrishnan
<a href="https://oath.com/">Oath</a>
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-18199) Add unit-test for lost UGI doAs() context in RetryingMetaStoreClient

2017-12-01 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-18199:
---

 Summary: Add unit-test for lost UGI doAs() context in 
RetryingMetaStoreClient
 Key: HIVE-18199
 URL: https://issues.apache.org/jira/browse/HIVE-18199
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome


This has to do with HIVE-17853. The {{RetryingMetaStoreClient}} would lose the 
{{UGI.doAs()}} context in case of a socket timeout. Operations on the 
reconnected client might then fail, because they are attempted as the wrong 
user.

We'll need to add a unit-test to simulate this case, if possible.
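
A rough sketch of the shape such a test might take ({{simulateTimeoutAndReconnect()}} is hypothetical; the real test would need to 
force the client's socket to time out, so that {{RetryingMetaStoreClient}}'s 
reconnect path runs inside the {{doAs()}} block):

{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;
import org.junit.Assert;
import org.junit.Test;

public class TestRetryingClientUgiSketch {
  @Test
  public void testDoAsContextSurvivesReconnect() throws Exception {
    // The login-user (e.g. "oozie") impersonates "mithun".
    UserGroupInformation proxyUser = UserGroupInformation.createProxyUser(
        "mithun", UserGroupInformation.getLoginUser());
    proxyUser.doAs((PrivilegedExceptionAction<Void>) () -> {
      // simulateTimeoutAndReconnect(client); // hypothetical plumbing
      // After the reconnect, metastore calls should still be made as the
      // effective user, not the login-user:
      Assert.assertEquals("mithun",
          UserGroupInformation.getCurrentUser().getShortUserName());
      return null;
    });
  }
}
{code}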



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17949) itests compile is busted on branch-1.2

2017-10-31 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17949:
---

 Summary: itests compile is busted on branch-1.2
 Key: HIVE-17949
 URL: https://issues.apache.org/jira/browse/HIVE-17949
 Project: Hive
  Issue Type: Bug
  Components: Test
Affects Versions: 1.2.3
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


{{commit 18ddf46e0a8f092358725fc102235cbe6ba3e24d}} on {{branch-1.2}} was for 
{{Preparing for 1.2.3 development}}. This should have also included 
corresponding changes to all the pom-files under {{itests}}. As it stands now, 
the build fails with the following:

{noformat}
[ERROR]   location: class org.apache.hadoop.hive.metastore.api.Role
[ERROR] 
/Users/mithunr/workspace/dev/hive/apache/branch-1.2/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java:[512,19]
 no suitable method found for 
updatePartitionStatsFast(org.apache.hadoop.hive.metastore.api.Partition,org.apache.hadoop.hive.metastore.Warehouse)
[ERROR] method 
org.apache.hadoop.hive.metastore.MetaStoreUtils.updatePartitionStatsFast(org.apache.hadoop.hive.metastore.api.Partition,org.apache.hadoop.hive.metastore.Warehouse,org.apache.hadoop.hive.metastore.api.EnvironmentContext)
 is not applicable
[ERROR]   (actual and formal argument lists differ in length)
[ERROR] method 
org.apache.hadoop.hive.metastore.MetaStoreUtils.updatePartitionStatsFast(org.apache.hadoop.hive.metastore.api.Partition,org.apache.hadoop.hive.metastore.Warehouse,boolean,org.apache.hadoop.hive.metastore.api.EnvironmentContext)
 is not applicable
[ERROR]   (actual and formal argument lists differ in length)
[ERROR] method 
org.apache.hadoop.hive.metastore.MetaStoreUtils.updatePartitionStatsFast(org.apache.hadoop.hive.metastore.api.Partition,org.apache.hadoop.hive.metastore.Warehouse,boolean,boolean,org.apache.hadoop.hive.metastore.api.EnvironmentContext)
 is not applicable
[ERROR]   (actual and formal argument lists differ in length)
[ERROR] method 
org.apache.hadoop.hive.metastore.MetaStoreUtils.updatePartitionStatsFast(org.apache.hadoop.hive.metastore.partition.spec.PartitionSpecProxy.PartitionIterator,org.apache.hadoop.hive.metastore.Warehouse,boolean,boolean,org.apache.hadoop.hive.metastore.api.EnvironmentContext)
 is not applicable
[ERROR]   (actual and formal argument lists differ in length)
[ERROR] 
/Users/mithunr/workspace/dev/hive/apache/branch-1.2/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStoreWithEnvironmentContext.java:[181,45]
 incompatible types: org.apache.hadoop.hive.metastore.api.EnvironmentContext 
cannot be converted to boolean
[ERROR] 
/Users/mithunr/workspace/dev/hive/apache/branch-1.2/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStoreWithEnvironmentContext.java:[190,45]
 incompatible types: org.apache.hadoop.hive.metastore.api.EnvironmentContext 
cannot be converted to boolean
[ERROR] 
/Users/mithunr/workspace/dev/hive/apache/branch-1.2/itests/hive-unit/src/test/java/org/apache/hadoop/hive/thrift/TestZooKeeperTokenStore.java:[53,26]
 cannot find symbol
[ERROR]   symbol:   class MiniZooKeeperCluster
[ERROR]   location: class org.apache.hadoop.hive.thrift.TestZooKeeperTokenStore
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :hive-it-unit
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17940) IllegalArgumentException when reading last row-group in an ORC stripe

2017-10-30 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17940:
---

 Summary: IllegalArgumentException when reading last row-group in 
an ORC stripe
 Key: HIVE-17940
 URL: https://issues.apache.org/jira/browse/HIVE-17940
 Project: Hive
  Issue Type: Bug
  Components: ORC
Affects Versions: 1.2.2, 1.3.0
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome


(This is a backport of HIVE-10024 to {{branch-1.2}}, and {{branch-1}}.)

When the last row-group in an ORC stripe contains fewer records than specified 
in {{orc.row.index.stride}}, and a column's values are sparse (i.e. mostly 
nulls), then one sees the following failure when reading the ORC stripe:

{noformat}
 java.lang.IllegalArgumentException: Seek in Stream for column 82 kind DATA to 
130 is outside of the data
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: java.lang.IllegalArgumentException: Seek in Stream for 
column 82 kind DATA to 130 is outside of the data
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:322)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
... 14 more
{noformat}

[~sershe] had a fix for this in HIVE-10024, in {{branch-2}}. After running into 
this in production with {{branch-1}}+, we find that the fix for HIVE-10024 
sorts this out in {{branch-1}} as well.

This is a fairly rare case, but it leads to bad reads on valid ORC files. I 
will back-port this shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17853) RetryingMetaStoreClient loses UGI impersonation-context when reconnecting after timeout

2017-10-19 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17853:
---

 Summary: RetryingMetaStoreClient loses UGI impersonation-context 
when reconnecting after timeout
 Key: HIVE-17853
 URL: https://issues.apache.org/jira/browse/HIVE-17853
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 3.0.0, 2.4.0, 2.2.1
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome
Priority: Critical


The {{RetryingMetaStoreClient}} is used to automatically reconnect to the Hive 
metastore, after client timeout, transparently to the user.

In case of user impersonation (e.g. Oozie super-user {{oozie}} impersonating a 
Hadoop user {{mithun}}, to run a workflow), we find that a timeout-triggered 
reconnect causes the {{UGI.doAs()}} context to be lost. Any further 
metastore operations will be attempted as the login-user ({{oozie}}), as 
opposed to the effective user ({{mithun}}).

We should have a fix for this shortly.
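
For context, a minimal sketch of the impersonation pattern in play (standard Hadoop UGI API; the metastore calls themselves are elided):

{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class ImpersonationSketch {
  public static void main(String[] args) throws Exception {
    // Oozie-style impersonation: login-user "oozie" proxies as "mithun".
    UserGroupInformation ugi = UserGroupInformation.createProxyUser(
        "mithun", UserGroupInformation.getLoginUser());
    ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
      // Metastore calls here should run as "mithun". The bug: a socket
      // timeout triggers a reconnect that doesn't re-enter this doAs()
      // context, so subsequent calls authenticate as "oozie".
      return null;
    });
  }
}
{code}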



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17803) With Pig multi-query, 2 HCatStorers writing to the same table will trample each other's outputs

2017-10-13 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17803:
---

 Summary: With Pig multi-query, 2 HCatStorers writing to the same 
table will trample each other's outputs
 Key: HIVE-17803
 URL: https://issues.apache.org/jira/browse/HIVE-17803
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome


When Pig scripts use multi-query and {{HCatStorer}} with dynamic-partitioning, 
and use more than one {{HCatStorer}} instance to write to the same table, they 
might trample on each other's outputs. The failure looks as follows:

{noformat}
Caused by: org.apache.hive.hcatalog.common.HCatException : 2006 : Error adding 
partition to metastore. Cause : 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on /projects/foo/bar/activity_date=2016022306/_placeholder (inode 
2878224200): File does not exist. [Lease.  Holder: 
DFSClient_NONMAPREDUCE_-1281544466_4952, pendingcreates: 1]
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3429)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3517)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3484)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:791)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:537)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server.call(Server.java:2267)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217)

at 
org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.registerPartitions(FileOutputCommitterContainer.java:1022)
at 
org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.commitJob(FileOutputCommitterContainer.java:269)
... 20 more
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on /projects/foo/bar/activity_date=2016022306/_placeholder (inode 
2878224200): File does not exist. [Lease.  Holder: 
DFSClient_NONMAPREDUCE_-1281544466_4952, pendingcreates: 1]
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3429)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3517)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3484)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:791)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:537)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server.call(Server.java:2267)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217)

at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1394)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy11.complete(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:462)
at 

[jira] [Created] (HIVE-17802) Remove unnecessary calls to FileSystem.setOwner() from FileOutputCommitterContainer

2017-10-13 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17802:
---

 Summary: Remove unnecessary calls to FileSystem.setOwner() from 
FileOutputCommitterContainer
 Key: HIVE-17802
 URL: https://issues.apache.org/jira/browse/HIVE-17802
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome


For large Pig/HCat queries that produce a large number of 
partitions/directories/files, we have seen cases where the HDFS NameNode 
groaned under the weight of {{FileSystem.setOwner()}} calls, originating from 
the commit-step. This was the result of the following code in 
FileOutputCommitterContainer:
{code:java}
  private void applyGroupAndPerms(FileSystem fs, Path dir, FsPermission permission,
                                  List<AclEntry> acls, String group, boolean recursive)
      throws IOException {
    ...
    if (recursive) {
      for (FileStatus fileStatus : fs.listStatus(dir)) {
        if (fileStatus.isDir()) {
          applyGroupAndPerms(fs, fileStatus.getPath(), permission, acls, group, true);
        } else {
          fs.setPermission(fileStatus.getPath(), permission);
          chown(fs, fileStatus.getPath(), group);
        }
      }
    }
  }

  private void chown(FileSystem fs, Path file, String group) throws IOException {
    try {
      fs.setOwner(file, null, group);
    } catch (AccessControlException ignore) {
      // Some users have wrong table group, ignore it.
      LOG.warn("Failed to change group of partition directories/files: " + file, ignore);
    }
  }
{code}

One call per file/directory is far too many. We have a patch that reduces the 
namenode pressure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17794) HCatLoader breaks when a member is added to a struct-column of a table

2017-10-12 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17794:
---

 Summary: HCatLoader breaks when a member is added to a 
struct-column of a table
 Key: HIVE-17794
 URL: https://issues.apache.org/jira/browse/HIVE-17794
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


When a table's schema evolves to add a new member to a struct column, Hive 
queries work fine, but {{HCatLoader}} breaks with the following trace:

{noformat}
TaskAttempt 1 failed, info=[Error: Failure while running 
task:org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
while executing (Name: kite_composites_with_segments: Local 
Rearrange[tuple]{chararray}(false) - scope-555-> scope-974 Operator Key: 
scope-555): org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
Exception while executing (Name: gup: New For Each(false,false)[bag] - 
scope-548 Operator Key: scope-548): 
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while 
executing (Name: gup_filtered: Filter[bag] - scope-522 Operator Key: 
scope-522): org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
converting read value to tuple
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:287)
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POLocalRearrangeTez.getNextTuple(POLocalRearrangeTez.java:127)
at 
org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:376)
at 
org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:241)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
Exception while executing (Name: gup: New For Each(false,false)[bag] - 
scope-548 Operator Key: scope-548): 
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while 
executing (Name: gup_filtered: Filter[bag] - scope-522 Operator Key: 
scope-522): org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
converting read value to tuple
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:252)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
... 17 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
Exception while executing (Name: gup_filtered: Filter[bag] - scope-522 
Operator Key: scope-522): 
org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
converting read value to tuple
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNextTuple(POFilter.java:90)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
... 19 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
converting read value to tuple
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:160)

[jira] [Created] (HIVE-17791) Temp dirs under the staging directory should honour `inheritPerms`

2017-10-12 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17791:
---

 Summary: Temp dirs under the staging directory should honour 
`inheritPerms`
 Key: HIVE-17791
 URL: https://issues.apache.org/jira/browse/HIVE-17791
 Project: Hive
  Issue Type: Bug
  Components: Authorization
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome


For [~cdrome]:

CLI creates two levels of staging directories but calls setPermissions on the 
top-level directory only if {{hive.warehouse.subdir.inherit.perms=true}}.

The top-level directory, 
{{/user/cdrome/hive/words_text_dist/dt=c/.hive-staging_hive_2016-07-15_08-44-22_082_5534649671389063929-1}}
 is created the first time {{Context.getExternalTmpPath}} is called.

The child directory, 
{{/user/cdrome/hive/words_text_dist/dt=c/.hive-staging_hive_2016-07-15_08-44-22_082_5534649671389063929-1/_tmp.-ext-1}}
 is created when {{TezTask.execute}} is called at line 164:

{code:java}
DAG dag = build(jobConf, work, scratchDir, appJarLr, additionalLr, ctx);
{code}

This calls {{DagUtils.createVertex}}, which calls {{Utilities.createTmpDirs}}:

{code:java}
  private static void createTmpDirs(Configuration conf,
      List<Operator<? extends OperatorDesc>> ops) throws IOException {

    while (!ops.isEmpty()) {
      Operator<? extends OperatorDesc> op = ops.remove(0);

      if (op instanceof FileSinkOperator) {
        FileSinkDesc fdesc = ((FileSinkOperator) op).getConf();
        Path tempDir = fdesc.getDirName();

        if (tempDir != null) {
          Path tempPath = Utilities.toTempPath(tempDir);
          FileSystem fs = tempPath.getFileSystem(conf);
          fs.mkdirs(tempPath); // <-- HERE!
        }
      }

      if (op.getChildOperators() != null) {
        ops.addAll(op.getChildOperators());
      }
    }
  }
{code}

It turns out that {{inheritPerms}} is no longer part of {{master}}. I'll rebase 
this for {{branch-2}}, and {{branch-2.2}}. {{master}} will have to wait till 
the issues around {{StorageBasedAuthProvider}}, directory permissions, etc. are 
sorted out.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17784) Make Tez AM's Queue headroom calculation and nParallel tasks configurable.

2017-10-12 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17784:
---

 Summary: Make Tez AM's Queue headroom calculation and nParallel 
tasks configurable.
 Key: HIVE-17784
 URL: https://issues.apache.org/jira/browse/HIVE-17784
 Project: Hive
  Issue Type: Bug
  Components: Query Planning, Tez
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


Here are a couple of customizations we made at Yahoo to Hive's Tez AMs:
# When calculating splits, {{HiveSplitGenerator}} takes the entire queue's 
capacity as available, and generates splits accordingly. While this greedy 
algorithm might be acceptable for exclusive queues, on a shared queue, greedy 
queries will hold other queries up. The algorithm that calculates the queue's 
headroom should be pluggable. The greedy version can be the default.
# {{TEZ_AM_VERTEX_MAX_TASK_CONCURRENCY}} and the AM's heap-size can be tuned 
separately from the AM's container size. We found that users who attempt to 
increase vertex concurrency tend to forget to bump AM memory/container sizes. 
It would be handier if those values were derived from the container size.

I'm combining these into a single patch, for easier review.
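
To make the first item concrete, here's a hypothetical sketch of what a pluggable headroom policy could look like (illustrative only; not the actual 
patch):

{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical plug-point: the greedy behaviour becomes one implementation
// among several, selectable via configuration.
public interface HeadroomCalculator {
  /** @return memory (in MB) the AM should assume is available for tasks. */
  long availableHeadroomMb(Configuration conf, long queueCapacityMb,
      long queueUsedMb);
}

// The greedy default: assume the queue's entire capacity is ours.
class GreedyHeadroomCalculator implements HeadroomCalculator {
  @Override
  public long availableHeadroomMb(Configuration conf, long queueCapacityMb,
      long queueUsedMb) {
    return queueCapacityMb;
  }
}
{code}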



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17781) Map MR settings to Tez settings via DeprecatedKeys

2017-10-11 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17781:
---

 Summary: Map MR settings to Tez settings via DeprecatedKeys
 Key: HIVE-17781
 URL: https://issues.apache.org/jira/browse/HIVE-17781
 Project: Hive
  Issue Type: Bug
  Components: Configuration, Tez
Affects Versions: 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome


Here's one that [~cdrome] and [~thiruvel] worked on:

We found that certain Hadoop Map/Reduce settings that are set in site config 
files do not take effect in Hive jobs, because the Tez site configs do not 
contain the same settings.

In Yahoo's case, the problem was that, at the time, there was no mapping 
between {{MRJobConfig.COMPLETED_MAPS_FOR_REDUCE_SLOWSTART}} and 
{{TEZ_SHUFFLE_VERTEX_MANAGER_MAX_SRC_FRACTION}}. There were situations where 
significant capacity on production clusters was being used up doing nothing, 
while waiting for slow tasks to complete. This would have been avoided, had 
the mappings been in place.

Tez provides a {{DeprecatedKeys}} utility class, to help map MR settings to Tez 
settings. Hive should use this to ensure that the mappings are in sync.
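
For illustration, a minimal sketch, assuming Tez's {{MRHelpers.translateMRConfToTez()}} (which applies these {{DeprecatedKeys}}-style 
mappings) is the entry point; whether a given key is mapped depends on the 
Tez version:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.tez.mapreduce.hadoop.MRHelpers;

public class TranslateConfSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.95f);
    // Rewrites known MR keys to their Tez equivalents, in place.
    MRHelpers.translateMRConfToTez(conf);
    // Prints null if the slowstart -> max-src-fraction mapping is absent,
    // which is precisely the gap this JIRA describes.
    System.out.println(conf.get("tez.shuffle-vertex-manager.max-src-fraction"));
  }
}
{code}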



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17770) HCatalog documentation for Pig type-mapping incorrect for "bag" types

2017-10-10 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17770:
---

 Summary: HCatalog documentation for Pig type-mapping incorrect for 
"bag" types
 Key: HIVE-17770
 URL: https://issues.apache.org/jira/browse/HIVE-17770
 Project: Hive
  Issue Type: Bug
  Components: Documentation, HCatalog
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome
Priority: Minor


Raising on behalf of [~cdrome], to track a change in documentation.

The [HCatalog LoadStore type-mapping 
documentation|https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-ComplexTypes]
 mentions the following:

||Hive Type||Pig Type||
|map (key type should be string)|map|
|*List<_any type_>*|bag|
|struct|tuple|

We should change {{List<_any type_>}} to {{ARRAY<_any type_>}}, as per the 
description of Hive's complex types, in [the language 
manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17763) HCatLoader should fetch delegation tokens for partitions on remote HDFS

2017-10-10 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17763:
---

 Summary: HCatLoader should fetch delegation tokens for partitions 
on remote HDFS
 Key: HIVE-17763
 URL: https://issues.apache.org/jira/browse/HIVE-17763
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Security
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


The Hive metastore might store partition-info for data stored on a remote HDFS 
(i.e. different from what's defined by {{fs.default.name}}). {{HCatLoader}} 
should automatically fetch delegation-tokens for all remote HDFSes that 
participate in an HCat-based query.
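
The mechanics would be along these lines (a sketch using the standard Hadoop API, not the actual patch):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;

public class RemoteTokenSketch {
  // For each filesystem hosting a partition, fetch delegation tokens into
  // the job's Credentials before submission. getFileSystem() resolves
  // remote HDFS URIs too, not just the default filesystem.
  static void fetchTokens(Iterable<Path> partitionPaths, Configuration conf,
      Credentials creds, String renewer) throws Exception {
    for (Path p : partitionPaths) {
      FileSystem fs = p.getFileSystem(conf);
      fs.addDelegationTokens(renewer, creds);
    }
  }
}
{code}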



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17754) InputJobInfo in Pig UDFContext is heavyweight, and causes OOMs in Tez AMs

2017-10-10 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17754:
---

 Summary: InputJobInfo in Pig UDFContext is heavyweight, and causes 
OOMs in Tez AMs
 Key: HIVE-17754
 URL: https://issues.apache.org/jira/browse/HIVE-17754
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


HIVE-9845 dealt with reducing the size of HCat split-info, to improve 
job-launch times for Pig/HCat jobs.
For large Pig queries that scan a large number of Hive partitions, it was found 
that the Pig {{UDFContext}} stored full-fat HCat {{InputJobInfo}} objects, thus 
blowing out the Pig Tez AM. Since this information is already stored in the 
{{HCatSplit}}, the serialization of {{InputJobInfo}} can be spared.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17669) Cache to optimize SearchArgument deserialization

2017-10-02 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17669:
---

 Summary: Cache to optimize SearchArgument deserialization
 Key: HIVE-17669
 URL: https://issues.apache.org/jira/browse/HIVE-17669
 Project: Hive
  Issue Type: Improvement
  Components: ORC, Query Processor
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome


And another, from [~selinazh] and [~cdrome]. (YHIVE-927)

When a mapper needs to process multiple ORC files, it might end up using 
essentially the same {{SearchArgument}} across several files. It would be good 
not to have to deserialize it from its string form over and over again. 
Caching the deserialized object against the string form should speed things up.
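
A minimal sketch of the caching idea, using Guava (the {{deserialize()}} helper is a stand-in for however the {{SearchArgument}} is actually rebuilt 
from its string form):

{code:java}
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class SargCacheSketch {
  // Cache deserialized objects against their serialized string form, so a
  // mapper reading many ORC files with the same pushed-down predicate pays
  // the deserialization cost only once.
  private static final Cache<String, Object> SARG_CACHE =
      CacheBuilder.newBuilder().maximumSize(16).build();

  static Object getOrDeserialize(String serializedSarg) throws Exception {
    return SARG_CACHE.get(serializedSarg, () -> deserialize(serializedSarg));
  }

  private static Object deserialize(String s) {
    return s; // placeholder for the real string -> SearchArgument step
  }
}
{code}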



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17656) Hive settings are not passed to Orc/Avro SerDes, when used from HCatalog

2017-09-29 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17656:
---

 Summary: Hive settings are not passed to Orc/Avro SerDes, when 
used from HCatalog
 Key: HIVE-17656
 URL: https://issues.apache.org/jira/browse/HIVE-17656
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Serializers/Deserializers
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


We find that tables/partitions accessed through {{HCatLoader}}/{{HCatStorer}} 
use Hive settings hard-coded in {{HiveConf}}, rather than settings in the 
{{hive-site.xml}}. For instance, ORC files written through Pig/HCatStorer use 
default {{orc.stripe.size}} settings, rather than any overrides in 
{{hive-site.xml}}.

This has to do with the way the {{HiveStorageHandler}} is constructed in the 
HCat path. I'll upload the fix shortly. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17621) Hive-site settings are ignored during HCatInputFormat split-calculation

2017-09-27 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17621:
---

 Summary: Hive-site settings are ignored during HCatInputFormat 
split-calculation
 Key: HIVE-17621
 URL: https://issues.apache.org/jira/browse/HIVE-17621
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome


Another one that [~selinazh] and [~cdrome] worked on.

The production {{hive-site.xml}} could well contain settings that differ from 
the defaults in {{HiveConf.java}}. In our case, we introduced a custom ORC 
split-strategy, and made it the site-wide default.

We noticed that during {{HCatInputFormat::getSplits()}}, if the user-script did 
not contain the setting, the site-wide default was ignored in favour of the 
{{HiveConf}} default. HCat would not convey hive-site settings to the 
input-format (or anywhere downstream).

The forthcoming patch fixes this problem.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17612) Hive does not insert dynamic partition-sets atomically

2017-09-26 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17612:
---

 Summary: Hive does not insert dynamic partition-sets atomically
 Key: HIVE-17612
 URL: https://issues.apache.org/jira/browse/HIVE-17612
 Project: Hive
  Issue Type: Improvement
  Components: CLI, Hive
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


If one inserts partitions to a Hive table using a Hive query (e.g. {{INSERT 
OVERWRITE TABLE my_table PARTITION (foo, bar) SELECT * FROM another_table;}}), 
each dynamic partition is added separately, using {{HMSC.append_partition()}}. 
By contrast, Pig/HCatLoader does the same atomically, using 
{{HMSC.add_partitions()}}.

Because of this behaviour, Oozie workflows might kick off when the first 
partition is registered, but before the last partition in the set is available.

This was verified in the metastore-logs, with multiple {{ADD_PARTITION}} events 
fired for the same query (i.e. once per added partition), instead of a single 
event for the set.

It would be ideal for Hive to provide atomic partition-adds.
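
For illustration, the two shapes of this metastore interaction (a sketch against {{IMetaStoreClient}}; error-handling elided):

{code:java}
import java.util.List;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

public class AtomicAddSketch {
  // Non-atomic: one metastore round-trip (and one ADD_PARTITION event)
  // per partition, as Hive's dynamic-partition insert does today.
  static void addOneByOne(IMetaStoreClient client, List<Partition> parts)
      throws Exception {
    for (Partition p : parts) {
      client.add_partition(p);
    }
  }

  // Atomic: the whole set registered in a single call, as Pig/HCatLoader
  // does, producing one event for the set.
  static void addAtomically(IMetaStoreClient client, List<Partition> parts)
      throws Exception {
    client.add_partitions(parts);
  }
}
{code}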



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17609) Tool to manipulate delegation tokens

2017-09-26 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17609:
---

 Summary: Tool to manipulate delegation tokens
 Key: HIVE-17609
 URL: https://issues.apache.org/jira/browse/HIVE-17609
 Project: Hive
  Issue Type: Improvement
  Components: Metastore, Security
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


This was precipitated by OOZIE-2797. We had a case in production where the 
volume of active metastore delegation tokens (stored in ZooKeeper) outstripped 
the {{jute.maxBuffer}} size. Delegation tokens could neither be fetched nor 
cancelled.

The root-cause turned out to be a miscommunication, causing delegation tokens 
fetched by Oozie *not* to be cancelled automatically from HCat. This was sorted 
out as part of OOZIE-2797.

The issue exposed how poor the log-messages were in the code pertaining to 
token fetch/cancellation. We also found the need for a tool to 
query/list/purge delegation tokens that might already have expired. This patch 
introduces such a tool, and improves the log-messages.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17600) Make OrcFile's "enforceBufferSize" user-settable.

2017-09-25 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17600:
---

 Summary: Make OrcFile's "enforceBufferSize" user-settable.
 Key: HIVE-17600
 URL: https://issues.apache.org/jira/browse/HIVE-17600
 Project: Hive
  Issue Type: Improvement
  Components: ORC
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


This is a duplicate of ORC-238, but it applies to {{branch-2.2}}.

Compression buffer-sizes in {{OrcFile}} are computed at runtime, except when 
{{enforceBufferSize}} is set. The only snag is that this flag can't be set by 
the user. When the runtime-computed buffer-sizes are not optimal (for some 
reason), the user has no way to work around it by setting a custom value.

I have a patch that we use at Yahoo.
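
Assuming the API shape that ORC-238 proposes (an {{enforceBufferSize()}} toggle on {{WriterOptions}}), usage might look like this sketch:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class EnforceBufferSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Take the 64KB compression-buffer as given, rather than letting the
    // writer second-guess it at runtime.
    Writer writer = OrcFile.createWriter(new Path("/tmp/sketch.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(TypeDescription.fromString("struct<x:int>"))
            .bufferSize(64 * 1024)
            .enforceBufferSize());
    writer.close();
  }
}
{code}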



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17576) Improve progress-reporting in TezProcessor

2017-09-21 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17576:
---

 Summary: Improve progress-reporting in TezProcessor
 Key: HIVE-17576
 URL: https://issues.apache.org/jira/browse/HIVE-17576
 Project: Hive
  Issue Type: Bug
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome


Another one on behalf of [~selinazh] and [~cdrome]. Following the example in 
[Apache Tez's 
{{MapProcessor}}|https://github.com/apache/tez/blob/247719d7314232f680f028f4e1a19370ffb7b1bb/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/map/MapProcessor.java#L88],
 {{TezProcessor}} ought to use {{ProgressHelper}} to report progress for a Tez 
task. As per [~kshukla]'s advice,

{quote}
Tez... provides {{getProgress()}} API for {{AbstractLogicalInput(s)}} which 
will give the correct progress value for a given Input. The TezProcessor(s) in 
Hive should use this to do something similar to what MapProcessor in Tez does 
today, which is use/override ProgressHelper to get the input progress and then 
set the progress on the processorContext.
...
The default behavior of the ProgressHelper class sets the processor progress to 
be the average of progress values from all inputs.
{quote}

This code is -whacked from- *inspired by* {{MapProcessor}}'s use of 
{{ProgressHelper}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17574) Avoid multiple copies of HDFS-based jars when localizing job-jars

2017-09-21 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17574:
---

 Summary: Avoid multiple copies of HDFS-based jars when localizing 
job-jars
 Key: HIVE-17574
 URL: https://issues.apache.org/jira/browse/HIVE-17574
 Project: Hive
  Issue Type: Bug
Affects Versions: 2.2.0, 3.0.0, 2.4.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


Raising this on behalf of [~selinazh]. (For my own reference: YHIVE-1035.)

This has to do with the classpaths of Hive actions run from Oozie, and affects 
scripts that add jars/resources from HDFS locations.

As part of Oozie's "sharelib" deploys, foundation jars (such as Hive jars) tend 
to be stored in HDFS paths, as are any custom user-libraries used in workflows. 
An {{ADD JAR|FILE|ARCHIVE}} statement in a Hive script causes the following 
steps to occur:
# Files are downloaded from HDFS to local temp dir.
# UDFs are resolved/validated.
# All jars/files, including those just downloaded from HDFS, are shipped right 
back to HDFS-based scratch-directories, for job submission.

This is wasteful and time-consuming. #3 above should skip shipping HDFS-based 
resources, and add those directly to the Tez session.
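
The gist of the fix is a scheme check before the copy-back in step #3; a minimal illustrative sketch:

{code:java}
import java.net.URI;

public class SkipLocalizeSketch {
  // Resources that already live on HDFS needn't round-trip through the
  // local temp-dir and back to the scratch-dir; they can be registered
  // with the Tez session directly.
  static boolean isAlreadyOnHdfs(URI resource) {
    return "hdfs".equalsIgnoreCase(resource.getScheme());
  }
}
{code}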

We have a patch that's being used internally at Yahoo.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17489) Separate client-facing and server-side Kerberos principals, to support HA

2017-09-08 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17489:
---

 Summary: Separate client-facing and server-side Kerberos 
principals, to support HA
 Key: HIVE-17489
 URL: https://issues.apache.org/jira/browse/HIVE-17489
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Reporter: Mithun Radhakrishnan
Assignee: Thiruvel Thirumoolan


On deployments of the Hive metastore where a farm of servers is fronted by a 
VIP, the hostname of the VIP (e.g. {{mycluster-hcat.blue.myth.net}}) will 
differ from the actual boxen in the farm (e.g. 
{{mycluster-hcat-\[0..3\].blue.myth.net}}).

Such a deployment messes up Kerberos auth, with principals like 
{{hcat/mycluster-hcat.blue.myth.net@grid.myth.net}}. Host-based checks will 
disallow servers behind the VIP from using the VIP's hostname in their 
principals when accessing, say, HDFS.

The solution would be to decouple the server-side principal (used to access 
other services like HDFS as a client) from the client-facing principal (used 
from Hive-client, BeeLine, etc.).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17472) Drop-partition for multi-level partition fails, if data does not exist.

2017-09-06 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17472:
---

 Summary: Drop-partition for multi-level partition fails, if data 
does not exist.
 Key: HIVE-17472
 URL: https://issues.apache.org/jira/browse/HIVE-17472
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome


Raising this on behalf of [~cdrome] and [~selinazh]. 

Here's how to reproduce the problem:

{code:sql}
CREATE TABLE foobar ( foo STRING, bar STRING ) PARTITIONED BY ( dt STRING, 
region STRING ) STORED AS RCFILE LOCATION '/tmp/foobar';

ALTER TABLE foobar ADD PARTITION ( dt='1', region='A' ) ;

dfs -rm -R -skipTrash /tmp/foobar/dt=1;

ALTER TABLE foobar DROP PARTITION ( dt='1' );
{code}

This causes a client-side error as follows:
{code}
15/02/26 23:08:32 ERROR exec.DDLTask: 
org.apache.hadoop.hive.ql.metadata.HiveException: Unknown error. Please check 
logs.
{code}





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17467) HCatClient APIs for discovering partition key-values

2017-09-06 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17467:
---

 Summary: HCatClient APIs for discovering partition key-values
 Key: HIVE-17467
 URL: https://issues.apache.org/jira/browse/HIVE-17467
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog, Metastore
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


This is a followup to HIVE-17466, which adds the {{HiveMetaStore}} level call 
to retrieve unique combinations of part-key values that satisfy a specified 
predicate.

Attached herewith are the {{HCatClient}} APIs that will be used by Apache 
Oozie, before launching workflows.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17466) Metastore API to list unique partition-key-value combinations

2017-09-06 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17466:
---

 Summary: Metastore API to list unique partition-key-value 
combinations
 Key: HIVE-17466
 URL: https://issues.apache.org/jira/browse/HIVE-17466
 Project: Hive
  Issue Type: New Feature
  Components: Metastore
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Thiruvel Thirumoolan


Raising this on behalf of [~thiruvel], who wrote this initially as part of a 
tangential "data-discovery" system.

Programs like Apache Oozie, Apache Falcon (or Yahoo GDM), etc. launch workflows 
based on the availability of table/partitions. Partitions are currently 
discovered by listing partitions using (what boils down to) 
{{HiveMetaStoreClient.listPartitions()}}. This can be slow and cumbersome, 
given that {{Partition}} objects are heavyweight and carry redundant 
information. The alternative is to use partition-names, which will need 
client-side parsing to extract part-key values.

When checking which hourly partitions for a particular day have been published 
already, it would be preferable to have an API that pushed down part-key 
extraction into the {{RawStore}} layer, and returned key-values as the result. 
This would be similar to how {{SELECT DISTINCT part_key FROM my_table;}} would 
run, but at the {{HiveMetaStoreClient}} level.

Here's what we've been using at Yahoo.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17273) MergeFileTask needs to be interruptible

2017-08-08 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17273:
---

 Summary: MergeFileTask needs to be interruptible
 Key: HIVE-17273
 URL: https://issues.apache.org/jira/browse/HIVE-17273
 Project: Hive
  Issue Type: Bug
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


This is an extension to the work done in HIVE-16820 (which made {{TezTask}} 
exit correctly when the job is cancelled.)

If a Hive job involves a {{MergeFileTask}} (say {{ALTER TABLE ... PARTITION ... 
CONCATENATE}}), and is cancelled *after* the merge-task has kicked off, then 
the merge-task might not be cancelled, and might run through to completion.

The code should check if the merge-job has already been scheduled, and cancel 
it if required.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17233) Set "mapred.input.dir.recursive" for HCatInputFormat-based jobs.

2017-08-02 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17233:
---

 Summary: Set "mapred.input.dir.recursive" for 
HCatInputFormat-based jobs.
 Key: HIVE-17233
 URL: https://issues.apache.org/jira/browse/HIVE-17233
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 2.2.0, 3.0.0
Reporter: Mithun Radhakrishnan


This has to do with {{HIVE-15575}}. {{TezCompiler}} seems to set 
{{mapred.input.dir.recursive}} to {{true}}. This is acceptable for Hive jobs, 
since it allows Hive to consume its peculiar {{UNION ALL}} output, where the 
output of each relation is stored in a separate sub-directory of the output-dir.

For such output to be readable through HCatalog (via Pig/HCatLoader), 
{{mapred.input.dir.recursive}} should be set from {{HCatInputFormat}} as well. 
Otherwise, one gets zero records for that input.
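
Until then, a job-side workaround sketch (standard MR/HCat APIs; the database/table names are placeholders):

{code:java}
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class RecursiveInputSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // What HCatInputFormat ought to set itself: let the input-format
    // descend into UNION ALL's per-relation sub-directories.
    job.getConfiguration().setBoolean("mapred.input.dir.recursive", true);
    HCatInputFormat.setInput(job, "my_db", "my_table");
  }
}
{code}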



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17218) Canonical-ize hostnames for Hive metastore, and HS2 servers.

2017-07-31 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17218:
---

 Summary: Canonical-ize hostnames for Hive metastore, and HS2 
servers.
 Key: HIVE-17218
 URL: https://issues.apache.org/jira/browse/HIVE-17218
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2, Metastore, Security
Affects Versions: 2.2.0, 1.2.2, 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


Currently, the {{HiveMetastoreClient}} and {{HiveConnection}} do not 
canonical-ize the hostnames of the metastore/HS2 servers. In deployments where 
there are multiple such servers behind a VIP, this causes a number of 
inconveniences:
# The client-side configuration (e.g. {{hive.metastore.uris}} in 
{{hive-site.xml}}) needs to specify the VIP's hostname, and cannot use a 
simplified CNAME, in the thrift URL. If the 
{{hive.metastore.kerberos.principal}} is specified using {{_HOST}}, one sees 
GSS failures as follows:
{noformat}
hive --hiveconf hive.metastore.kerberos.principal=hive/_HOST@grid.myth.net \
     --hiveconf hive.metastore.uris="thrift://simplified-hcat-cname.grid.myth.net:56789"
...
Exception in thread "main" java.lang.RuntimeException: 
java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:542)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
...
{noformat}
This is because {{_HOST}} is filled in with the CNAME, and not the 
canonicalized name.
# Oozie workflows that use HCat {{credentials}} always have to use the VIP 
hostname, and can't use {{_HOST}}-based service principals, if the CNAME 
differs from the VIP name.

If the client-code simply canonical-ized the hostnames, it would enable the 
use of both simplified CNAMEs and {{_HOST}}-based service principals.
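
The client-side change amounts to something like the following (plain JDK resolution; the hostnames are the examples from above):

{code:java}
import java.net.InetAddress;

public class CanonicalizeSketch {
  // Resolve a simplified CNAME to the canonical hostname, so that a
  // _HOST-based principal is filled in consistently on both sides.
  static String canonicalize(String host) throws Exception {
    return InetAddress.getByName(host).getCanonicalHostName();
  }

  public static void main(String[] args) throws Exception {
    // e.g. "simplified-hcat-cname.grid.myth.net" might resolve to
    //      "mycluster-hcat.blue.myth.net"
    System.out.println(canonicalize("simplified-hcat-cname.grid.myth.net"));
  }
}
{code}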



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17201) (Temporarily) Disable failing tests in TestHCatClient

2017-07-28 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17201:
---

 Summary: (Temporarily) Disable failing tests in TestHCatClient
 Key: HIVE-17201
 URL: https://issues.apache.org/jira/browse/HIVE-17201
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Tests
Affects Versions: 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


This is with regard to the recent test-failures in {{TestHCatClient}}. 

While [~sbeeram] and I joust over the best way to rephrase the failing tests 
(in HIVE-16908), perhaps it's best that we temporarily disable the following 
failing tests:
{noformat}
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema
 (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema
 (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation 
(batchId=177)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17188) ObjectStore runs out of memory for large batches of addPartitions().

2017-07-27 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17188:
---

 Summary: ObjectStore runs out of memory for large batches of 
addPartitions().
 Key: HIVE-17188
 URL: https://issues.apache.org/jira/browse/HIVE-17188
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 2.2.0
Reporter: Mithun Radhakrishnan
Assignee: Chris Drome


For large batches (e.g. hundreds) of {{addPartitions()}}, the {{ObjectStore}} 
runs out of memory. Flushing the {{PersistenceManager}} alleviates the problem.
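
A sketch of the flushing pattern (plain JDO; the batch size of 100 is illustrative):

{code:java}
import java.util.List;
import javax.jdo.PersistenceManager;

public class BatchFlushSketch {
  // Flush periodically so DataNucleus doesn't accumulate hundreds of dirty
  // partition objects in memory within a single transaction.
  static void persistInBatches(PersistenceManager pm, List<?> partitions) {
    int count = 0;
    for (Object partition : partitions) {
      pm.makePersistent(partition);
      if (++count % 100 == 0) {
        pm.flush();
      }
    }
  }
}
{code}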

(Raising this on behalf of [~cdrome] and [~thiruvel].)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17181) HCatOutputFormat should expose complete output-schema (including partition-keys) for dynamic-partitioning MR jobs

2017-07-26 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17181:
---

 Summary: HCatOutputFormat should expose complete output-schema 
(including partition-keys) for dynamic-partitioning MR jobs
 Key: HIVE-17181
 URL: https://issues.apache.org/jira/browse/HIVE-17181
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


Map/Reduce jobs that use HCatalog APIs to write to Hive tables using Dynamic 
partitioning are expected to call the following API methods:
# {{HCatOutputFormat.setOutput()}} to indicate which table/partitions to write 
to. This call populates the {{OutputJobInfo}} with details fetched from the 
Metastore.
# {{HCatOutputFormat.setSchema()}} to indicate the output-schema for the data 
being written.

It is a common mistake to invoke {{HCatOutputFormat.setSchema()}} as follows:
{code:java}
HCatOutputFormat.setSchema(conf, HCatOutputFormat.getTableSchema(conf));
{code}

Unfortunately, {{getTableSchema()}} returns only the record-schema, not the 
entire table's schema. We'll need a better API for use in M/R jobs to get the 
complete table-schema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-17169) Avoid call to KeyProvider::getMetadata()

2017-07-25 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-17169:
---

 Summary: Avoid call to KeyProvider::getMetadata()
 Key: HIVE-17169
 URL: https://issues.apache.org/jira/browse/HIVE-17169
 Project: Hive
  Issue Type: Bug
  Components: Shims
Affects Versions: 3.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


Here's the code from {{Hadoop23Shims}}:

{code:title=Hadoop23Shims.java|borderStyle=solid}
  @Override
  public int comparePathKeyStrength(Path path1, Path path2) throws IOException {
    EncryptionZone zone1, zone2;

    zone1 = hdfsAdmin.getEncryptionZoneForPath(path1);
    zone2 = hdfsAdmin.getEncryptionZoneForPath(path2);

    if (zone1 == null && zone2 == null) {
      return 0;
    } else if (zone1 == null) {
      return -1;
    } else if (zone2 == null) {
      return 1;
    }

    return compareKeyStrength(zone1.getKeyName(), zone2.getKeyName());
  }

  private int compareKeyStrength(String keyname1, String keyname2) throws IOException {
    KeyProvider.Metadata meta1, meta2;

    if (keyProvider == null) {
      throw new IOException("HDFS security key provider is not configured on your server.");
    }

    meta1 = keyProvider.getMetadata(keyname1);
    meta2 = keyProvider.getMetadata(keyname2);

    if (meta1.getBitLength() < meta2.getBitLength()) {
      return -1;
    } else if (meta1.getBitLength() == meta2.getBitLength()) {
      return 0;
    } else {
      return 1;
    }
  }
{code}

It turns out that {{EncryptionZone}} already has the cipher's bit-length stored 
in a member variable. One shouldn't need an additional name-node call 
({{KeyProvider::getMetadata()}}) only to fetch it again.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HIVE-15575) ALTER TABLE CONCATENATE and hive.merge.tezfiles seems busted for UNION ALL output

2017-01-10 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-15575:
---

 Summary: ALTER TABLE CONCATENATE and hive.merge.tezfiles seems 
busted for UNION ALL output
 Key: HIVE-15575
 URL: https://issues.apache.org/jira/browse/HIVE-15575
 Project: Hive
  Issue Type: Bug
Reporter: Mithun Radhakrishnan
Priority: Critical


Hive {{UNION ALL}} produces data in sub-directories under the table/partition 
directories. E.g.

{noformat}
hive (mythdb_hadooppf_17544)> create table source ( foo string, bar string, goo 
string ) stored as textfile;
OK
Time taken: 0.322 seconds
hive (mythdb_hadooppf_17544)> create table results_partitioned( foo string, bar 
string, goo string ) partitioned by ( dt string ) stored as orcfile;
OK
Time taken: 0.322 seconds
hive (mythdb_hadooppf_17544)> set hive.merge.tezfiles=false; insert overwrite 
table results_partitioned partition( dt ) select 'goo', 'bar', 'foo', '1' from 
source UNION ALL select 'go', 'far', 'moo', '1' from source;
...
Loading data to table mythdb_hadooppf_17544.results_partitioned partition 
(dt=null)
 Time taken for load dynamic partitions : 311
Loading partition {dt=1}
 Time taken for adding to write entity : 3
OK
Time taken: 27.659 seconds
hive (mythdb_hadooppf_17544)> dfs -ls -R 
/tmp/mythdb_hadooppf_17544/results_partitioned;
drwxrwxrwt   - dfsload hdfs  0 2017-01-10 23:13 
/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
drwxrwxrwt   - dfsload hdfs  0 2017-01-10 23:13 
/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1
-rwxrwxrwt   3 dfsload hdfs349 2017-01-10 23:13 
/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1/00_0
drwxrwxrwt   - dfsload hdfs  0 2017-01-10 23:13 
/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2
-rwxrwxrwt   3 dfsload hdfs368 2017-01-10 23:13 
/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2/00_0
{noformat}

These results can only be read if {{mapred.input.dir.recursive=true}}, which 
{{TezCompiler::init()}} seems to set. But the Hadoop default for this is 
{{false}}. This leads to the following errors:
1. Running {{CONCATENATE}} on the partition causes data-loss.
{noformat}
hive --database mythdb_hadooppf_17544 -e " set mapred.input.dir.recursive; 
alter table results_partitioned partition ( dt='1' ) concatenate ; set 
mapred.input.dir.recursive; "
...
OK
Time taken: 2.151 seconds
mapred.input.dir.recursive=false


Status: Running (Executing on YARN cluster with App id 
application_1481756273279_5088754)


VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

File Merge SUCCEEDED  0  0  0  0  0  0

VERTICES: 01/01  [>>--] 0%ELAPSED TIME: 0.35 s

Loading data to table mythdb_hadooppf_17544.results_partitioned partition (dt=1)
Moved: 
'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/1'
 to trash at: 
hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
Moved: 
'hdfs://cluster-nn1.mygrid.myth.net:8020/tmp/mythdb_hadooppf_17544/results_partitioned/dt=1/2'
 to trash at: 
hdfs://cluster-nn1.mygrid.myth.net:8020/user/dfsload/.Trash/Current
OK
Time taken: 25.873 seconds

$ hdfs dfs -count -h /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
           1            0                  0 /tmp/mythdb_hadooppf_17544/results_partitioned/dt=1
{noformat}

2. hive.merge.tezfiles is busted, because the merge-task attempts to merge 
files across {{results_partitioned/dt=1/1}} and {{results_partitioned/dt=1/2}}:
{noformat}
$ hive --database mythdb_hadooppf_17544 -e " set hive.merge.tezfiles=true; 
insert overwrite table results_partitioned partition( dt ) select 'goo', 'bar', 
'foo', '1' from source UNION ALL select 'go', 'far', 'moo', '1' from source; "
...
Query ID = dfsload_20170110233558_51289333-d9da-4851-8671-bfe653d26e45
Total jobs = 3
Launching Job 1 out of 3


Status: Running (Executing on YARN cluster with App id 
application_1481756273279_5089989)


VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1 ..   SUCCEEDED  1  1  0  0  0  0
Map 3 ..   SUCCEEDED  1  1  0  0  0  0

VERTICES: 02/02  [==>>] 

[jira] [Created] (HIVE-15491) Failures are masked/swallowed in GenericUDTFJSONTuple::process().

2016-12-21 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-15491:
---

 Summary: Failures are masked/swallowed in 
GenericUDTFJSONTuple::process().
 Key: HIVE-15491
 URL: https://issues.apache.org/jira/browse/HIVE-15491
 Project: Hive
  Issue Type: Bug
Reporter: Mithun Radhakrishnan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-14794) HCatalog support to pre-fetch for Avro tables that use avro.schema.url.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-14794:
---

 Summary: HCatalog support to pre-fetch for Avro tables that use 
avro.schema.url.
 Key: HIVE-14794
 URL: https://issues.apache.org/jira/browse/HIVE-14794
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 2.1.0, 1.2.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


HIVE-14792 introduces support to modify and add properties to table-parameters 
during query-planning. It prefetches remote Avro-schema information and stores 
it in TBLPROPERTIES, under {{avro.schema.literal}}.

We'll need similar support in {{HCatLoader}} to prevent excessive reads of 
schema-files in Pig queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-14792) AvroSerde reads the remote schema-file at least once per mapper, per table reference.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-14792:
---

 Summary: AvroSerde reads the remote schema-file at least once per 
mapper, per table reference.
 Key: HIVE-14792
 URL: https://issues.apache.org/jira/browse/HIVE-14792
 Project: Hive
  Issue Type: Bug
Affects Versions: 2.1.0, 1.2.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


Avro tables that use "external" schema files stored on HDFS can cause excessive 
calls to {{FileSystem::open()}}, especially for queries that spawn large 
numbers of mappers.

This is because of the following code in {{AvroSerDe::initialize()}}:

{code:title=AvroSerDe.java|borderStyle=solid}
public void initialize(Configuration configuration, Properties properties)
    throws SerDeException {
  // ...
  if (hasExternalSchema(properties)
      || columnNameProperty == null || columnNameProperty.isEmpty()
      || columnTypeProperty == null || columnTypeProperty.isEmpty()) {
    schema = determineSchemaOrReturnErrorSchema(configuration, properties);
  } else {
    // Get column names and sort order
    columnNames = Arrays.asList(columnNameProperty.split(","));
    columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);

    schema = getSchemaFromCols(properties, columnNames, columnTypes,
        columnCommentProperty);
    properties.setProperty(
        AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(),
        schema.toString());
  }
  // ...
}
{code}

For files using {{avro.schema.url}}, every time the SerDe is initialized (i.e. 
at least once per mapper), the schema file is read remotely. For queries with 
thousands of mappers, this leads to a stampede to the handful (3?) datanodes 
that host the schema-file. In the best case, this causes slowdowns.

It would be preferable to distribute the Avro-schema to all mappers as part of 
the job-conf. The alternatives aren't exactly appealing:
# One can't rely solely on the {{column.list.types}} stored in the Hive 
metastore. (HIVE-14789).
# {{avro.schema.literal}} might not always be usable, because of the size-limit 
on table-parameters. The typical size of the Avro-schema file is between 
0.5-3MB, in my limited experience. Bumping the max table-parameter size isn't a 
great solution.

If the {{avro.schema.file}} were read during query-planning, and made available 
as part of table-properties (but not serialized into the metastore), the 
downstream logic will remain largely intact. I have a patch that does this.
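
To illustrate the direction (a rough sketch, not the actual patch; the method name and its placement are assumptions), the planner-side prefetch would look something like:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Hedged sketch: resolve avro.schema.url once at query-planning time, and
// carry the schema text in the job's table-properties instead.
public static void pinAvroSchema(Configuration conf, Properties tblProps) throws IOException {
  String schemaUrl = tblProps.getProperty("avro.schema.url");
  if (schemaUrl == null || tblProps.getProperty("avro.schema.literal") != null) {
    return; // nothing to fetch, or a literal is already present
  }
  Path schemaPath = new Path(schemaUrl);
  try (InputStream in = schemaPath.getFileSystem(conf).open(schemaPath)) {
    // One remote read here, instead of at least one per mapper at SerDe-init time.
    tblProps.setProperty("avro.schema.literal", new Schema.Parser().parse(in).toString());
  }
}
{code}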





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-14789) Avro Table-reads bork when using SerDe-generated table-schema.

2016-09-19 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-14789:
---

 Summary: Avro Table-reads bork when using SerDe-generated 
table-schema.
 Key: HIVE-14789
 URL: https://issues.apache.org/jira/browse/HIVE-14789
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 2.0.1, 1.2.1
Reporter: Mithun Radhakrishnan


AvroSerDe allows one to skip the table-columns in a table-definition when 
creating a table, as long as the TBLPROPERTIES includes a valid 
{{avro.schema.url}} or {{avro.schema.literal}}. The table-columns are inferred 
from processing the Avro schema file/literal.

The problem is that the inferred schema might not be congruent with the actual 
schema in the Avro schema file/literal. Consider the following table definition:

{code:sql}
CREATE TABLE avro_schema_break_1
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "Messages",
  "namespace": "net.myth",
  "fields": [
{
  "name": "header",
  "type": [
"null",
{
  "type": "record",
  "name": "HeaderInfo",
  "fields": [
{
  "name": "inferred_event_type",
  "type": [
"null",
"string"
  ],
  "default": null
},
{
  "name": "event_type",
  "type": [
"null",
"string"
  ],
  "default": null
},
{
  "name": "event_version",
  "type": [
"null",
"string"
  ],
  "default": null
}
  ]
}
  ]
},
{
  "name": "messages",
  "type": {
"type": "array",
"items": {
  "name": "MessageInfo",
  "type": "record",
  "fields": [
{
  "name": "message_id",
  "type": [
"null",
"string"
  ],
  "doc": "Message-ID"
},
{
  "name": "received_date",
  "type": [
"null",
"long"
  ],
  "doc": "Received Date"
},
{
  "name": "sent_date",
  "type": [
"null",
"long"
  ]
},
{
  "name": "from_name",
  "type": [
"null",
"string"
  ]
},
{
  "name": "flags",
  "type": [
"null",
{
  "type": "record",
  "name": "Flags",
  "fields": [
{
  "name": "is_seen",
  "type": [
"null",
"boolean"
  ],
  "default": null
},
{
  "name": "is_read",
  "type": [
"null",
"boolean"
  ],
  "default": null
},
{
  "name": "is_flagged",
  "type": [
"null",
"boolean"
  ],
  "default": null
}
  ]
}
  ],
  "default": null
}
  ]
}
  }
}
  ]
}');
{code}

This produces a table with the following schema:
{noformat}
2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main] 
hive.log: DDL: struct avro_schema_break_1 { 
struct<inferred_event_type:string,event_type:string,event_version:string> header, 
list<struct<message_id:string,received_date:bigint,sent_date:bigint,from_name:string,flags:struct<is_seen:boolean,is_read:boolean,is_flagged:boolean>>> messages}
{noformat}

Data written to this table with Pig's {{AvroStorage}} (using the Avro schema 
from {{avro.schema.literal}}) cannot be read back in Hive with the generated 
table-schema. This is the exception one sees:

{noformat}
java.io.IOException: org.apache.avro.AvroTypeException: Found 
net.myth.HeaderInfo, expecting union
  at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
  at 
org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
  at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
  at 

[jira] [Created] (HIVE-14380) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-07-28 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-14380:
---

 Summary: Queries on tables with remote HDFS paths fail in 
"encryption" checks.
 Key: HIVE-14380
 URL: https://issues.apache.org/jira/browse/HIVE-14380
 Project: Hive
  Issue Type: Bug
  Components: Encryption
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


If a table has table/partition locations set to remote HDFS paths, querying 
them will cause the following IAException:

{noformat}
2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
(SemanticAnalyzer.java:getMetaData(1867)) - 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to deter
mine if hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
hdfs://bar.ygrid.yahoo.com:8020
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
...
{noformat}

This is because of the following code in {{SessionState}}:
{code:title=SessionState.java|borderStyle=solid}
public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws HiveException {
  if (hdfsEncryptionShim == null) {
    try {
      FileSystem fs = FileSystem.get(sessionConf);
      if ("hdfs".equals(fs.getUri().getScheme())) {
        hdfsEncryptionShim =
            ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
      } else {
        LOG.debug("Could not get hdfsEncryptionShim, it is only applicable to hdfs filesystem.");
      }
    } catch (Exception e) {
      throw new HiveException(e);
    }
  }

  return hdfsEncryptionShim;
}
{code}

When the {{FileSystem}} instance is created, using the {{sessionConf}} implies 
that the current HDFS is going to be used. This call should instead fetch the 
{{FileSystem}} instance corresponding to the path being checked.

A fix is forthcoming...
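
The direction of the fix, roughly (a hedged sketch, not the committed patch):

{code:java}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: resolve the shim against the filesystem that owns the path
// being checked, instead of the session's default filesystem.
public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim(Path path) throws HiveException {
  try {
    FileSystem fs = path.getFileSystem(sessionConf); // resolves remote HDFS URIs correctly
    if ("hdfs".equals(fs.getUri().getScheme())) {
      return ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
    }
    LOG.debug("Could not get hdfsEncryptionShim, it is only applicable to hdfs filesystem.");
    return null;
  } catch (Exception e) {
    throw new HiveException(e);
  }
}
{code}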



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-14379) Queries on tables with remote HDFS paths fail in "encryption" checks.

2016-07-28 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-14379:
---

 Summary: Queries on tables with remote HDFS paths fail in 
"encryption" checks.
 Key: HIVE-14379
 URL: https://issues.apache.org/jira/browse/HIVE-14379
 Project: Hive
  Issue Type: Bug
  Components: Encryption
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


If a table has table/partition locations set to remote HDFS paths, querying 
them will cause the following IAException:

{noformat}
2016-07-26 01:16:27,471 ERROR parse.CalcitePlanner 
(SemanticAnalyzer.java:getMetaData(1867)) - 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to deter
mine if hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table is encrypted: 
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://foo.ygrid.yahoo.com:8020/projects/my_db/my_table, expected: 
hdfs://bar.ygrid.yahoo.com:8020
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPathEncrypted(SemanticAnalyzer.java:2204)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getStrongestEncryptedTablePath(SemanticAnalyzer.java:2274)
...
{noformat}

This is because of the following code in {{SessionState}}:
{code:title=SessionState.java|borderStyle=solid}
public HadoopShims.HdfsEncryptionShim getHdfsEncryptionShim() throws HiveException {
  if (hdfsEncryptionShim == null) {
    try {
      FileSystem fs = FileSystem.get(sessionConf);
      if ("hdfs".equals(fs.getUri().getScheme())) {
        hdfsEncryptionShim =
            ShimLoader.getHadoopShims().createHdfsEncryptionShim(fs, sessionConf);
      } else {
        LOG.debug("Could not get hdfsEncryptionShim, it is only applicable to hdfs filesystem.");
      }
    } catch (Exception e) {
      throw new HiveException(e);
    }
  }

  return hdfsEncryptionShim;
}
{code}

When the {{FileSystem}} instance is created, using the {{sessionConf}} implies 
that the current HDFS is going to be used. This call should instead fetch the 
{{FileSystem}} instance corresponding to the path being checked.

A fix is forthcoming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-14274) When columns are added to structs in a Hive table, HCatLoader breaks.

2016-07-18 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-14274:
---

 Summary: When columns are added to structs in a Hive table, 
HCatLoader breaks.
 Key: HIVE-14274
 URL: https://issues.apache.org/jira/browse/HIVE-14274
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 2.1.0, 1.2.1
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


Consider this sequence of table/partition creation and schema evolution:
{code:sql}
-- Create table.
CREATE EXTERNAL TABLE `simple_text` (
    foo STRING,
    bar STRUCT<f1:STRING, f2:STRING>
)
PARTITIONED BY ( dt STRING )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ':'
STORED AS TEXTFILE ;

-- Add partition.
ALTER TABLE simple_text ADD PARTITION ( dt='0' );

-- Alter the struct-column to add a new sub-field (sub-field names illustrative).
ALTER TABLE simple_text CHANGE COLUMN bar bar STRUCT<f1:STRING, f2:STRING, f3:STRING>;
{code}

The {{dt='0'}} partition's schema indicates 2 fields in {{bar}}. The data can 
be read using Hive, but not through HCatLoader. The error looks as follows:

{noformat}
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while 
executing (Name: data_raw: 
Store(hdfs://dilithiumblue-nn1.blue.ygrid.yahoo.com:8020/tmp/temp-643668868/tmp-1639945319:org.apache.pig.impl.io.TFileStorage)
 - scope-1 Operator Key: scope-1): 
org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
converting read value to tuple
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
at 
org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:376)
at 
org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:241)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:362)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
org.apache.pig.backend.executionengine.ExecException: ERROR 6018: Error 
converting read value to tuple
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:160)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
... 16 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6018: 
Error converting read value to tuple
at 
org.apache.hive.hcatalog.pig.HCatBaseLoader.getNext(HCatBaseLoader.java:76)
at org.apache.hive.hcatalog.pig.HCatLoader.getNext(HCatLoader.java:63)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
at 
org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:118)
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POSimpleTezLoad.getNextTuple(POSimpleTezLoad.java:140)
... 17 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at 
org.apache.hive.hcatalog.pig.PigHCatUtil.transformToTuple(PigHCatUtil.java:468)
at 
org.apache.hive.hcatalog.pig.PigHCatUtil.transformToTuple(PigHCatUtil.java:451)
at 
org.apache.hive.hcatalog.pig.PigHCatUtil.extractPigObject(PigHCatUtil.java:410)
at 
org.apache.hive.hcatalog.pig.PigHCatUtil.transformToTuple(PigHCatUtil.java:468)
at 

[jira] [Created] (HIVE-12734) Remove redundancy in HiveConfs serialized to UDFContext

2015-12-22 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-12734:
---

 Summary: Remove redundancy in HiveConfs serialized to UDFContext
 Key: HIVE-12734
 URL: https://issues.apache.org/jira/browse/HIVE-12734
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.1, 2.0.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


{{HCatLoader}} ends up serializing one {{HiveConf}} instance per table-alias 
to Pig's {{UDFContext}}, which bloats the {{UDFContext}}.

To reduce the footprint, it makes sense to serialize a default-constructed 
{{HiveConf}} once, and one "diff" per {{HCatLoader}}. This should reduce the 
time taken to kick off jobs from {{pig -useHCatalog}} scripts.

(Note_to_self: YHIVE-540).
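
A minimal sketch of the "diff" computation (the helper name is hypothetical):

{code:java}
import java.util.Map;
import java.util.Properties;

import org.apache.hadoop.hive.conf.HiveConf;

// Hedged sketch: serialize only the entries that differ from a
// default-constructed HiveConf, instead of a full conf per table-alias.
public static Properties confDiff(HiveConf defaultConf, HiveConf jobConf) {
  Properties diff = new Properties();
  for (Map.Entry<String, String> e : jobConf) { // Configuration is Iterable
    String defaultVal = defaultConf.get(e.getKey());
    if (defaultVal == null || !defaultVal.equals(e.getValue())) {
      diff.setProperty(e.getKey(), e.getValue());
    }
  }
  return diff; // serialize this (plus one shared default conf) to the UDFContext
}
{code}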



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-12627) Hadoop23Shims.runDistCp() skips CRC checks.

2015-12-08 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-12627:
---

 Summary: Hadoop23Shims.runDistCp() skips CRC checks.
 Key: HIVE-12627
 URL: https://issues.apache.org/jira/browse/HIVE-12627
 Project: Hive
  Issue Type: Bug
Reporter: Mithun Radhakrishnan


{{Hadoop23Shims.runDistCp()}} seems to be skipping CRC-checks. Skipping them 
opens the door to bad data being copied/committed. Is there a reason why we're doing this?

It's possible that if the final path is a file-system whose default block-sizes 
differ from the source, the checksum-checks for the copy could fail. But since 
we're preserving the files' block-sizes, this shouldn't be a concern.

Why are we skipping checksum checks? Can that be removed?
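
For reference, a hedged sketch of what a checksum-preserving copy might look like against the Hadoop-2 DistCp API (this is not the shim's actual code):

{code:java}
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

// Hedged sketch: copy with CRC verification enabled, preserving block-sizes
// so that source and target checksums remain comparable.
public static void runDistCpWithCrc(Configuration conf, Path src, Path dst) throws Exception {
  DistCpOptions options = new DistCpOptions(Arrays.asList(src), dst);
  options.setSkipCRC(false);                               // keep the checksum checks
  options.preserve(DistCpOptions.FileAttribute.BLOCKSIZE); // block-size parity
  new DistCp(conf, options).execute();
}
{code}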



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-11790) HCatLoader documentation refers to deprecated package.

2015-09-10 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-11790:
---

 Summary: HCatLoader documentation refers to deprecated package.
 Key: HIVE-11790
 URL: https://issues.apache.org/jira/browse/HIVE-11790
 Project: Hive
  Issue Type: Bug
Reporter: Mithun Radhakrishnan
Priority: Trivial


The [HCatLoader documentation 
page|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=34013511] 
seems to refer to {{org.apache.hcatalog.pig.HCatLoader}} instead of 
{{org.apache.hive.hcatalog.pig.HCatLoader}}. (Similarly, the {{HCatStorer}} 
documentation might need change.) The old package was deprecated and removed in 
Hive 0.13.

Let's change the documentation to reflect the new package-name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-11548) HCatLoader should support predicate pushdown.

2015-08-13 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-11548:
---

 Summary: HCatLoader should support predicate pushdown.
 Key: HIVE-11548
 URL: https://issues.apache.org/jira/browse/HIVE-11548
 Project: Hive
  Issue Type: New Feature
  Components: HCatalog
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


When one uses {{HCatInputFormat}}/{{HCatLoader}} to read from file-formats that 
support predicate pushdown (such as ORC, with 
{{hive.optimize.index.filter=true}}), one sees that the predicates aren't 
actually pushed down into the storage layer.

The forthcoming patch should allow for filter-pushdown, if any of the 
partitions being scanned with {{HCatLoader}} support the functionality. The 
patch should technically allow the same for users of {{HCatInputFormat}}, but I 
don't currently have a neat interface to build a compound predicate-expression. 
Will add this separately, if required.
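
For a sense of the mechanism, a hedged sketch (assuming Hive 2's {{SerializationUtilities}}; the property name is the one Hive's own readers consume):

{code:java}
import org.apache.hadoop.hive.ql.exec.SerializationUtilities;
import org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc;
import org.apache.hadoop.mapred.JobConf;

// Hedged sketch: hand the storage layer a serialized filter expression, the
// same way Hive's own ORC reader receives one when index-filtering is on.
public static void pushDownFilter(JobConf jobConf, ExprNodeGenericFuncDesc filterExpr) {
  jobConf.set("hive.io.filter.expr.serialized",
      SerializationUtilities.serializeExpression(filterExpr));
  jobConf.setBoolean("hive.optimize.index.filter", true);
}
{code}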



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-11470) NPE in DynamicPartFileRecordWriterContainer on null part-keys.

2015-08-05 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-11470:
---

 Summary: NPE in DynamicPartFileRecordWriterContainer on null 
part-keys.
 Key: HIVE-11470
 URL: https://issues.apache.org/jira/browse/HIVE-11470
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 1.2.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


When partitioning data using {{HCatStorer}}, one sees the following NPE, if the 
dyn-part-key is of null-value:

{noformat}
2015-07-30 23:59:59,627 WARN [main] org.apache.hadoop.mapred.YarnChild: 
Exception running child : java.io.IOException: java.lang.NullPointerException
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:473)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:436)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:416)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NullPointerException
at 
org.apache.hive.hcatalog.mapreduce.DynamicPartitionFileRecordWriterContainer.getLocalFileWriter(DynamicPartitionFileRecordWriterContainer.java:141)
at 
org.apache.hive.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:110)
at 
org.apache.hive.hcatalog.mapreduce.FileRecordWriterContainer.write(FileRecordWriterContainer.java:54)
at org.apache.hive.hcatalog.pig.HCatBaseStorer.putNext(HCatBaseStorer.java:309)
at org.apache.hive.hcatalog.pig.HCatStorer.putNext(HCatStorer.java:61)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
at 
org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
at 
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at 
org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:471)
... 11 more
{noformat}

The reason is that the {{DynamicPartitionFileRecordWriterContainer}} makes an 
unfortunate assumption when fetching a local file-writer instance:

{code:title=DynamicPartitionFileRecordWriterContainer.java}
@Override
protected LocalFileWriter getLocalFileWriter(HCatRecord value)
    throws IOException, HCatException {

  OutputJobInfo localJobInfo = null;
  // Calculate which writer to use from the remaining values - this needs to
  // be done before we delete cols.
  List<String> dynamicPartValues = new ArrayList<String>();
  for (Integer colToAppend : dynamicPartCols) {
    dynamicPartValues.add(value.get(colToAppend).toString()); // <-- YIKES!
  }
  ...
}
{code}

Must check for null, and substitute with {{\_\_HIVE_DEFAULT_PARTITION\_\_}}, 
or equivalent.
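
Something along these lines, perhaps (a hedged sketch, not the committed fix):

{code:java}
// Hedged sketch: a null-safe variant of the loop above, substituting the
// default partition-name for null dynamic-partition values.
List<String> dynamicPartValues = new ArrayList<String>();
for (Integer colToAppend : dynamicPartCols) {
  Object partValue = value.get(colToAppend);
  dynamicPartValues.add(partValue == null
      ? "__HIVE_DEFAULT_PARTITION__" // or read hive.exec.default.partition.name from conf
      : partValue.toString());
}
{code}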



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-11475) Bad rename of directory during commit, when using HCat dynamic-partitioning.

2015-08-05 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-11475:
---

 Summary: Bad rename of directory during commit, when using HCat 
dynamic-partitioning.
 Key: HIVE-11475
 URL: https://issues.apache.org/jira/browse/HIVE-11475
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Affects Versions: 1.2.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Critical


Here's one that [~knoguchi] found and root-caused. This one's a doozy. 

Under seemingly random conditions, the temporary output (under 
{{_SCRATCH1.234*}}) for HCat's dynamic partitioner isn't promoted correctly to 
the final table directory.

The namenode logs indicated a botched directory-rename:

{noformat}
2015-08-02 03:24:29,090 INFO FSNamesystem.audit: allowed=true ugi=myth 
(auth:TOKEN) via wrkf...@grid.myth.net (auth:TOKEN) ip=/10.192.100.117 
cmd=rename 
src=/projects/hive/myth.db/myth_table_15m/_SCRATCH2.8772158158263395E-4/tc=1/utc_time=201508020145/part-r-0
 
dst=/projects/hive/myth.db/myth_table_15mE-4/tc=1/utc_time=201508020145/part-r-0
 perm=myth:madcaps:rw-r-r- proto=rpc
{noformat}

Note that the table-directory name {{myth_table_15m}} is appended with 
{{E-4}}. This'll break anything that uses HDFS-based polling.

[~knoguchi] points out the following code:

{code:title=HCatOutputFormat.java}
119   if ((idHash = conf.get(HCatConstants.HCAT_OUTPUT_ID_HASH)) == null) {
120     idHash = String.valueOf(Math.random());
121   }
{code}

{code:title=FileOutputCommitterContainer.java}
370   String finalLocn = jobLocation.replaceAll(Path.SEPARATOR + SCRATCH_DIR_NAME + "\\d\\.?\\d+", "");
{code}

The problem is that when {{Math.random()}} produces a number < 10 ^-3^, 
{{String.valueOf(double)}} uses exponential notation. The regex doesn't capture 
or handle this notation.
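
To make the failure concrete (the random value below is made up for illustration):

{code:java}
// For a sufficiently small random value, String.valueOf() switches to
// exponential notation:
String idHash  = String.valueOf(0.00087721581);  // "8.7721581E-4"
String scratch = "_SCRATCH" + idHash;            // "_SCRATCH8.7721581E-4"
// The cleanup regex stops at "8.7721581", so the "E-4" tail survives the
// rename and gets glued onto the table-directory name:
System.out.println(("/" + scratch).replaceAll("/" + "_SCRATCH" + "\\d\\.?\\d+", "")); // "E-4"
{code}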

The fix belies the debugging-effort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10598) Vectorization borks when column is added to table.

2015-05-04 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-10598:
---

 Summary: Vectorization borks when column is added to table.
 Key: HIVE-10598
 URL: https://issues.apache.org/jira/browse/HIVE-10598
 Project: Hive
  Issue Type: Bug
  Components: Vectorization
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


Consider the following table definition:
{code:sql}
create table foobar ( foo string, bar string ) partitioned by (dt string) 
stored as orc;
alter table foobar add partition( dt='20150101' ) ;
{code}
Say the partition has the following data:
{noformat}
1   one 20150101
2   two 20150101
3   three   20150101
{noformat}
If a new column is added to the table-schema (and the partition continues to 
have the old schema), vectorized read from the old partitions fail thus:
{code:sql}
alter table foobar add columns( goo string );
select count(1) from foobar;
{code}

{code:title=stacktrace}
java.lang.Exception: java.lang.RuntimeException: Error creating a batch
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.RuntimeException: Error creating a batch
at 
org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:114)
at 
org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:52)
at 
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.createValue(CombineHiveRecordReader.java:84)
at 
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.createValue(CombineHiveRecordReader.java:42)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.createValue(HadoopShimsSecure.java:156)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.createValue(MapTask.java:180)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: No type entry 
found for column 3 in map {4=Long}
at 
org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.addScratchColumnsToBatch(VectorizedRowBatchCtx.java:632)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.createVectorizedRowBatch(VectorizedRowBatchCtx.java:343)
at 
org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:112)
... 14 more
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10492) HCatClient.dropPartitions() should check its partition-spec arguments.

2015-04-26 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-10492:
---

 Summary: HCatClient.dropPartitions() should check its 
partition-spec arguments.
 Key: HIVE-10492
 URL: https://issues.apache.org/jira/browse/HIVE-10492
 Project: Hive
  Issue Type: Bug
  Components: API, HCatalog
Affects Versions: 1.1.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


{{HCatClient.dropPartitions()}} doesn't check the arguments in the 
partition-spec. This can lead to a {{RuntimeException}} when partition-keys are 
specified incorrectly.

We should check the arguments _a priori_ and throw a descriptive 
{{IllegalArgumentException}}.
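
Roughly along these lines (a hedged sketch; names are illustrative):

{code:java}
import java.util.Map;

import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;

// Hedged sketch: fail fast with a descriptive message if the caller names a
// partition-key the table doesn't actually have.
public static void checkPartitionSpec(Table table, Map<String, String> partitionSpec) {
  for (String specifiedKey : partitionSpec.keySet()) {
    boolean found = false;
    for (FieldSchema partKey : table.getPartitionKeys()) {
      if (partKey.getName().equalsIgnoreCase(specifiedKey)) {
        found = true;
        break;
      }
    }
    if (!found) {
      throw new IllegalArgumentException("Unknown partition-key: " + specifiedKey
          + " for table " + table.getTableName());
    }
  }
}
{code}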



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10420) Black-list for table-properties in replicated-tables.

2015-04-21 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-10420:
---

 Summary: Black-list for table-properties in replicated-tables.
 Key: HIVE-10420
 URL: https://issues.apache.org/jira/browse/HIVE-10420
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


(Not essential for 1.2 release, although this'll be good to have.)
When table-schema changes are propagated between 2 HiveMetastore/HCatalog 
instances (using {{HCatTable.diff()}} and {{HCatTable.resolve()}}, some table 
properties are replicated identically, even though those properties might be 
specific to the source-table (or source-metastore).

For instance,
# Last update/DDL time
# JMS message coordinates
# Whether or not the table is external (ideally)

We should run the replication properties through a black-list filter, and have 
these removed when generating diffs, or replicating tables.
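
Sketch of the filter (a hedged illustration; the second key below is hypothetical, and the actual black-list would presumably be configurable):

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hedged sketch: strip source-specific parameters before computing a diff or
// replicating a table.
public static void filterForReplication(Map<String, String> tableParams) {
  Set<String> blackList = new HashSet<String>(Arrays.asList(
      "transient_lastDdlTime",   // last update/DDL time
      "hive.metastore.event.id", // hypothetical JMS/event-coordinate key
      "EXTERNAL"));              // whether the table is external
  tableParams.keySet().removeAll(blackList);
}
{code}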



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10250) Optimize AuthorizationPreEventListener to reuse TableWrapper objects

2015-04-07 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-10250:
---

 Summary: Optimize AuthorizationPreEventListener to reuse 
TableWrapper objects
 Key: HIVE-10250
 URL: https://issues.apache.org/jira/browse/HIVE-10250
 Project: Hive
  Issue Type: Bug
  Components: Authorization
Reporter: Mithun Radhakrishnan


Here's the {{PartitionWrapper}} class in {{AuthorizationPreEventListener}}:
{code:java|title=AuthorizationPreEventListener.java}
public static class PartitionWrapper extends org.apache.hadoop.hive.ql.metadata.Partition {
  ...
  public PartitionWrapper(org.apache.hadoop.hive.metastore.api.Partition mapiPart,
      PreEventContext context) throws ... {
    Partition wrapperApiPart = mapiPart.deepCopy();
    Table t = context.getHandler().get_table_core(
        mapiPart.getDbName(),
        mapiPart.getTableName());
    ...
}
{code}

{{PreAddPartitionEvent}} (and soon, {{PreDropPartitionEvent}}) correspond not 
just to a single partition, but an entire set of partitions added atomically. 
When the event is authorized, {{HMSHandler.get_table_core()}} will be called 
once for every partition in the Event instance.

Since we already make the assumption that the partition-sets correspond to a 
single table, we might as well make a single call.

I'll have a patch for this, shortly.
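
The gist of it (a hedged sketch; accessor and constructor shapes are assumptions):

{code:java}
// Hedged sketch: fetch the Table once per event, and reuse it while wrapping
// every partition in the batch, instead of one get_table_core() per partition.
Table t = context.getHandler().get_table_core(dbName, tableName); // one call, not N
for (org.apache.hadoop.hive.metastore.api.Partition mapiPart : partitionsInEvent) {
  wrapped.add(new PartitionWrapper(t, mapiPart)); // hypothetical (Table, Partition) ctor
}
{code}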



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10213) MapReduce jobs using dynamic-partitioning fail on commit.

2015-04-03 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-10213:
---

 Summary: MapReduce jobs using dynamic-partitioning fail on commit.
 Key: HIVE-10213
 URL: https://issues.apache.org/jira/browse/HIVE-10213
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


I recently ran into a problem in {{TaskCommitContextRegistry}}, when using 
dynamic-partitions.

Consider a MapReduce program that reads HCatRecords from a table (using 
HCatInputFormat), and then writes to another table (with identical schema), 
using HCatOutputFormat. The Map-task fails with the following exception:

{code}
Error: java.io.IOException: No callback registered for 
TaskAttemptID:attempt_1426589008676_509707_m_00_0@hdfs://crystalmyth.myth.net:8020/user/mithunr/mythdb/target/_DYN0.6784154320609959/grid=__HIVE_DEFAULT_PARTITION__/dt=__HIVE_DEFAULT_PARTITION__
at 
org.apache.hive.hcatalog.mapreduce.TaskCommitContextRegistry.commitTask(TaskCommitContextRegistry.java:56)
at 
org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.commitTask(FileOutputCommitterContainer.java:139)
at org.apache.hadoop.mapred.Task.commit(Task.java:1163)
at org.apache.hadoop.mapred.Task.done(Task.java:1025)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:345)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
{code}

{{TaskCommitContextRegistry::commitTask()}} uses call-backs registered from 
{{DynamicPartitionFileRecordWriter}}. But in case {{HCatInputFormat}} and 
{{HCatOutputFormat}} are both used in the same job, the 
{{DynamicPartitionFileRecordWriter}} might only be exercised in the Reducer.

I'm relaxing the IOException, and logging a warning message instead of just failing.
(I'll post the fix shortly.)
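
In spirit, the relaxation looks like this (a hedged sketch; the registry's internal names are assumptions):

{code:java}
// Hedged sketch of commitTask(): treat a missing callback as "nothing to
// commit in this task", rather than as a hard failure.
public void commitTask(TaskAttemptContext context) throws IOException {
  String key = generateKey(context); // hypothetical helper
  TaskCommitterProxy committer = taskCommitters.get(key);
  if (committer == null) {
    // E.g. HCatInputFormat + HCatOutputFormat in one job: the dynamic-partition
    // writer may only run in the reducer, so map-side commits see no callback.
    LOG.warn("No callback registered for " + key + "; skipping commitTask().");
    return;
  }
  committer.commitTask(context);
}
{code}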



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9322) Make null-checks consistent for MapObjectInspector subclasses.

2015-02-20 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329418#comment-14329418
 ] 

Mithun Radhakrishnan commented on HIVE-9322:


@[~ashutoshc]: You're right about the deja vu. :]
https://issues.apache.org/jira/browse/HIVE-6389?focusedCommentId=13917716&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13917716

(The problem in HIVE-6389 was that we were returning -1, when the data was 
NULL, even if it wasn't an integer-map. We discussed the data vs key null-check 
as an aside.)

At the moment, the semantics aren't uniform across OIs. {{LazyBinaryMapOI}} and 
{{DeepParquetHiveMapOI}} already guard against null-keys, while the others 
don't. Wouldn't uniformity be best? In light of your performance concern, 
should we consider removing the null-checks in all MapOIs?

I don't think we're changing semantics of what can be stored in a Map because 
I'd expect an NPE when writing a null-key (although I might be mistaken). We're 
only guarding against non-deterministic behaviour for stuff like:

{code:sql}
SELECT map_column[ string_column ] FROM my_table; 
{code}

... in cases where {{string_column IS NULL}}.

 Make null-checks consistent for MapObjectInspector subclasses.
 --

 Key: HIVE-9322
 URL: https://issues.apache.org/jira/browse/HIVE-9322
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Minor
 Attachments: HIVE-9322.1.patch


 {{LazyBinaryMapObjectInspector}}, {{DeepParquetHiveMapInspector}}, etc. check 
 both the map-column value and the map-key for null, before dereferencing 
 them. {{OrcMapObjectInspector}} and {{LazyMapObjectInspector}} do not.
 This patch brings them all in sync. Might not be a real problem, unless (for 
 example) the lookup key is itself a (possibly null) value from another column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-02-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: (was: HIVE-9674.1.patch)

 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan

 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9736) StorageBasedAuthProvider should batch namenode-calls where possible.

2015-02-19 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-9736:
--

 Summary: StorageBasedAuthProvider should batch namenode-calls 
where possible.
 Key: HIVE-9736
 URL: https://issues.apache.org/jira/browse/HIVE-9736
 Project: Hive
  Issue Type: Bug
  Components: Metastore, Security
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


Consider a table partitioned by 2 keys (dt, region). Say a dt partition could 
have many associated regions. Consider that the user does:
{code:sql}
ALTER TABLE my_table DROP PARTITION (dt='20150101');
{code}

As things stand now, {{StorageBasedAuthProvider}} will make individual 
{{DistributedFileSystem.listStatus()}} calls for each partition-directory, and 
authorize each one separately. It'd be faster to batch the calls, and examine 
multiple FileStatus objects at once.
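
Batched roughly like this (a hedged sketch; {{checkPermissions()}} stands in for the provider's existing check):

{code:java}
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;

// Hedged sketch: one batched listStatus() over all partition directories,
// instead of one namenode round-trip per partition.
public static void authorizeDrop(FileSystem fs, List<Path> partitionDirs) throws Exception {
  FileStatus[] statuses = fs.listStatus(partitionDirs.toArray(new Path[0]));
  for (FileStatus status : statuses) {
    checkPermissions(status, FsAction.WRITE); // hypothetical existing per-dir check
  }
}
{code}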



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-02-19 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328321#comment-14328321
 ] 

Mithun Radhakrishnan commented on HIVE-9674:


Sorry, this patch isn't ready. {{PreDropPartitionEvent}}'s constructor doesn't 
take {{Iterable<Partition>}}. Also, {{HiveMetaStore}} should be sending a 
single {{PreDropPartitionEvent}}, instead of one per partition. Will update the 
patch shortly.

 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9674.1.patch


 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9609) AddPartitionMessage.getPartitions() can return null

2015-02-19 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328471#comment-14328471
 ] 

Mithun Radhakrishnan commented on HIVE-9609:


Also also, could we please change 
{{AuthorizationPreEventListener.authorizeAddPartition}} to use 
{{PartitionWrapper}}'s first constructor, so that the table isn't fetched every 
time?

 AddPartitionMessage.getPartitions() can return null
 ---

 Key: HIVE-9609
 URL: https://issues.apache.org/jira/browse/HIVE-9609
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan
 Attachments: HIVE-9609.2.patch, HIVE-9609.patch


 DbNotificationListener and NotificationListener both depend on 
 AddPartitionEvent.getPartitions() to get their partitions to trigger a 
 message, but this can be null if an AddPartitionEvent was initialized on a 
 PartitionSpec rather than a List<Partition>.
 Also, AddPartitionEvent seems to have a duality, where getPartitions() works 
 only if instantiated on a List<Partition>, and getPartitionIterator() works 
 only if instantiated on a PartitionSpec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9681) Extend HiveAuthorizationProvider to support partition-sets.

2015-02-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9681:
---
Attachment: HIVE-9681.1.patch

Here's a proposal.

 Extend HiveAuthorizationProvider to support partition-sets.
 ---

 Key: HIVE-9681
 URL: https://issues.apache.org/jira/browse/HIVE-9681
 Project: Hive
  Issue Type: Bug
  Components: Security
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9681.1.patch


 {{HiveAuthorizationProvider}} allows only for the authorization of a single 
 partition at a time. For instance, when the {{StorageBasedAuthProvider}} must 
 authorize an operation on a set of partitions (say from a 
 PreDropPartitionEvent), each partition's data-directory needs to be checked 
 individually. For N partitions, this results in N namenode calls.
 I'd like to add {{authorize()}} overloads that accept multiple partitions. 
 This will allow StorageBasedAuthProvider to make batched namenode calls. 
 P.S. There's 2 further optimizations that are possible:
 1. In the ideal case, we'd have a single call in 
 {{org.apache.hadoop.fs.FileSystem}} to check access for an array of Paths, 
 something like:
 {code:title=FileSystem.java|borderStyle=solid}
 @InterfaceAudience.LimitedPrivate({"HDFS", "Hive"})
 public void access(Path[] paths, FsAction mode) throws
     AccessControlException, FileNotFoundException, IOException
 {...}
 {code}
 2. We can go one better if we could retrieve partition-locations in DirectSQL 
 and use those for authorization. The EventListener-abstraction behind which 
 the AuthProviders operate make this difficult. I can attempt to solve this 
 using a PartitionSpec and a call-back into the ObjectStore from 
 StorageBasedAuthProvider. I'll save this rigmarole for later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-02-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: HIVE-9674.2.patch

Here's the fixed patch. I've changed HiveMetaStore to send a single 
DropPartitionEvent.

Also, please note that by using 
{{AuthorizationPreEventListener.PartitionWrapper}}'s first constructor, we 
avoid multiple calls to {{HMSHandler.get_table_core()}}. This really adds up. 
My perf tests indicate huge savings here. 

We should consider a similar change in HIVE-9609.

 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9674.2.patch


 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-02-12 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-9674:
--

 Summary: *DropPartitionEvent should handle partition-sets.
 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


Dropping a set of N partitions from a table currently results in N 
DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
is wasteful, especially so for large N. It also makes it impossible to even try 
to run authorization-checks on all partitions in a batch.

Taking the cue from HIVE-9603, we should compose an {{Iterable<Partition>}} in 
the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-02-12 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Description: 
Dropping a set of N partitions from a table currently results in N 
DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
is wasteful, especially so for large N. It also makes it impossible to even try 
to run authorization-checks on all partitions in a batch.

Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} in 
the event, and expose them via an {{Iterator}}.

  was:
Dropping a set of N partitions from a table currently results in N 
DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
is wasteful, especially so for large N. It also makes it impossible to even try 
to run authorization-checks on all partitions in a batch.

Taking the cue from HIVE-9603, we should compose an {{Iterable<Partition>}} in 
the event, and expose them via an {{Iterator}}.


 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan

 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9674) *DropPartitionEvent should handle partition-sets.

2015-02-12 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9674:
---
Attachment: HIVE-9674.1.patch

Here's a patch that builds on top of [~sushanth]'s work in HIVE-9609. (Can't 
submit until that one's in.)

 *DropPartitionEvent should handle partition-sets.
 -

 Key: HIVE-9674
 URL: https://issues.apache.org/jira/browse/HIVE-9674
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9674.1.patch


 Dropping a set of N partitions from a table currently results in N 
 DropPartitionEvents (and N PreDropPartitionEvents) being fired serially. This 
 is wasteful, especially so for large N. It also makes it impossible to even 
 try to run authorization-checks on all partitions in a batch.
 Taking the cue from HIVE-9609, we should compose an {{Iterable<Partition>}} 
 in the event, and expose them via an {{Iterator}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9679) Remove redundant null-checks from DbNotificationListener.

2015-02-12 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-9679:
--

 Summary: Remove redundant null-checks from DbNotificationListener.
 Key: HIVE-9679
 URL: https://issues.apache.org/jira/browse/HIVE-9679
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Minor


There's a couple of unnecessary null-checks in {{DbNotificationListener}}. 
There's no way they'd fire. Shall we remove these?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9609) AddPartitionMessage.getPartitions() can return null

2015-02-12 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319274#comment-14319274
 ] 

Mithun Radhakrishnan commented on HIVE-9609:


Actually, splitting out {{JSONAddPartitionMessage.partitions}} into a 
{{List<PartitionKeyName>}} and {{List<List<PartitionValues>>}} will be fairly 
intrusive. Separate JIRA, methinks.

The JSONMessageDeserializer will need to provide backward compatibility, etc. 
Let's hold off on that change.

 AddPartitionMessage.getPartitions() can return null
 ---

 Key: HIVE-9609
 URL: https://issues.apache.org/jira/browse/HIVE-9609
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan
 Attachments: HIVE-9609.2.patch, HIVE-9609.patch


 DbNotificationListener and NotificationListener both depend on 
 AddPartitionEvent.getPartitions() to get their partitions to trigger a 
 message, but this can be null if an AddPartitionEvent was initialized on a 
 PartitionSpec rather than a List<Partition>.
 Also, AddPartitionEvent seems to have a duality, where getPartitions() works 
 only if instantiated on a List<Partition>, and getPartitionIterator() works 
 only if instantiated on a PartitionSpec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9681) Extend HiveAuthorizationProvider to support partition-sets.

2015-02-12 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-9681:
--

 Summary: Extend HiveAuthorizationProvider to support 
partition-sets.
 Key: HIVE-9681
 URL: https://issues.apache.org/jira/browse/HIVE-9681
 Project: Hive
  Issue Type: Bug
  Components: Security
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


{{HiveAuthorizationProvider}} allows only for the authorization of a single 
partition at a time. For instance, when the {{StorageBasedAuthProvider}} must 
authorize an operation on a set of partitions (say from a 
PreDropPartitionEvent), each partition's data-directory needs to be checked 
individually. For N partitions, this results in N namenode calls.

I'd like to add {{authorize()}} overloads that accept multiple partitions. This 
will allow StorageBasedAuthProvider to make batched namenode calls. 

P.S. There's 2 further optimizations that are possible:

1. In the ideal case, we'd have a single call in 
{{org.apache.hadoop.fs.FileSystem}} to check access for an array of Paths, 
something like:
{code:title=FileSystem.java|borderStyle=solid}
@InterfaceAudience.LimitedPrivate({"HDFS", "Hive"})
public void access(Path[] paths, FsAction mode) throws
    AccessControlException, FileNotFoundException, IOException
{...}
{code}

2. We can go one better if we could retrieve partition-locations in DirectSQL 
and use those for authorization. The EventListener-abstraction behind which the 
AuthProviders operate make this difficult. I can attempt to solve this using a 
PartitionSpec and a call-back into the ObjectStore from 
StorageBasedAuthProvider. I'll save this rigmarole for later.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9609) AddPartitionMessage.getPartitions() can return null

2015-02-12 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319350#comment-14319350
 ] 

Mithun Radhakrishnan commented on HIVE-9609:


Also, could we remove the unnecessary null-checks in DbNotificationListener? 
The changes are in HIVE-9679.

 AddPartitionMessage.getPartitions() can return null
 ---

 Key: HIVE-9609
 URL: https://issues.apache.org/jira/browse/HIVE-9609
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan
 Attachments: HIVE-9609.2.patch, HIVE-9609.patch


 DbNotificationListener and NotificationListener both depend on 
 AddPartitionEvent.getPartitions() to get their partitions to trigger a 
 message, but this can be null if an AddPartitionEvent was initialized on a 
 PartitionSpec rather than a List<Partition>.
 Also, AddPartitionEvent seems to have a duality, where getPartitions() works 
 only if instantiated on a List<Partition>, and getPartitionIterator() works 
 only if instantiated on a PartitionSpec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HIVE-9679) Remove redundant null-checks from DbNotificationListener.

2015-02-12 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan resolved HIVE-9679.

Resolution: Duplicate

 Remove redundant null-checks from DbNotificationListener.
 -

 Key: HIVE-9679
 URL: https://issues.apache.org/jira/browse/HIVE-9679
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Minor
 Attachments: HIVE-9679.patch


 There's a couple of unnecessary null-checks in {{DbNotificationListener}}. 
 There's no way they'd fire. Shall we remove these?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9679) Remove redundant null-checks from DbNotificationListener.

2015-02-12 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9679:
---
Attachment: HIVE-9679.patch

Yikes, sorry. We could just clean this up as part of HIVE-9609, since 
DbNotificationListener is being modified there. I'll dupe the JIRA. Sorry about 
the spam.

 Remove redundant null-checks from DbNotificationListener.
 -

 Key: HIVE-9679
 URL: https://issues.apache.org/jira/browse/HIVE-9679
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Minor
 Attachments: HIVE-9679.patch


 There's a couple of unnecessary null-checks in {{DbNotificationListener}}. 
 There's no way they'd fire. Shall we remove these?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9588) Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()

2015-02-11 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9588:
---
Attachment: HIVE-9588.3.patch

Hey, [~sushanth]. Thanks for the initial review. I've moved the chicken-switch 
constant into HiveConf, as you'd suggested. Here's the updated patch.

 Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()
 -

 Key: HIVE-9588
 URL: https://issues.apache.org/jira/browse/HIVE-9588
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore, Thrift API
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9588.1.patch, HIVE-9588.2.patch, HIVE-9588.3.patch


 {{HCatClientHMSImpl.dropPartitions()}} currently has an embarrassingly 
 inefficient implementation. The partial partition-spec is converted into a 
 filter-string. The partitions are fetched from the server, and then dropped 
 one by one.
 Here's a reimplementation that uses the {{ExprNode}}-based 
 {{HiveMetaStoreClient.dropPartitions()}}. It cuts out the excessive 
 back-and-forth between the HMS and the client-side. It also reduces the 
 memory footprint (from loading all the partitions that are to be dropped). 
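
To make the contrast concrete, here's a rough sketch of the two shapes. This 
is not the patch itself: the flag list on the expression-based overload, and 
the use of 0 for the ObjectPair's Integer component, are assumptions made for 
illustration.

{code:title=DropPartitionsSketch.java}
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.hive.common.ObjectPair;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

public class DropPartitionsSketch {

  // Old shape: one listing call plus N drop calls, with every Partition
  // object materialized on the client.
  static void dropOneByOne(HiveMetaStoreClient msc, String db, String table,
                           String filter) throws Exception {
    List<Partition> parts =
        msc.listPartitionsByFilter(db, table, filter, (short) -1);
    for (Partition p : parts) {
      msc.dropPartition(db, table, p.getValues(), true /* deleteData */);
    }
  }

  // New shape: a single expression-driven thrift call. serializedExpr is a
  // serialized ExprNode; how it is produced from the partial partition-spec
  // is out of scope for this sketch.
  static void dropByExpr(HiveMetaStoreClient msc, String db, String table,
                         byte[] serializedExpr) throws Exception {
    msc.dropPartitions(db, table,
        Arrays.asList(new ObjectPair<Integer, byte[]>(0, serializedExpr)),
        true  /* deleteData */,
        false /* ignoreProtection */,
        true  /* ifExists */);
  }
}
{code}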



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9588) Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()

2015-02-11 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9588:
---
Attachment: HIVE-9588.3.patch

 Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()
 -

 Key: HIVE-9588
 URL: https://issues.apache.org/jira/browse/HIVE-9588
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore, Thrift API
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9588.1.patch, HIVE-9588.2.patch, HIVE-9588.3.patch


 {{HCatClientHMSImpl.dropPartitions()}} currently has an embarrassingly 
 inefficient implementation. The partial partition-spec is converted into a 
 filter-string. The partitions are fetched from the server, and then dropped 
 one by one.
 Here's a reimplementation that uses the {{ExprNode}}-based 
 {{HiveMetaStoreClient.dropPartitions()}}. It cuts out the excessive 
 back-and-forth between the HMS and the client-side. It also reduces the 
 memory footprint (from loading all the partitions that are to be dropped). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9588) Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()

2015-02-11 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9588:
---
Attachment: (was: HIVE-9588.3.patch)

 Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()
 -

 Key: HIVE-9588
 URL: https://issues.apache.org/jira/browse/HIVE-9588
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore, Thrift API
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9588.1.patch, HIVE-9588.2.patch, HIVE-9588.3.patch


 {{HCatClientHMSImpl.dropPartitions()}} currently has an embarrassingly 
 inefficient implementation. The partial partition-spec is converted into a 
 filter-string. The partitions are fetched from the server, and then dropped 
 one by one.
 Here's a reimplementation that uses the {{ExprNode}}-based 
 {{HiveMetaStoreClient.dropPartitions()}}. It cuts out the excessive 
 back-and-forth between the HMS and the client-side. It also reduces the 
 memory footprint (from loading all the partitions that are to be dropped). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9609) AddPartitionMessage.getPartitions() can return null

2015-02-11 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317374#comment-14317374
 ] 

Mithun Radhakrishnan commented on HIVE-9609:


Sorry for the delay. I'm going to need a little more time to study the second 
patch; I'd like to get my head around it properly.

But on first glance, I appreciate how streamlined {{AddPartitionEvent}}'s 
interface is now. I'm even more taken by {{Iterators}}.

Minor nitpicks:
1. Should we also change {{JSONMessageFactory.getPartitionKeyValues()}} to 
return {{Iterator<Map<String, String>>}}? Also, appropriately change the 
{{JSONAddPartitionMessage}} constructor? Perhaps not a big deal, given that 
these are just names. But then again, perhaps every little counts?
2. Shall we split {{JSONAddPartitionMessage.partitions}} to include only 
part-values, and avoid the repetition of partition-key-names?
3. JSONMessageFactory.java imports java.util.*. That's likely my 
(code-editor's) doing. Could I please bother you to fix that?

I'd like to do a similar change for DropPartitionEvent/Message. I'll raise a 
JIRA shortly.

 AddPartitionMessage.getPartitions() can return null
 ---

 Key: HIVE-9609
 URL: https://issues.apache.org/jira/browse/HIVE-9609
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan
 Attachments: HIVE-9609.2.patch, HIVE-9609.patch


 DbNotificationListener and NotificationListener both depend on 
 AddPartitionEvent.getPartitions() to get their partitions to trigger a 
 message, but this can be null if an AddPartitionEvent was initialized on a 
 PartitionSpec rather than a List<Partition>.
 Also, AddPartitionEvent seems to have a duality, where getPartitions() works 
 only if instantiated on a List<Partition>, and getPartitionIterator() works 
 only if instantiated on a PartitionSpec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9633) Add HCatClient.dropPartitions() overload to skip deletion of partition-directories.

2015-02-09 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9633:
---
Attachment: HIVE-9633.1.patch

 Add HCatClient.dropPartitions() overload to skip deletion of 
 partition-directories.
 ---

 Key: HIVE-9633
 URL: https://issues.apache.org/jira/browse/HIVE-9633
 Project: Hive
  Issue Type: Bug
  Components: API, HCatalog, Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9633.1.patch


 {{HCatClient.dropPartitions()}} doesn't provide a way to explicitly skip the 
 deletion of partition-directory, as {{HiveMetaStoreClient.dropPartitions()}} 
 does.
 This'll come in handy when using HCatClient to drop partitions, but not 
 delete data.
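
A sketch of the intended usage, assuming the new flag is simply appended to 
the existing four-argument overload:

{code:title=DropKeepDataSketch.java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hive.hcatalog.api.HCatClient;

public class DropKeepDataSketch {
  // Drop the partitions' metadata but leave their directories in place,
  // e.g. for data-loading systems that manage the files themselves.
  static void dropMetadataOnly(HCatClient client) throws Exception {
    Map<String, String> partSpec = new HashMap<String, String>();
    partSpec.put("dt", "2015-02-09");
    client.dropPartitions("mydb", "mytable", partSpec,
        true,   // ifExists
        false); // deleteData: keep the partition-directories
  }
}
{code}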



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9588) Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()

2015-02-09 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313261#comment-14313261
 ] 

Mithun Radhakrishnan commented on HIVE-9588:


A minor update:

1. Dropping 2K partitions using HCatClient.dropPartitions() used to take 204 
seconds for a managed table on my test setup (with an Oracle backend, and 
remote metastore). This now takes 83 seconds.
2. Dropping 5K partitions used to take about 7 minutes. It now takes 4.

 Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()
 -

 Key: HIVE-9588
 URL: https://issues.apache.org/jira/browse/HIVE-9588
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore, Thrift API
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9588.1.patch, HIVE-9588.2.patch


 {{HCatClientHMSImpl.dropPartitions()}} currently has an embarrassingly 
 inefficient implementation. The partial partition-spec is converted into a 
 filter-string. The partitions are fetched from the server, and then dropped 
 one by one.
 Here's a reimplementation that uses the {{ExprNode}}-based 
 {{HiveMetaStoreClient.dropPartitions()}}. It cuts out the excessive 
 back-and-forth between the HMS and the client-side. It also reduces the 
 memory footprint (from loading all the partitions that are to be dropped). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9631) DirectSQL for HMS.drop_partitions_req().

2015-02-09 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-9631:
--

 Summary: DirectSQL for HMS.drop_partitions_req().
 Key: HIVE-9631
 URL: https://issues.apache.org/jira/browse/HIVE-9631
 Project: Hive
  Issue Type: Bug
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


{{HiveMetaStore.drop_partitions_req()}} still seems to:

1. Load the full partition-list into memory
2. Iterate on the partition-list to check for {{isArchived}}, etc.
3. Doesn't use a dropPartitionsAndGetLocations() kind of mechanism.

[~selinazh] is working on pushing more of this down into the ObjectStore, 
similarly to the work for {{dropTable}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9588) Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()

2015-02-09 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313282#comment-14313282
 ] 

Mithun Radhakrishnan commented on HIVE-9588:


Another minor update: 

The numbers quoted above are slashed in half for EXTERNAL tables. Half the 
problem is the iterative deletion of partition directories.
1. In the short term, perhaps we could add an HCatClient.dropPartitions() 
overload that takes a deleteData argument, just as 
HiveMetaStoreClient.drop_partitions_req() does. This way, the caller can choose 
whether to delete the underlying data. (Should be beneficial for data-loading 
programs like GDM/Falcon.)
2. In the long term, we should consider classifying the directories so that we 
drop the common parent, rather than each partition-dir individually.

 Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()
 -

 Key: HIVE-9588
 URL: https://issues.apache.org/jira/browse/HIVE-9588
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore, Thrift API
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9588.1.patch, HIVE-9588.2.patch


 {{HCatClientHMSImpl.dropPartitions()}} currently has an embarrassingly 
 inefficient implementation. The partial partition-spec is converted into a 
 filter-string. The partitions are fetched from the server, and then dropped 
 one by one.
 Here's a reimplementation that uses the {{ExprNode}}-based 
 {{HiveMetaStoreClient.dropPartitions()}}. It cuts out the excessive 
 back-and-forth between the HMS and the client-side. It also reduces the 
 memory footprint (from loading all the partitions that are to be dropped). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9609) AddPartitionMessage.getPartitions() can return null

2015-02-09 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14312876#comment-14312876
 ] 

Mithun Radhakrishnan commented on HIVE-9609:


Hey, Sush. I heartily agree with the intention. It would be good to remove 
cases where all the partitions need to be loaded into memory at once. And for 
this, it makes sense to have AddPartitionEvent initializable only with an 
Iterator<Partition>. 

I like the idea of the Iterable<Partition>. Perhaps the PartitionSpecProxy 
should have implemented Iterable<Partition>. :/

I did run into a problem (that I can't completely recollect). You'll notice 
that {{PartitionSpecProxy.PartitionIterator}} has functionality that ideally 
belongs in {{Partition}}, such as {{setCreateTime()}}, {{putToParameters()}}, 
etc. The reason is that subclasses (such as {{PartitionSpecWithSharedSDProxy}}) 
construct Partition instances as needed, and won't be able to propagate 
setter-actions on Partitions to the underlying implementation. (At least, not 
without subclassing {{Partition}} itself.)

I'm fine with changing the MessageFactory interface, and AddPartitionEvent 
initializers, to eschew {{List<Partition>}} objects. We'll also need to change 
{{HiveMetaStore.fireMetaStoreAddPartitionEvent()}} to conform to this.

 AddPartitionMessage.getPartitions() can return null
 ---

 Key: HIVE-9609
 URL: https://issues.apache.org/jira/browse/HIVE-9609
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan
 Attachments: HIVE-9609.patch


 DbNotificationListener and NotificationListener both depend on 
 AddPartitionEvent.getPartitions() to get their partitions to trigger a 
 message, but this can be null if an AddPartitionEvent was initialized on a 
 PartitionSpec rather than a List<Partition>.
 Also, AddPartitionEvent seems to have a duality, where getPartitions() works 
 only if instantiated on a List<Partition>, and getPartitionIterator() works 
 only if instantiated on a PartitionSpec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9628) HiveMetaStoreClient.dropPartitions(...List<ObjectPair<Integer,byte[]>>...) doesn't take (boolean needResult)

2015-02-09 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9628:
---
Attachment: HIVE-9628.1.patch

 HiveMetaStoreClient.dropPartitions(...List<ObjectPair<Integer,byte[]>>...) 
 doesn't take (boolean needResult)
 

 Key: HIVE-9628
 URL: https://issues.apache.org/jira/browse/HIVE-9628
 Project: Hive
  Issue Type: Bug
  Components: API, Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9628.1.patch


 {{HiveMetaStoreClient::dropPartitions()}} assumes that the dropped 
 {{List<Partition>}} must be returned to the caller. That's a lot of thrift 
 traffic that the caller might choose not to pay for.
 I propose an overload that retains the default behaviour, but allows 
 {{needResult}} to be overridden.
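
Roughly, the call shape this implies; a sketch under the assumption that 
{{needResult}} lands as a trailing flag on the existing overload, with null 
returned when it is false:

{code:title=NeedResultSketch.java}
import java.util.List;

import org.apache.hadoop.hive.common.ObjectPair;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

public class NeedResultSketch {
  // Callers that don't care about the dropped partitions skip the cost of
  // shipping them back over thrift.
  static void dropQuietly(HiveMetaStoreClient msc, String db, String table,
      List<ObjectPair<Integer, byte[]>> partExprs) throws Exception {
    List<Partition> result = msc.dropPartitions(db, table, partExprs,
        true  /* deleteData */,
        false /* ignoreProtection */,
        true  /* ifExists */,
        false /* needResult */);
    assert result == null : "nothing comes back when needResult is false";
  }
}
{code}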



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9628) HiveMetaStoreClient.dropPartitions(...List<ObjectPair<Integer,byte[]>>...) doesn't take (boolean needResult)

2015-02-09 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9628:
---
Status: Patch Available  (was: Open)

 HiveMetaStoreClient.dropPartitions(...List<ObjectPair<Integer,byte[]>>...) 
 doesn't take (boolean needResult)
 

 Key: HIVE-9628
 URL: https://issues.apache.org/jira/browse/HIVE-9628
 Project: Hive
  Issue Type: Bug
  Components: API, Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9628.1.patch


 {{HiveMetaStoreClient::dropPartitions()}} assumes that the dropped 
 {{List<Partition>}} must be returned to the caller. That's a lot of thrift 
 traffic that the caller might choose not to pay for.
 I propose an overload that retains the default behaviour, but allows 
 {{needResult}} to be overridden.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9588) Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()

2015-02-09 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9588:
---
Status: Open  (was: Patch Available)

 Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()
 -

 Key: HIVE-9588
 URL: https://issues.apache.org/jira/browse/HIVE-9588
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore, Thrift API
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9588.1.patch


 {{HCatClientHMSImpl.dropPartitions()}} currently has an embarrassingly 
 inefficient implementation. The partial partition-spec is converted into a 
 filter-string. The partitions are fetched from the server, and then dropped 
 one by one.
 Here's a reimplementation that uses the {{ExprNode}}-based 
 {{HiveMetaStoreClient.dropPartitions()}}. It cuts out the excessive 
 back-and-forth between the HMS and the client-side. It also reduces the 
 memory footprint (from loading all the partitions that are to be dropped). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9588) Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()

2015-02-09 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9588:
---
Attachment: HIVE-9588.2.patch

Refactored. Moved the new {{HiveMetaStoreClient.dropPartitions()}} to a 
separate JIRA.

 Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()
 -

 Key: HIVE-9588
 URL: https://issues.apache.org/jira/browse/HIVE-9588
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore, Thrift API
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9588.1.patch, HIVE-9588.2.patch


 {{HCatClientHMSImpl.dropPartitions()}} currently has an embarrassingly 
 inefficient implementation. The partial partition-spec is converted into a 
 filter-string. The partitions are fetched from the server, and then dropped 
 one by one.
 Here's a reimplementation that uses the {{ExprNode}}-based 
 {{HiveMetaStoreClient.dropPartitions()}}. It cuts out the excessive 
 back-and-forth between the HMS and the client-side. It also reduces the 
 memory footprint (from loading all the partitions that are to be dropped). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9629) HCatClient.dropPartitions() needs speeding up.

2015-02-09 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-9629:
--

 Summary: HCatClient.dropPartitions() needs speeding up.
 Key: HIVE-9629
 URL: https://issues.apache.org/jira/browse/HIVE-9629
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


This is an über JIRA for the work required to speed up 
HCatClient.dropPartitions().

As it stands right now, {{dropPartitions()}} is slow because it takes N 
thrift-calls to drop N partitions, and attempts to store all N partitions in 
memory while it executes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9628) HiveMetaStoreClient.dropPartitions(...List<ObjectPair<Integer,byte[]>>...) doesn't take (boolean needResult)

2015-02-09 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-9628:
--

 Summary: 
HiveMetaStoreClient.dropPartitions(...List<ObjectPair<Integer,byte[]>>...) 
doesn't take (boolean needResult)
 Key: HIVE-9628
 URL: https://issues.apache.org/jira/browse/HIVE-9628
 Project: Hive
  Issue Type: Bug
  Components: API, Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


{{HiveMetaStoreClient::dropPartitions()}} assumes that the dropped 
{{List<Partition>}} must be returned to the caller. That's a lot of thrift 
traffic that the caller might choose not to pay for.

I propose an overload that retains the default behaviour, but allows 
{{needResult}} to be overridden.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9628) HiveMetaStoreClient.dropPartitions(...List<ObjectPair<Integer,byte[]>>...) doesn't take (boolean needResult)

2015-02-09 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313500#comment-14313500
 ] 

Mithun Radhakrishnan commented on HIVE-9628:


Hey, [~sershe]. Thanks for reviewing.

I've made HIVE-9588 depend on this JIRA. This overload will be used from 
{{HCatClient.dropPartitions()}}, since it doesn't need the returned 
partition-list.

 HiveMetaStoreClient.dropPartitions(...List<ObjectPair<Integer,byte[]>>...) 
 doesn't take (boolean needResult)
 

 Key: HIVE-9628
 URL: https://issues.apache.org/jira/browse/HIVE-9628
 Project: Hive
  Issue Type: Bug
  Components: API, Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9628.1.patch


 {{HiveMetaStoreClient::dropPartitions()}} assumes that the dropped 
 {{List<Partition>}} must be returned to the caller. That's a lot of thrift 
 traffic that the caller might choose not to pay for.
 I propose an overload that retains the default behaviour, but allows 
 {{needResult}} to be overridden.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9633) Add HCatClient.dropPartitions() overload to skip deletion of partition-directories.

2015-02-09 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-9633:
--

 Summary: Add HCatClient.dropPartitions() overload to skip deletion 
of partition-directories.
 Key: HIVE-9633
 URL: https://issues.apache.org/jira/browse/HIVE-9633
 Project: Hive
  Issue Type: Bug
  Components: API, HCatalog, Metastore
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


{{HCatClient.dropPartitions()}} doesn't provide a way to explicitly skip the 
deletion of partition-directory, as {{HiveMetaStoreClient.dropPartitions()}} 
does.

This'll come in handy when using HCatClient to drop partitions, but not delete 
data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9588) Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()

2015-02-04 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-9588:
--

 Summary: Reimplement HCatClientHMSImpl.dropPartitions() with 
HMSC.dropPartitions()
 Key: HIVE-9588
 URL: https://issues.apache.org/jira/browse/HIVE-9588
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore, Thrift API
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan


{{HCatClientHMSImpl.dropPartitions()}} currently has an embarrassingly 
inefficient implementation. The partial partition-spec is converted into a 
filter-string. The partitions are fetched from the server, and then dropped one 
by one.

Here's a reimplementation that uses the {{ExprNode}}-based 
{{HiveMetaStoreClient.dropPartitions()}}. It cuts out the excessive 
back-and-forth between the HMS and the client-side. It also reduces the memory 
footprint (from loading all the partitions that are to be dropped). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9588) Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()

2015-02-04 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9588:
---
Status: Patch Available  (was: Open)

 Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()
 -

 Key: HIVE-9588
 URL: https://issues.apache.org/jira/browse/HIVE-9588
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore, Thrift API
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9588.1.patch


 {{HCatClientHMSImpl.dropPartitions()}} currently has an embarrassingly 
 inefficient implementation. The partial partition-spec is converted into a 
 filter-string. The partitions are fetched from the server, and then dropped 
 one by one.
 Here's a reimplementation that uses the {{ExprNode}}-based 
 {{HiveMetaStoreClient.dropPartitions()}}. It cuts out the excessive 
 back-and-forth between the HMS and the client-side. It also reduces the 
 memory footprint (from loading all the partitions that are to be dropped). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9588) Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()

2015-02-04 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9588:
---
Attachment: HIVE-9588.1.patch

To help things along, I've added an overload to {{HMSC.dropPartitions()}} that 
allows us to skip returning the partition-list.

 Reimplement HCatClientHMSImpl.dropPartitions() with HMSC.dropPartitions()
 -

 Key: HIVE-9588
 URL: https://issues.apache.org/jira/browse/HIVE-9588
 Project: Hive
  Issue Type: Bug
  Components: HCatalog, Metastore, Thrift API
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9588.1.patch


 {{HCatClientHMSImpl.dropPartitions()}} currently has an embarrassingly 
 inefficient implementation. The partial partition-spec is converted into a 
 filter-string. The partitions are fetched from the server, and then dropped 
 one by one.
 Here's a reimplementation that uses the {{ExprNode}}-based 
 {{HiveMetaStoreClient.dropPartitions()}}. It cuts out the excessive 
 back-and-forth between the HMS and the client-side. It also reduces the 
 memory footprint (from loading all the partitions that are to be dropped). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9565) Minor cleanup in TestMetastoreExpr.

2015-02-03 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9565:
---
Attachment: HIVE-9565.1.patch

 Minor cleanup in TestMetastoreExpr.
 ---

 Key: HIVE-9565
 URL: https://issues.apache.org/jira/browse/HIVE-9565
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Minor
 Attachments: HIVE-9565.1.patch


 Here's some minor cleanup in TestMetastoreExpr.
 0. There was an incomplete refactor in {{ExprBuilder.fn()}}, where the 
 {{TypeInfo}} object was hard-coded to {{booleanTypeInfo}}.
 1. I've removed the {{throws}} clauses for redundant exceptions.
 2. Expected exceptions have been labelled as {{ignore}}.
 As an aside, that {{ExprBuilder}} is nifty. It's just the thing to exercise 
 {{HiveMetastoreClient.listPartitionsByExpr()}}. I'm whacking this idea for an 
 upcoming fix to {{HCatClient.dropPartitions()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9565) Minor cleanup in TestMetastoreExpr.

2015-02-03 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9565:
---
Status: Patch Available  (was: Open)

 Minor cleanup in TestMetastoreExpr.
 ---

 Key: HIVE-9565
 URL: https://issues.apache.org/jira/browse/HIVE-9565
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Minor
 Attachments: HIVE-9565.1.patch


 Here's some minor cleanup in TestMetastoreExpr.
 0. There was an incomplete refactor in {{ExprBuilder.fn()}}, where the 
 {{TypeInfo}} object was hard-coded to {{booleanTypeInfo}}.
 1. I've removed the {{throws}} clauses for redundant exceptions.
 2. Expected exceptions have been labelled as {{ignore}}.
 As an aside, that {{ExprBuilder}} is nifty. It's just the thing to exercise 
 {{HiveMetastoreClient.listPartitionsByExpr()}}. I'm whacking this idea for an 
 upcoming fix to {{HCatClient.dropPartitions()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9471) Bad seek in uncompressed ORC, at row-group boundary.

2015-02-03 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303839#comment-14303839
 ] 

Mithun Radhakrishnan commented on HIVE-9471:


Thanks for the advice and review, Prasanth. Much appreciated.

 Bad seek in uncompressed ORC, at row-group boundary.
 

 Key: HIVE-9471
 URL: https://issues.apache.org/jira/browse/HIVE-9471
 Project: Hive
  Issue Type: Bug
  Components: File Formats, Serializers/Deserializers
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 1.2.0

 Attachments: HIVE-9471.2.patch, HIVE-9471.3.patch, data.txt, 
 orc_bad_seek_failure_case.hive, orc_bad_seek_setup.hive


 Under at least one specific condition, using index-filters in ORC causes a 
 bad seek into the ORC row-group.
 {code:title=stacktrace}
 java.io.IOException: java.lang.IllegalArgumentException: Seek in Stream for 
 column 2 kind DATA to 0 is outside of the data
   at 
 org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:507)
   at 
 org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:414)
   at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
   at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1655)
   at 
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:227)
   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:305)
 ...
 Caused by: java.lang.IllegalArgumentException: Seek in Stream for column 2 
 kind DATA to 0 is outside of the data
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$UncompressedStream.seek(InStream.java:112)
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$UncompressedStream.seek(InStream.java:96)
   at 
 org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.seek(RunLengthIntegerReaderV2.java:310)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StringDictionaryTreeReader.seek(RecordReaderImpl.java:1596)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StringTreeReader.seek(RecordReaderImpl.java:1337)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.seek(RecordReaderImpl.java:1852)
 {code}
 I'll attach the script to reproduce the problem herewith.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9565) Minor cleanup in TestMetastoreExpr.

2015-02-03 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-9565:
--

 Summary: Minor cleanup in TestMetastoreExpr.
 Key: HIVE-9565
 URL: https://issues.apache.org/jira/browse/HIVE-9565
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Minor


Here's some minor cleanup in TestMetastoreExpr.
0. There was an incomplete refactor in {{ExprBuilder.fn()}}, where the 
{{TypeInfo}} object was hard-coded to {{booleanTypeInfo}}.
1. I've removed the {{throws}} clauses for redundant exceptions.
2. Expected exceptions have been labelled as {{ignore}}.

As an aside, that {{ExprBuilder}} is nifty. It's just the thing to exercise 
{{HiveMetastoreClient.listPartitionsByExpr()}}. I'm whacking this idea for an 
upcoming fix to {{HCatClient.dropPartitions()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-9553) Fix log-line in Partition Pruner

2015-02-02 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HIVE-9553:
--

 Summary: Fix log-line in Partition Pruner
 Key: HIVE-9553
 URL: https://issues.apache.org/jira/browse/HIVE-9553
 Project: Hive
  Issue Type: Bug
  Components: Logging
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Trivial


Minor issue in logging the prune-expression in the PartitionPruner:

{code:title=PartitionPruner.java|titleBGColor=#F7D6C1|bgColor=#CE}
LOG.trace("prune Expression = " + prunerExpr == null ? "" : prunerExpr);
{code}

Given the operator precedence order, this should read:

{code:title=PartitionPruner.java|titleBGColor=#F7D6C1|bgColor=#CE}
LOG.trace("prune Expression = " + (prunerExpr == null ? "" : prunerExpr));
{code}
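
To see why the parentheses matter: {{+}} binds tighter than {{==}}, so the 
unparenthesized form compares the whole concatenation with null (always 
false), and the ternary just yields {{prunerExpr}}, silently dropping the 
prefix. A standalone demonstration, with {{Object}} standing in for the real 
expression type:

{code:title=PrecedenceDemo.java}
public class PrecedenceDemo {
  public static void main(String[] args) {
    Object prunerExpr = null;

    // Parses as (("prune Expression = " + prunerExpr) == null) ? "" : prunerExpr.
    // The concatenation is never null, so this always evaluates to
    // prunerExpr itself -- here, null.
    Object buggy = "prune Expression = " + prunerExpr == null ? "" : prunerExpr;
    System.out.println(buggy); // prints: null

    // Parenthesized, the ternary guards prunerExpr before concatenation.
    String fixed = "prune Expression = " + (prunerExpr == null ? "" : prunerExpr);
    System.out.println(fixed); // prints: prune Expression =
  }
}
{code}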




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9553) Fix log-line in Partition Pruner

2015-02-02 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9553:
---
Status: Patch Available  (was: Open)

 Fix log-line in Partition Pruner
 

 Key: HIVE-9553
 URL: https://issues.apache.org/jira/browse/HIVE-9553
 Project: Hive
  Issue Type: Bug
  Components: Logging
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Trivial
 Attachments: HIVE-9553.1.patch


 Minor issue in logging the prune-expression in the PartitionPruner:
 {code:title=PartitionPruner.java|titleBGColor=#F7D6C1|bgColor=#CE}
 LOG.trace("prune Expression = " + prunerExpr == null ? "" : prunerExpr);
 {code}
 Given the operator precedence order, this should read:
 {code:title=PartitionPruner.java|titleBGColor=#F7D6C1|bgColor=#CE}
 LOG.trace("prune Expression = " + (prunerExpr == null ? "" : prunerExpr));
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9553) Fix log-line in Partition Pruner

2015-02-02 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9553:
---
Affects Version/s: 0.14.0

 Fix log-line in Partition Pruner
 

 Key: HIVE-9553
 URL: https://issues.apache.org/jira/browse/HIVE-9553
 Project: Hive
  Issue Type: Bug
  Components: Logging
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
Priority: Trivial
 Attachments: HIVE-9553.1.patch


 Minor issue in logging the prune-expression in the PartitionPruner:
 {code:title=PartitionPruner.java|titleBGColor=#F7D6C1|bgColor=#CE}
 LOG.trace("prune Expression = " + prunerExpr == null ? "" : prunerExpr);
 {code}
 Given the operator precedence order, this should read:
 {code:title=PartitionPruner.java|titleBGColor=#F7D6C1|bgColor=#CE}
 LOG.trace("prune Expression = " + (prunerExpr == null ? "" : prunerExpr));
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9471) Bad seek in uncompressed ORC, at row-group boundary.

2015-02-02 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301693#comment-14301693
 ] 

Mithun Radhakrishnan commented on HIVE-9471:


Hey, [~prasanth_j]. Does this patch look alright now?

 Bad seek in uncompressed ORC, at row-group boundary.
 

 Key: HIVE-9471
 URL: https://issues.apache.org/jira/browse/HIVE-9471
 Project: Hive
  Issue Type: Bug
  Components: File Formats, Serializers/Deserializers
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9471.2.patch, HIVE-9471.3.patch, data.txt, 
 orc_bad_seek_failure_case.hive, orc_bad_seek_setup.hive


 Under at least one specific condition, using index-filters in ORC causes a 
 bad seek into the ORC row-group.
 {code:title=stacktrace}
 java.io.IOException: java.lang.IllegalArgumentException: Seek in Stream for 
 column 2 kind DATA to 0 is outside of the data
   at 
 org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:507)
   at 
 org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:414)
   at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
   at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1655)
   at 
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:227)
   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:305)
 ...
 Caused by: java.lang.IllegalArgumentException: Seek in Stream for column 2 
 kind DATA to 0 is outside of the data
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$UncompressedStream.seek(InStream.java:112)
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$UncompressedStream.seek(InStream.java:96)
   at 
 org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.seek(RunLengthIntegerReaderV2.java:310)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StringDictionaryTreeReader.seek(RecordReaderImpl.java:1596)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StringTreeReader.seek(RecordReaderImpl.java:1337)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.seek(RecordReaderImpl.java:1852)
 {code}
 I'll attach the script to reproduce the problem herewith.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9471) Bad seek in uncompressed ORC, at row-group boundary.

2015-01-29 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9471:
---
Attachment: HIVE-9471.3.patch

Here's the same, with the LENGTH stream suppressed.

 Bad seek in uncompressed ORC, at row-group boundary.
 

 Key: HIVE-9471
 URL: https://issues.apache.org/jira/browse/HIVE-9471
 Project: Hive
  Issue Type: Bug
  Components: File Formats, Serializers/Deserializers
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9471.2.patch, HIVE-9471.3.patch, data.txt, 
 orc_bad_seek_failure_case.hive, orc_bad_seek_setup.hive


 Under at least one specific condition, using index-filters in ORC causes a 
 bad seek into the ORC row-group.
 {code:title=stacktrace}
 java.io.IOException: java.lang.IllegalArgumentException: Seek in Stream for 
 column 2 kind DATA to 0 is outside of the data
   at 
 org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:507)
   at 
 org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:414)
   at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
   at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1655)
   at 
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:227)
   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:305)
 ...
 Caused by: java.lang.IllegalArgumentException: Seek in Stream for column 2 
 kind DATA to 0 is outside of the data
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$UncompressedStream.seek(InStream.java:112)
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$UncompressedStream.seek(InStream.java:96)
   at 
 org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.seek(RunLengthIntegerReaderV2.java:310)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StringDictionaryTreeReader.seek(RecordReaderImpl.java:1596)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StringTreeReader.seek(RecordReaderImpl.java:1337)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.seek(RecordReaderImpl.java:1852)
 {code}
 I'll attach the script to reproduce the problem herewith.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9471) Bad seek in uncompressed ORC, at row-group boundary.

2015-01-29 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9471:
---
Status: Open  (was: Patch Available)

Modifying the comment for the second null-check.

 Bad seek in uncompressed ORC, at row-group boundary.
 

 Key: HIVE-9471
 URL: https://issues.apache.org/jira/browse/HIVE-9471
 Project: Hive
  Issue Type: Bug
  Components: File Formats, Serializers/Deserializers
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9471.2.patch, HIVE-9471.3.patch, data.txt, 
 orc_bad_seek_failure_case.hive, orc_bad_seek_setup.hive


 Under at least one specific condition, using index-filters in ORC causes a 
 bad seek into the ORC row-group.
 {code:title=stacktrace}
 java.io.IOException: java.lang.IllegalArgumentException: Seek in Stream for 
 column 2 kind DATA to 0 is outside of the data
   at 
 org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:507)
   at 
 org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:414)
   at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
   at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1655)
   at 
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:227)
   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:305)
 ...
 Caused by: java.lang.IllegalArgumentException: Seek in Stream for column 2 
 kind DATA to 0 is outside of the data
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$UncompressedStream.seek(InStream.java:112)
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$UncompressedStream.seek(InStream.java:96)
   at 
 org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.seek(RunLengthIntegerReaderV2.java:310)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StringDictionaryTreeReader.seek(RecordReaderImpl.java:1596)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StringTreeReader.seek(RecordReaderImpl.java:1337)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.seek(RecordReaderImpl.java:1852)
 {code}
 I'll attach the script to reproduce the problem herewith.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9471) Bad seek in uncompressed ORC, at row-group boundary.

2015-01-29 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HIVE-9471:
---
Attachment: (was: HIVE-9471.3.patch)

 Bad seek in uncompressed ORC, at row-group boundary.
 

 Key: HIVE-9471
 URL: https://issues.apache.org/jira/browse/HIVE-9471
 Project: Hive
  Issue Type: Bug
  Components: File Formats, Serializers/Deserializers
Affects Versions: 0.14.0
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: HIVE-9471.2.patch, HIVE-9471.3.patch, data.txt, 
 orc_bad_seek_failure_case.hive, orc_bad_seek_setup.hive


 Under at least one specific condition, using index-filters in ORC causes a 
 bad seek into the ORC row-group.
 {code:title=stacktrace}
 java.io.IOException: java.lang.IllegalArgumentException: Seek in Stream for 
 column 2 kind DATA to 0 is outside of the data
   at 
 org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:507)
   at 
 org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:414)
   at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
   at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1655)
   at 
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:227)
   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:305)
 ...
 Caused by: java.lang.IllegalArgumentException: Seek in Stream for column 2 
 kind DATA to 0 is outside of the data
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$UncompressedStream.seek(InStream.java:112)
   at 
 org.apache.hadoop.hive.ql.io.orc.InStream$UncompressedStream.seek(InStream.java:96)
   at 
 org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.seek(RunLengthIntegerReaderV2.java:310)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StringDictionaryTreeReader.seek(RecordReaderImpl.java:1596)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StringTreeReader.seek(RecordReaderImpl.java:1337)
   at 
 org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.seek(RecordReaderImpl.java:1852)
 {code}
 I'll attach the script to reproduce the problem herewith.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

