The response of GetResultSetMetadata is inconsistent with TCLIService.thrift for complex types

2017-11-26 Thread Joseph Yen
I was trying to add decimal, timestamp, date, array, and map type support to
PyHive DBAPI. In order to parse the result set correctly, I have to know
the result set schema for each SELECT. For simple types (integer, string,
timestamp, decimal, …) this is not a problem: I can get all the information
by calling HiveServer2.GetResultSetMetadata. But for complex types (array,
map, struct), the nested type information is missing. I can't find a way to
know if a column is an integer array or a string array.

According to TCLIService.thrift, recursively defined types such as array and
map should be described by TTypeEntry.arrayEntry and TTypeEntry.mapEntry
rather than TTypeEntry.primitiveEntry in the first element of
TypeDesc.types. The nested types should reside in TypeDesc.types as the
following elements, pointed to from the first element.

However, when I actually called GetResultSetMetadata for the query
SELECT array(1, 2, 3), I got just a single TTypeEntry.primitiveEntry in
TypeDesc.types with TPrimitiveTypeEntry.type = ARRAY_TYPE.

This violates both of the thrift file's descriptions: "TTypeDesc employs a
type list that maps integer "pointers" to TTypeEntry objects" and "The
primitive type token. This must satisfy the condition that type is in the
PRIMITIVE_TYPES set."
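
To make the pointer scheme concrete, here is a minimal sketch of how a
client could decode a spec-conformant TTypeDesc back into a type string by
following the integer pointers. The field names come from TCLIService.thrift;
the TYPE_NAMES mapping is a hand-written, abbreviated assumption, not part of
the thrift file.

TYPE_NAMES = {3: 'int', 7: 'string'}  # abbreviated TTypeId mapping; other values elided

def render_type(type_desc, idx=0):
    # Decode the TTypeEntry at position idx of type_desc.types,
    # recursing through the integer "pointers" for nested types.
    entry = type_desc.types[idx]
    if entry.primitiveEntry is not None:
        return TYPE_NAMES.get(entry.primitiveEntry.type, '?')
    if entry.arrayEntry is not None:
        return 'array<%s>' % render_type(type_desc, entry.arrayEntry.objectTypePtr)
    if entry.mapEntry is not None:
        return 'map<%s,%s>' % (render_type(type_desc, entry.mapEntry.keyTypePtr),
                               render_type(type_desc, entry.mapEntry.valueTypePtr))
    raise ValueError('unsupported TTypeEntry')

Applied to the expected tt.c entry shown further below, this would yield
map<int,array<string>>; applied to the actual response, there is simply no
nested type information to decode.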


I tried the following script.

create temporary table dummy(a int);
insert into table dummy values (1), (2), (3);
create temporary table tt(a int, b string, c map<int, array<string>>);
insert into table tt select 1, 'a', map(3, array('a','b','c')) from dummy limit 1;
select * from tt;

And called GetResultSetMetadata right after executing the SELECT query.
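For reference, this is roughly how that call can be made from Python. It is
a sketch that reaches into PyHive internals: the private attributes _client
and _operationHandle are implementation details and may differ between
versions, and the host/port are placeholders.

from pyhive import hive
from TCLIService import ttypes

conn = hive.Connection(host='localhost', port=10000)  # placeholder host/port
cur = conn.cursor()
cur.execute('select * from tt')
# Ask HiveServer2 for the result set schema of the running operation.
req = ttypes.TGetResultSetMetadataReq(operationHandle=cur._operationHandle)
resp = conn._client.GetResultSetMetadata(req)
print(resp.schema.columns)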
The value of response.schema.columns was

[TColumnDesc(columnName='tt.a', typeDesc=TTypeDesc(types=[
   TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=3, typeQualifiers=None),
              arrayEntry=None, mapEntry=None, structEntry=None,
              unionEntry=None, userDefinedTypeEntry=None)]),
  position=1, comment=None),
 TColumnDesc(columnName='tt.b', typeDesc=TTypeDesc(types=[
   TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=7, typeQualifiers=None),
              arrayEntry=None, mapEntry=None, structEntry=None,
              unionEntry=None, userDefinedTypeEntry=None)]),
  position=2, comment=None),
 TColumnDesc(columnName='tt.c', typeDesc=TTypeDesc(types=[
   TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=11, typeQualifiers=None),
              arrayEntry=None, mapEntry=None, structEntry=None,
              unionEntry=None, userDefinedTypeEntry=None)]),
  position=3, comment=None)]

However, according to the thrift file, it should be

[TColumnDesc(columnName='tt.a', typeDesc=TTypeDesc(types=[
   TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=3, typeQualifiers=None),
              arrayEntry=None, mapEntry=None, structEntry=None,
              unionEntry=None, userDefinedTypeEntry=None)]),
  position=1, comment=None),
 TColumnDesc(columnName='tt.b', typeDesc=TTypeDesc(types=[
   TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=7, typeQualifiers=None),
              arrayEntry=None, mapEntry=None, structEntry=None,
              unionEntry=None, userDefinedTypeEntry=None)]),
  position=2, comment=None),
 TColumnDesc(columnName='tt.c', typeDesc=TTypeDesc(types=[
   TTypeEntry(primitiveEntry=None, arrayEntry=None,
              mapEntry=TMapTypeEntry(keyTypePtr=1, valueTypePtr=2),
              structEntry=None, unionEntry=None, userDefinedTypeEntry=None),
   TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=3, typeQualifiers=None),   # map key: int
              arrayEntry=None, mapEntry=None, structEntry=None,
              unionEntry=None, userDefinedTypeEntry=None),
   TTypeEntry(primitiveEntry=None, arrayEntry=TArrayTypeEntry(objectTypePtr=3),  # map value: array
              mapEntry=None, structEntry=None,
              unionEntry=None, userDefinedTypeEntry=None),
   TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=7, typeQualifiers=None),   # array element: string
              arrayEntry=None, mapEntry=None, structEntry=None,
              unionEntry=None, userDefinedTypeEntry=None)]),
  position=3, comment=None)]

I found the related function in the Hive codebase:
https://github.com/apache/hive/blob/release-1.2.1/service/src/java/org/apache/hive/service/cli/TypeDescriptor.java#L66-L76
It seems that this function always puts a TPrimitiveTypeEntry into
TTypeDesc.types, even for complex types like array and map, which is
inconsistent with the thrift file.


[jira] [Created] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-11-26 Thread Rui Li (JIRA)
Rui Li created HIVE-18148:
-

 Summary: NPE in SparkDynamicPartitionPruningResolver
 Key: HIVE-18148
 URL: https://issues.apache.org/jira/browse/HIVE-18148
 Project: Hive
  Issue Type: Bug
  Components: Spark
Reporter: Rui Li
Assignee: Rui Li


The stack trace is:
{noformat}
2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] ql.Driver: FAILED: NullPointerException null
java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
    at org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
    at org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
    at org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
    at org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
    at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
{noformat}
At this stage, there shouldn't be a DPP sink whose target MapWork is null. The
root cause seems to be a malformed operator tree generated by SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Review Request 63972: [HIVE-18037] Migrate Slider LLAP package to YARN Service framework for Hadoop 3.x

2017-11-26 Thread Gour Saha


> On Nov. 21, 2017, 4:09 a.m., Sergey Shelukhin wrote:
> > llap-server/bin/llapDaemon.sh
> > Line 116 (original), 116 (patched)
> >
> > what is this change for?

The process run as launch_command (see templates.py) needs to be a foreground 
process. That's why I was asking offline whether anything other than this llap 
app-package uses this script.


> On Nov. 21, 2017, 4:09 a.m., Sergey Shelukhin wrote:
> > llap-server/src/java/org/apache/hadoop/hive/llap/cli/LlapOptionsProcessor.java
> > Line 65 (original), 65 (patched)
> >
> > hmm... is it possible to keep old names as backward compat for scripts? 
> > or accept both names

The packaging in 3.x has changed significantly, so trying to do anything for 
the sake of backward compatibility won't be worthwhile. Also, since Slider is 
deprecated and the Apache project is being wound down, there is no point in 
keeping any traces of its name. All traces of the Slider keyword will be 
completely removed once all features of the packaging have been migrated to 
YARN Services.


> On Nov. 21, 2017, 4:09 a.m., Sergey Shelukhin wrote:
> > llap-server/src/java/org/apache/hadoop/hive/llap/cli/LlapSliderUtils.java
> > Line 47 (original), 46 (patched)
> >
> > is this still needed?

Will be removed in the next pass once status and diagnostics are migrated. It's 
mentioned in the JIRA as well.


> On Nov. 21, 2017, 4:09 a.m., Sergey Shelukhin wrote:
> > llap-server/src/java/org/apache/hadoop/hive/llap/cli/LlapSliderUtils.java
> > Lines 176 (patched)
> >
> > is it a good idea to ignore all exceptions? the old code used to ignore 
> > UnknownApp.. only

Unfortunately the YARN Service API does not throw such an exception, so it 
does not provide a differentiator. Since this is in start cluster, it's okay 
to ignore and move on: if destroy did not complete successfully, start will 
fail.


> On Nov. 21, 2017, 4:09 a.m., Sergey Shelukhin wrote:
> > llap-server/src/java/org/apache/hadoop/hive/llap/cli/LlapSliderUtils.java
> > Lines 182 (patched)
> >
> > should this be configurable? or at least a constant

Every app-package can choose its own location to upload the package to. It 
doesn't have to be this specific path, as long as the same path is specified 
in the Yarnfile (refer to templates.py). The reason I did not create a 
constant is that there is no other reference to this path on the Java side; 
the only other reference is from templates.py. Nevertheless, I can create a 
constant for it in the next pass.


> On Nov. 21, 2017, 4:09 a.m., Sergey Shelukhin wrote:
> > llap-server/src/java/org/apache/hadoop/hive/llap/cli/LlapStatusServiceDriver.java
> > Line 352 (original)
> >
> > where did the timeout logic go? I saw the code above that seems to fail 
> > immediately when the app is not running, but no timeout logic

The timeout logic lives in this method, which is now in LlapSliderUtils and is 
used from there.


> On Nov. 21, 2017, 4:09 a.m., Sergey Shelukhin wrote:
> > llap-server/src/main/resources/package.py
> > Line 184 (original)
> >
> > hmm... that doesn't do anything anymore? I think at least the scripts 
> > would still be needed, right?

Nope, the scripts are not needed anymore.


> On Nov. 21, 2017, 4:09 a.m., Sergey Shelukhin wrote:
> > llap-server/src/main/resources/templates.py
> > Lines 43 (patched)
> >
> > how does it know what LLAP_DAEMON_OPTS is, and other stuff like 
> > HEAPSIZE? it doesn't seem to be mentioned elsewhere in the patch and 
> > doesn't seem to follow the convention (e.g. component name is LLAP without 
> > DAEMON). Just checking; it used to have a fancy name like site.global. ...

YARN Services automatically takes care of setting the env variables defined in 
the Yarnfile before calling the launch_command.
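
For illustration, here is a rough sketch (written as a Python dict, since the
real template lives in templates.py) of the kind of per-component env block a
Yarnfile can carry. The names and values below are assumptions for
illustration, not the actual patch contents.

# Illustrative only: names/values are assumptions; see templates.py for the
# real Yarnfile template. YARN Services exports each entry under
# configuration.env into the container environment before it invokes
# launch_command.
llap_component = {
    "name": "llap",
    "launch_command": "/path/to/llapDaemon.sh start",  # hypothetical path
    "configuration": {
        "env": {
            "LLAP_DAEMON_OPTS": "...",       # picked up by the daemon script
            "LLAP_DAEMON_HEAPSIZE": "4096",  # illustrative value
        }
    },
}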


- Gour


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63972/#review191563
---


On Nov. 21, 2017, 1:37 a.m., Gour Saha wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63972/
> ---
> 
> (Updated Nov. 21, 2017, 1:37 a.m.)
> 
> 
> Review