Hive should store the full table schema in partition storage descriptors
------------------------------------------------------------------------

                 Key: HIVE-2950
                 URL: https://issues.apache.org/jira/browse/HIVE-2950
             Project: Hive
          Issue Type: Bug
            Reporter: Travis Crawford
            Assignee: Travis Crawford


Hive tables have a schema, which is copied into the partition storage 
descriptor when adding a partition. Currently only columns stored in the table 
storage descriptor are copied - columns that are reported by the serde are not 
copied. Instead of copying the table storage descriptor columns into the 
partition columns, the full table schema should be copied.


DETAILS

This is a little long but is necessary to show 3 things: current behavior when 
explicitly listing columns, behavior with HIVE-2941 patched in and serde 
reported columns, and finally the behavior with this patch (full table schema 
copied into the partition storage descriptor).


Here's an example of what currently happens. Note the following:

* the two manually-defined fields defined for the table are listed in the table 
storage descriptor.
* both fields are present in the partition storage descriptor

This works great because users who query for a partition can look at its 
storage descriptor and get the schema.

{code}
hive> create external table foo_test (name string, age int) partitioned by 
(part_dt string);
hive> describe extended foo_test;
OK
name    string  
age     int     
part_dt string  
                 
Detailed Table Information      Table(tableName:foo_test, dbName:travis_test, 
owner:travis, createTime:1334256062, lastAccessTime:0, retention:0, 
sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), 
FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt, 
type:string, comment:null)], 
location:hdfs://foo.com/warehouse/travis_test.db/foo_test, 
inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
parameters:{serialization.format=1}), bucketCols:[], sortCols:[], 
parameters:{}, primaryRegionName:, secondaryRegions:[]), 
partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], 
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062}, 
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)     
Time taken: 0.082 seconds

hive> alter table foo_test add partition (part_dt = '20120331T000000Z') 
location 'hdfs://foo.com/foo/2012/03/31/00';
hive> describe extended foo_test partition (part_dt = '20120331T000000Z');
OK
name    string  
age     int     
part_dt string  
                 
Detailed Partition Information  Partition(values:[20120331T000000Z], 
dbName:travis_test, tableName:foo_test, createTime:1334256131, 
lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, 
type:string, comment:null), FieldSchema(name:age, type:int, comment:null), 
FieldSchema(name:part_dt, type:string, comment:null)], 
location:hdfs://foo.com/foo/2012/03/31/00, 
inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
parameters:{serialization.format=1}), bucketCols:[], sortCols:[], 
parameters:{}, primaryRegionName:, secondaryRegions:[]), 
parameters:{transient_lastDdlTime=1334256131})      
{code}




CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN

Now let's examine what happens when creating a table when the serde reports the 
schema. Notice the following:

* The table storage descriptor contains an empty list of columns. However, the 
table schema is available from the serde reflecting on the serialization class.
* The partition storage descriptor does contain a single "part_dt" column that 
was copied from the table partition keys. The actual data columns are not 
present.

{code}
hive> create external table travis_test.person_test partitioned by (part_dt 
string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with 
serdeproperties 
("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") 
stored as inputformat 
"com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat 
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
OK
Time taken: 0.08 seconds
hive> describe extended person_test;
OK
name    struct<first_name:string,last_name:string>      from deserializer
id      int     from deserializer
email   string  from deserializer
phones  array<struct<number:string,type:struct<value:int>>>     from 
deserializer
part_dt string  
                 
Detailed Table Information      Table(tableName:person_test, 
dbName:travis_test, owner:travis, createTime:1334256942, lastAccessTime:0, 
retention:0, sd:StorageDescriptor(cols:[], 
location:hdfs://foo.com/warehouse/travis_test.db/person_test, 
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, 
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
 serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, 
primaryRegionName:, secondaryRegions:[]), 
partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], 
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942}, 
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE) 
Time taken: 0.147 seconds
hive> alter table person_test add partition (part_dt = '20120331T000000Z') 
location 'hdfs://foo.com/foo/2012/03/31/00'; 
OK
Time taken: 0.149 seconds
hive> describe extended person_test partition (part_dt = '20120331T000000Z');
OK
part_dt string  
                 
Detailed Partition Information  Partition(values:[20120331T000000Z], 
dbName:travis_test, tableName:person_test, createTime:1334257029, 
lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt, 
type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, 
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, 
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
 serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, 
primaryRegionName:, secondaryRegions:[]), 
parameters:{transient_lastDdlTime=1334257029}) 
Time taken: 0.106 seconds
hive> 
{code}


PROPOSED BEHAVIOR


I believe the correct thing to do is copy the full table schema (serde-reported 
columns + partition keys) into the partition storage descriptor. Notice the 
following:

* Table storage descriptor does not contain any columns, because they are 
reported by the serde.
* Partition storage descriptor now contains both the serde-reported schema, and 
full table schema.

{code}
hive> create external table travis_test.person_test partitioned by (part_dt 
string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with 
serdeproperties 
("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") 
stored as inputformat 
"com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat 
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
OK
Time taken: 0.076 seconds
hive> describe extended person_test;                                            
                                                                                
         OK                                                                     
                                                                                
                  name    struct<first_name:string,last_name:string>      from 
deserializer
id      int     from deserializer
email   string  from deserializer
phones  array<struct<number:string,type:struct<value:int>>>     from 
deserializer
part_dt string  
                 
Detailed Table Information      Table(tableName:person_test, 
dbName:travis_test, owner:travis, createTime:1334257489, lastAccessTime:0, 
retention:0, sd:StorageDescriptor(cols:[], 
location:hdfs://foo.com/warehouse/travis_test.db/person_test, 
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, 
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
 serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, 
primaryRegionName:, secondaryRegions:[]), 
partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], 
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489}, 
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE) 
Time taken: 0.155 seconds
hive> alter table person_test add partition (part_dt = '20120331T000000Z') 
location 'hdfs://foo.com/foo/2012/03/31/00';
OK                                                                              
                                                                                
         Time taken: 0.296 seconds                                        
hive> describe extended person_test partition (part_dt = '20120331T000000Z');   
                                                                                
         OK                                                                     
                                                                                
                  name    struct<first_name:string,last_name:string>      from 
deserializer
id      int     from deserializer
email   string  from deserializer
phones  array<struct<number:string,type:struct<value:int>>>     from 
deserializer
part_dt string  
                 
Detailed Partition Information  Partition(values:[20120331T000000Z], 
dbName:travis_test, tableName:person_test, createTime:1334257504, 
lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, 
type:struct<first_name:string,last_name:string>, comment:from deserializer), 
FieldSchema(name:id, type:int, comment:from deserializer), 
FieldSchema(name:email, type:string, comment:from deserializer), 
FieldSchema(name:phones, 
type:array<struct<number:string,type:struct<value:int>>>, comment:from 
deserializer), FieldSchema(name:part_dt, type:string, comment:null)], 
location:hdfs://foo.com/foo/2012/03/31/00, 
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, 
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
 serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, 
primaryRegionName:, secondaryRegions:[]), 
parameters:{transient_lastDdlTime=1334257504})  
Time taken: 0.133 seconds
hive> 
{code}


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to