[jira] [Commented] (HIVE-2950) Hive should store the full table schema in partition storage descriptors

Travis Crawford (JIRA) Fri, 29 Jun 2012 18:24:47 -0700

    [ https://issues.apache.org/jira/browse/HIVE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404357#comment-13404357 ]


Travis Crawford commented on HIVE-2950:
---------------------------------------

Status update:

After looking into this further, I think we can avoid storing the columns in 
the metastore if we simply let partitions report their columns from the serde. 
Something like this:

{code:title=ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java}
   public List<FieldSchema> getCols() {
-    return tPartition.getSd().getCols();
+    if (SerDeUtils.shouldGetColsFromSerDe(table.getSerializationLib())) {
+      return table.getCols();
+    } else {
+      return tPartition.getSd().getCols();
+    }
   }
{code}

For thrift/protobuf this should work well: you want every record read with 
the newest schema, and thrift/protobuf can deal with missing values, unknown 
fields, etc.
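
To make the dispatch above concrete, here is a minimal, self-contained sketch. 
The Table and Partition classes are simplified stand-ins rather than Hive's 
real metadata classes, columns are plain "name:type" strings, and the body of 
shouldGetColsFromSerDe is an assumption (a lookup by serde class name); only 
the shape of getCols() follows the diff.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-ins for Hive's Table and Partition; hypothetical, not the real API.
class PartitionColsSketch {

    // Assumption: serdes that report their own columns are known by class name.
    static final Set<String> SERDES_REPORTING_COLS = new HashSet<>(Arrays.asList(
            "com.twitter.elephantbird.hive.serde.ThriftSerDe"));

    static boolean shouldGetColsFromSerDe(String serializationLib) {
        return SERDES_REPORTING_COLS.contains(serializationLib);
    }

    static final class Table {
        private final String serializationLib;
        private final List<String> cols; // current table columns, as "name:type"

        Table(String serializationLib, List<String> cols) {
            this.serializationLib = serializationLib;
            this.cols = cols;
        }

        String getSerializationLib() { return serializationLib; }

        List<String> getCols() { return cols; }
    }

    static final class Partition {
        private final Table table;
        private final List<String> sdCols; // columns stored in the partition storage descriptor

        Partition(Table table, List<String> sdCols) {
            this.table = table;
            this.sdCols = sdCols;
        }

        // Same branching as the diff: ask the table when the serde reports
        // columns, otherwise use what was stored with the partition.
        List<String> getCols() {
            if (shouldGetColsFromSerDe(table.getSerializationLib())) {
                return table.getCols();
            } else {
                return sdCols;
            }
        }
    }

    public static void main(String[] args) {
        Table thriftTable = new Table(
                "com.twitter.elephantbird.hive.serde.ThriftSerDe",
                Arrays.asList("id:int", "email:string"));
        // The stored partition columns are stale: only the partition key is present.
        Partition thriftPart = new Partition(thriftTable, Arrays.asList("part_dt:string"));
        System.out.println(thriftPart.getCols());

        Table textTable = new Table(
                "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                Arrays.asList("name:string", "age:int"));
        Partition textPart = new Partition(textTable, Arrays.asList("name:string"));
        System.out.println(textPart.getCols());
    }
}
```

In this sketch a partition created before a field was added to the thrift 
struct still reports the newest columns, while a LazySimpleSerDe partition 
keeps the columns it was created with.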

Thoughts?
                
> Hive should store the full table schema in partition storage descriptors
> ------------------------------------------------------------------------
>
>                 Key: HIVE-2950
>                 URL: https://issues.apache.org/jira/browse/HIVE-2950
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Travis Crawford
>            Assignee: Travis Crawford
>         Attachments: HIVE-2950.D2769.1.patch
>
>
> Hive tables have a schema, which is copied into the partition storage 
> descriptor when adding a partition. Currently only columns stored in the 
> table storage descriptor are copied - columns that are reported by the serde 
> are not copied. Instead of copying the table storage descriptor columns into 
> the partition columns, the full table schema should be copied.
> DETAILS
> This is a little long but is necessary to show 3 things: current behavior 
> when explicitly listing columns, behavior with HIVE-2941 patched in and serde 
> reported columns, and finally the behavior with this patch (full table schema 
> copied into the partition storage descriptor).
> Here's an example of what currently happens. Note the following:
> * the two fields defined manually for the table are listed in the table 
> storage descriptor.
> * both fields are present in the partition storage descriptor
> This works great because users who query for a partition can look at its 
> storage descriptor and get the schema.
> {code}
> hive> create external table foo_test (name string, age int) partitioned by 
> (part_dt string);
> hive> describe extended foo_test;
> OK
> name  string  
> age   int     
> part_dt       string  
>                
> Detailed Table Information    Table(tableName:foo_test, dbName:travis_test, 
> owner:travis, createTime:1334256062, lastAccessTime:0, retention:0, 
> sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), 
> FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt, 
> type:string, comment:null)], 
> location:hdfs://foo.com/warehouse/travis_test.db/foo_test, 
> inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[], 
> parameters:{}, primaryRegionName:, secondaryRegions:[]), 
> partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], 
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062}, 
> viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)     
> Time taken: 0.082 seconds
> hive> alter table foo_test add partition (part_dt = '20120331T000000Z') 
> location 'hdfs://foo.com/foo/2012/03/31/00';
> hive> describe extended foo_test partition (part_dt = '20120331T000000Z');
> OK
> name  string  
> age   int     
> part_dt       string  
>                
> Detailed Partition Information        Partition(values:[20120331T000000Z], 
> dbName:travis_test, tableName:foo_test, createTime:1334256131, 
> lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, 
> type:string, comment:null), FieldSchema(name:age, type:int, comment:null), 
> FieldSchema(name:part_dt, type:string, comment:null)], 
> location:hdfs://foo.com/foo/2012/03/31/00, 
> inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[], 
> parameters:{}, primaryRegionName:, secondaryRegions:[]), 
> parameters:{transient_lastDdlTime=1334256131})      
> {code}
> CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN
> Now let's examine what happens when creating a table when the serde reports 
> the schema. Notice the following:
> * The table storage descriptor contains an empty list of columns. However, 
> the table schema is available from the serde reflecting on the serialization 
> class.
> * The partition storage descriptor contains only a single "part_dt" column 
> that was copied from the table partition keys. The actual data columns are 
> not present.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt 
> string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" 
> with serdeproperties 
> ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") 
> stored as inputformat 
> "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat 
> "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.08 seconds
> hive> describe extended person_test;
> OK
> name  struct<first_name:string,last_name:string>      from deserializer
> id    int     from deserializer
> email string  from deserializer
> phones        array<struct<number:string,type:struct<value:int>>>     from 
> deserializer
> part_dt       string  
>                
> Detailed Table Information    Table(tableName:person_test, 
> dbName:travis_test, owner:travis, createTime:1334256942, lastAccessTime:0, 
> retention:0, sd:StorageDescriptor(cols:[], 
> location:hdfs://foo.com/warehouse/travis_test.db/person_test, 
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, 
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
>  serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, 
> primaryRegionName:, secondaryRegions:[]), 
> partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], 
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942}, 
> viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE) 
> Time taken: 0.147 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') 
> location 'hdfs://foo.com/foo/2012/03/31/00'; 
> OK
> Time taken: 0.149 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> part_dt       string  
>                
> Detailed Partition Information        Partition(values:[20120331T000000Z], 
> dbName:travis_test, tableName:person_test, createTime:1334257029, 
> lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt, 
> type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, 
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, 
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
>  serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, 
> primaryRegionName:, secondaryRegions:[]), 
> parameters:{transient_lastDdlTime=1334257029}) 
> Time taken: 0.106 seconds
> hive> 
> {code}
> PROPOSED BEHAVIOR
> I believe the correct thing to do is copy the full table schema 
> (serde-reported columns + partition keys) into the partition storage 
> descriptor. Notice the following:
> * Table storage descriptor does not contain any columns, because they are 
> reported by the serde.
> * Partition storage descriptor now contains the full table schema: the 
> serde-reported columns plus the partition keys.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt 
> string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" 
> with serdeproperties 
> ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") 
> stored as inputformat 
> "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat 
> "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.076 seconds
> hive> describe extended person_test;
> OK
> name  struct<first_name:string,last_name:string>      from deserializer
> id    int     from deserializer
> email string  from deserializer
> phones        array<struct<number:string,type:struct<value:int>>>     from 
> deserializer
> part_dt       string  
>                
> Detailed Table Information    Table(tableName:person_test, 
> dbName:travis_test, owner:travis, createTime:1334257489, lastAccessTime:0, 
> retention:0, sd:StorageDescriptor(cols:[], 
> location:hdfs://foo.com/warehouse/travis_test.db/person_test, 
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, 
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
>  serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, 
> primaryRegionName:, secondaryRegions:[]), 
> partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], 
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489}, 
> viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE) 
> Time taken: 0.155 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') 
> location 'hdfs://foo.com/foo/2012/03/31/00';
> OK
> Time taken: 0.296 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> name  struct<first_name:string,last_name:string>      from deserializer
> id    int     from deserializer
> email string  from deserializer
> phones        array<struct<number:string,type:struct<value:int>>>     from 
> deserializer
> part_dt       string  
>                
> Detailed Partition Information        Partition(values:[20120331T000000Z], 
> dbName:travis_test, tableName:person_test, createTime:1334257504, 
> lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, 
> type:struct<first_name:string,last_name:string>, comment:from deserializer), 
> FieldSchema(name:id, type:int, comment:from deserializer), 
> FieldSchema(name:email, type:string, comment:from deserializer), 
> FieldSchema(name:phones, 
> type:array<struct<number:string,type:struct<value:int>>>, comment:from 
> deserializer), FieldSchema(name:part_dt, type:string, comment:null)], 
> location:hdfs://foo.com/foo/2012/03/31/00, 
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, 
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
>  serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, 
> primaryRegionName:, secondaryRegions:[]), 
> parameters:{transient_lastDdlTime=1334257504})  
> Time taken: 0.133 seconds
> hive> 
> {code}
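
For comparison, the behavior proposed in the quoted description (copy the 
serde-reported columns plus the partition keys into the partition storage 
descriptor when the partition is added) can be sketched as follows. This is an 
illustration under assumptions, not the attached patch: fullTableSchema is a 
hypothetical helper, columns are plain "name:type" strings, and skipping 
partition keys that are already listed is a choice made here so the sketch 
also covers the first example, where part_dt already appears in the table 
columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; it does not use or mirror Hive's real classes.
class FullSchemaCopySketch {

    // Build the column list to store with a new partition.
    //   sdCols        columns listed in the table storage descriptor (may be empty)
    //   serdeCols     columns reported by the serde (used when sdCols is empty)
    //   partitionKeys the table's partition keys
    static List<String> fullTableSchema(List<String> sdCols,
                                        List<String> serdeCols,
                                        List<String> partitionKeys) {
        List<String> dataCols = sdCols.isEmpty() ? serdeCols : sdCols;
        List<String> full = new ArrayList<>(dataCols);
        for (String key : partitionKeys) {
            // Assumption: skip keys that are already present in the data columns.
            if (!full.contains(key)) {
                full.add(key);
            }
        }
        return full;
    }

    public static void main(String[] args) {
        // Serde-reported case: the table storage descriptor has no columns.
        List<String> partitionCols = fullTableSchema(
                Collections.<String>emptyList(),
                Arrays.asList("name:struct<first_name:string,last_name:string>",
                              "id:int", "email:string"),
                Arrays.asList("part_dt:string"));
        System.out.println(partitionCols);
    }
}
```

The trade-off against the getCols() approach is that a stored copy is a 
snapshot: it records the schema as of the time the partition was added, 
whereas asking the serde at read time always reflects the newest schema.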
