[ https://issues.apache.org/jira/browse/HIVE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404357#comment-13404357 ]
Travis Crawford commented on HIVE-2950: --------------------------------------- Status update: Looking into this a bit more, I think we can avoid storing the cols in the metastore if we simply allow partitions to report cols from the serde. Something like this: {code:title=ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java} public List<FieldSchema> getCols() { - return tPartition.getSd().getCols(); + if (SerDeUtils.shouldGetColsFromSerDe(table.getSerializationLib())) { + return table.getCols(); + } else { + return tPartition.getSd().getCols(); + } } {code} For thrift/protobuf this would work perfectly, since you want all records to have the newest schema, and let thrift/protobuf deal with figuring out missing values, unknown fields, etc. Thoughts? > Hive should store the full table schema in partition storage descriptors > ------------------------------------------------------------------------ > > Key: HIVE-2950 > URL: https://issues.apache.org/jira/browse/HIVE-2950 > Project: Hive > Issue Type: Bug > Reporter: Travis Crawford > Assignee: Travis Crawford > Attachments: HIVE-2950.D2769.1.patch > > > Hive tables have a schema, which is copied into the partition storage > descriptor when adding a partition. Currently only columns stored in the > table storage descriptor are copied - columns that are reported by the serde > are not copied. Instead of copying the table storage descriptor columns into > the partition columns, the full table schema should be copied. > DETAILS > This is a little long but is necessary to show 3 things: current behavior > when explicitly listing columns, behavior with HIVE-2941 patched in and serde > reported columns, and finally the behavior with this patch (full table schema > copied into the partition storage descriptor). > Here's an example of what currently happens. Note the following: > * the two manually-defined fields defined for the table are listed in the > table storage descriptor. > * both fields are present in the partition storage descriptor > This works great because users who query for a partition can look at its > storage descriptor and get the schema. > {code} > hive> create external table foo_test (name string, age int) partitioned by > (part_dt string); > hive> describe extended foo_test; > OK > name string > age int > part_dt string > > Detailed Table Information Table(tableName:foo_test, dbName:travis_test, > owner:travis, createTime:1334256062, lastAccessTime:0, retention:0, > sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), > FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt, > type:string, comment:null)], > location:hdfs://foo.com/warehouse/travis_test.db/foo_test, > inputFormat:org.apache.hadoop.mapred.TextInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > parameters:{serialization.format=1}), bucketCols:[], sortCols:[], > parameters:{}, primaryRegionName:, secondaryRegions:[]), > partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], > parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062}, > viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE) > Time taken: 0.082 seconds > hive> alter table foo_test add partition (part_dt = '20120331T000000Z') > location 'hdfs://foo.com/foo/2012/03/31/00'; > hive> describe extended foo_test partition (part_dt = '20120331T000000Z'); > OK > name string > age int > part_dt string > > Detailed Partition Information Partition(values:[20120331T000000Z], > dbName:travis_test, tableName:foo_test, createTime:1334256131, > lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, > type:string, comment:null), FieldSchema(name:age, type:int, comment:null), > FieldSchema(name:part_dt, type:string, comment:null)], > location:hdfs://foo.com/foo/2012/03/31/00, > inputFormat:org.apache.hadoop.mapred.TextInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > parameters:{serialization.format=1}), bucketCols:[], sortCols:[], > parameters:{}, primaryRegionName:, secondaryRegions:[]), > parameters:{transient_lastDdlTime=1334256131}) > {code} > CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN > Now let's examine what happens when creating a table when the serde reports > the schema. Notice the following: > * The table storage descriptor contains an empty list of columns. However, > the table schema is available from the serde reflecting on the serialization > class. > * The partition storage descriptor does contain a single "part_dt" column > that was copied from the table partition keys. The actual data columns are > not present. > {code} > hive> create external table travis_test.person_test partitioned by (part_dt > string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" > with serdeproperties > ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") > stored as inputformat > "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat > "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"; > OK > Time taken: 0.08 seconds > hive> describe extended person_test; > OK > name struct<first_name:string,last_name:string> from deserializer > id int from deserializer > email string from deserializer > phones array<struct<number:string,type:struct<value:int>>> from > deserializer > part_dt string > > Detailed Table Information Table(tableName:person_test, > dbName:travis_test, owner:travis, createTime:1334256942, lastAccessTime:0, > retention:0, sd:StorageDescriptor(cols:[], > location:hdfs://foo.com/warehouse/travis_test.db/person_test, > inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, > parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, > serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, > primaryRegionName:, secondaryRegions:[]), > partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], > parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942}, > viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE) > Time taken: 0.147 seconds > hive> alter table person_test add partition (part_dt = '20120331T000000Z') > location 'hdfs://foo.com/foo/2012/03/31/00'; > OK > Time taken: 0.149 seconds > hive> describe extended person_test partition (part_dt = '20120331T000000Z'); > OK > part_dt string > > Detailed Partition Information Partition(values:[20120331T000000Z], > dbName:travis_test, tableName:person_test, createTime:1334257029, > lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt, > type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, > inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, > parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, > serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, > primaryRegionName:, secondaryRegions:[]), > parameters:{transient_lastDdlTime=1334257029}) > Time taken: 0.106 seconds > hive> > {code} > PROPOSED BEHAVIOR > I believe the correct thing to do is copy the full table schema > (serde-reported columns + partition keys) into the partition storage > descriptor. Notice the following: > * Table storage descriptor does not contain any columns, because they are > reported by the serde. > * Partition storage descriptor now contains both the serde-reported schema, > and full table schema. > {code} > hive> create external table travis_test.person_test partitioned by (part_dt > string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" > with serdeproperties > ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") > stored as inputformat > "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat > "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"; > OK > Time taken: 0.076 seconds > hive> describe extended person_test; > > OK > > name struct<first_name:string,last_name:string> > from deserializer > id int from deserializer > email string from deserializer > phones array<struct<number:string,type:struct<value:int>>> from > deserializer > part_dt string > > Detailed Table Information Table(tableName:person_test, > dbName:travis_test, owner:travis, createTime:1334257489, lastAccessTime:0, > retention:0, sd:StorageDescriptor(cols:[], > location:hdfs://foo.com/warehouse/travis_test.db/person_test, > inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, > parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, > serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, > primaryRegionName:, secondaryRegions:[]), > partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], > parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489}, > viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE) > Time taken: 0.155 seconds > hive> alter table person_test add partition (part_dt = '20120331T000000Z') > location 'hdfs://foo.com/foo/2012/03/31/00'; > OK > > Time taken: 0.296 seconds > hive> describe extended person_test partition (part_dt = '20120331T000000Z'); > > OK > > name struct<first_name:string,last_name:string> > from deserializer > id int from deserializer > email string from deserializer > phones array<struct<number:string,type:struct<value:int>>> from > deserializer > part_dt string > > Detailed Partition Information Partition(values:[20120331T000000Z], > dbName:travis_test, tableName:person_test, createTime:1334257504, > lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, > type:struct<first_name:string,last_name:string>, comment:from deserializer), > FieldSchema(name:id, type:int, comment:from deserializer), > FieldSchema(name:email, type:string, comment:from deserializer), > FieldSchema(name:phones, > type:array<struct<number:string,type:struct<value:int>>>, comment:from > deserializer), FieldSchema(name:part_dt, type:string, comment:null)], > location:hdfs://foo.com/foo/2012/03/31/00, > inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, > parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, > serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, > primaryRegionName:, secondaryRegions:[]), > parameters:{transient_lastDdlTime=1334257504}) > Time taken: 0.133 seconds > hive> > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira