[jira] [Commented] (HIVE-2246) Dedupe tables' column schemas from partitions in the metastore db

[email protected] (JIRA) Sun, 24 Jul 2011 23:47:45 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070337#comment-13070337
 ]

[email protected] commented on HIVE-2246:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/#review1176
-----------------------------------------------------------

trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
<https://reviews.apache.org/r/1183/#comment2467>

    Is the CHARSET (latin1) the same as the one used for SDS? This will require 
the user's comments to be in latin1, which prevents UTF chars.
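
A minimal sketch of the concern, in Python rather than SQL. The helper name `fits_in_latin1` is hypothetical, and Python's `latin-1` codec (ISO-8859-1) only stands in for the MySQL column charset here. MySQL's `latin1` is based on Windows-1252, so edge cases can differ.

```python
def fits_in_latin1(comment: str) -> bool:
    """Return True if every character of the comment is representable in latin-1.

    Assumption: Python's ISO-8859-1 codec approximates the MySQL latin1 charset.
    """
    try:
        comment.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Western European text fits; characters outside latin-1 do not.
ascii_comment = "user id"
accented_comment = "caf\u00e9"          # 'e' with acute accent, U+00E9
cjk_comment = "\u7528\u6237ID"          # two CJK characters followed by "ID"
```

A column comment such as `cjk_comment` could not be stored unchanged in a latin1 column, which is the limitation raised in the comment above.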

trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
<https://reviews.apache.org/r/1183/#comment2466>

    Can you also add a migration script for Derby? We support Derby as a 
default metastore RDBMS as well. 

trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
<https://reviews.apache.org/r/1183/#comment2468>

    Here, do you check whether the 'alter table' command changes the schema 
(column definitions)? If it just sets a table property, then you don't need to 
create a new ColumnDescriptor, right?

    Also, if a table's schema is changed, a new CD will be created, but the old 
partition will still have the old CD. When we query the old partition, do we 
use the old partition's CD or the table's CD? 

    Also in the above case, when you run 'desc table partition 
<old_partition>', do you return the old partition's CD or the table's CD? 

- Ning
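
The first ObjectStore.java question above can be sketched as follows. This is an illustration in Python, not the patch's Java code: `Column`, `columns_changed`, and `cd_for_altered_table` are hypothetical names, and the sketch assumes a column descriptor is just an ordered list of (name, type, comment) columns.

```python
from typing import List, NamedTuple, Tuple


class Column(NamedTuple):
    """Hypothetical stand-in for a column (name, type, comment)."""
    name: str
    type: str
    comment: str = ""


def columns_changed(old: List[Column], new: List[Column]) -> bool:
    """True if the ordered column list differs in any name, type, or comment."""
    return list(old) != list(new)


def cd_for_altered_table(old_cd_id: int, old_cols: List[Column],
                         new_cols: List[Column], next_cd_id: int) -> Tuple[int, bool]:
    """Return (cd_id, created) for a table after an ALTER TABLE.

    A new column descriptor is used only when the columns changed;
    a property-only change keeps the existing descriptor.
    """
    if columns_changed(old_cols, new_cols):
        return next_cd_id, True
    return old_cd_id, False


old = [Column("id", "int"), Column("name", "string")]
property_only = cd_for_altered_table(1, old, list(old), 2)
added_column = cd_for_altered_table(1, old, old + [Column("ts", "bigint")], 2)
```

Whether a comment-only change should count as a schema change is a design choice. This sketch treats it as one because the comment is part of the compared tuple.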

On 2011-07-22 05:30:29, Sohan Jain wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/1183/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-07-22 05:30:29)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and Paul Yang.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This patch tries to make minimal changes to the API while keeping 
migration short and somewhat easy to revert.
bq.  
bq.  The new schema can be described as follows:
bq.  - CDS is a table corresponding to Column Descriptor objects.  Currently, 
it only stores a CD_ID.
bq.  - COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns. 
 A Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to 
the CD_ID to which it belongs.
bq.  - SDS was modified to reference a Column Descriptor. So SDS now has a 
foreign key to a CD_ID which describes its columns.
bq.  
bq.  During migration, we create Column Descriptors for tables in a 
straightforward manner: their columns are now just wrapped inside a column 
descriptor.  The SDS of partitions use their parent table's column descriptor, 
since currently a partition and its table share the same list of columns.
bq.  
bq.  When altering or adding a partition, give it its parent table's column 
descriptor IF the columns they describe are the same.  Otherwise, create a new 
column descriptor for its columns.
bq.  
bq.  When adding or altering a table, create a new column descriptor every time.
bq.  
bq.  Whenever you drop a storage descriptor (e.g., when dropping tables or 
partitions), check to see if the related column descriptor has any other 
references in the table.  That is, check to see if any other storage 
descriptors point to that column descriptor.  If none do, then delete that 
column descriptor.  This check is in place so we don't have unreferenced column 
descriptors and columns hanging around after schema evolution for tables.
bq.  
bq.  
bq.  This addresses bug HIVE-2246.
bq.      https://issues.apache.org/jira/browse/HIVE-2246
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql 
PRE-CREATION 
bq.    
trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 
1148945 
bq.    
trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java
 PRE-CREATION 
bq.    
trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java
 1148945 
bq.    trunk/metastore/src/model/package.jdo 1148945 
bq.  
bq.  Diff: https://reviews.apache.org/r/1183/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Passes Facebook's regression testing and all existing test cases.  In one 
instance, before migration, the overhead involved with storage descriptors and 
columns was ~11 GB.  After migration, the overhead was ~1.5 GB.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Sohan
bq.  
bq.
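
The column-descriptor rules in the quoted summary above (a new descriptor for every table add or alter, reuse of the parent's descriptor for partitions with identical columns, deletion of a descriptor once nothing references it) can be sketched with a toy in-memory model. This is not the ObjectStore implementation: `ToyMetastore` and its methods are hypothetical, and the two dictionaries only loosely mirror the CDS/COLUMNS_V2 and SDS tables.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Columns = Tuple[Tuple[str, str], ...]  # ordered (name, type) pairs


@dataclass
class ToyMetastore:
    cds: Dict[int, Columns] = field(default_factory=dict)  # CD_ID -> columns
    sds: Dict[int, int] = field(default_factory=dict)      # SD_ID -> CD_ID
    _next_id: int = 1

    def _new_id(self) -> int:
        n = self._next_id
        self._next_id += 1
        return n

    def add_table_sd(self, columns: Columns) -> int:
        """Adding (or altering) a table always creates a new column descriptor."""
        cd_id = self._new_id()
        self.cds[cd_id] = columns
        sd_id = self._new_id()
        self.sds[sd_id] = cd_id
        return sd_id

    def add_partition_sd(self, table_sd: int, columns: Columns) -> int:
        """Reuse the parent table's CD if the columns match, else create one."""
        table_cd = self.sds[table_sd]
        if self.cds[table_cd] == columns:
            cd_id = table_cd
        else:
            cd_id = self._new_id()
            self.cds[cd_id] = columns
        sd_id = self._new_id()
        self.sds[sd_id] = cd_id
        return sd_id

    def drop_sd(self, sd_id: int) -> None:
        """Drop a storage descriptor; delete its CD only if no other SD uses it."""
        cd_id = self.sds.pop(sd_id)
        if cd_id not in self.sds.values():
            del self.cds[cd_id]


m = ToyMetastore()
cols = (("id", "int"), ("name", "string"))
t = m.add_table_sd(cols)
p1 = m.add_partition_sd(t, cols)                        # shares the table's CD
p2 = m.add_partition_sd(t, cols + (("ts", "bigint"),))  # gets its own CD
```

Dropping `p2` removes its private descriptor immediately, while the shared descriptor survives until both `t` and `p1` are dropped.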

> Dedupe tables' column schemas from partitions in the metastore db
> -----------------------------------------------------------------
>
>                 Key: HIVE-2246
>                 URL: https://issues.apache.org/jira/browse/HIVE-2246
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>            Reporter: Sohan Jain
>            Assignee: Sohan Jain
>         Attachments: HIVE-2246.2.patch, HIVE-2246.3.patch
>
>
> Note: this patch proposes a schema change, and is therefore incompatible with 
> the current metastore.
> We can re-organize the JDO models to reduce space usage to keep the metastore 
> scalable for the future.  Currently, partitions are the fastest growing 
> objects in the metastore, and the metastore keeps a separate copy of the 
> columns list for each partition.  We can normalize the metastore db by 
> decoupling Columns from Storage Descriptors and not storing duplicate lists 
> of the columns for each partition. 
> An idea is to create an additional level of indirection with a "Column 
> Descriptor" that has a list of columns.  A table has a reference to its 
> latest Column Descriptor (note: a table may have more than one Column 
> Descriptor in the case of schema evolution).  Partitions and Indexes can 
> reference the same Column Descriptors as their parent table.
> Currently, the COLUMNS table in the metastore has roughly (number of 
> partitions + number of tables) * (average number of columns per table) rows.  
> We can reduce this to (number of tables) * (average number of columns per 
> table) rows, while incurring a small cost proportional to the number of 
> tables to store the Column Descriptors.
> Please see the latest review board for additional implementation details.
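
The row-count estimate in the description above can be worked through with a short calculation. The function names are hypothetical and the input numbers are made up for illustration; they are not measurements from any real metastore. The "after" figure also assumes one Column Descriptor per table, that is, no schema evolution.

```python
def column_rows_before(num_tables: int, num_partitions: int, avg_cols: int) -> int:
    """Rows in COLUMNS when every table and partition stores its own column list."""
    return (num_partitions + num_tables) * avg_cols


def column_rows_after(num_tables: int, avg_cols: int) -> int:
    """Rows in the columns table when partitions share their table's descriptor."""
    return num_tables * avg_cols


def descriptor_rows_after(num_tables: int) -> int:
    """Extra rows needed for the Column Descriptors themselves (one per table)."""
    return num_tables


# Illustrative, made-up sizes: 1,000 tables, 1,000,000 partitions, 20 columns each.
before = column_rows_before(1_000, 1_000_000, 20)   # 20,020,000 rows
after = column_rows_after(1_000, 20)                # 20,000 rows
overhead = descriptor_rows_after(1_000)             # 1,000 rows
```

With these inputs the column rows shrink by roughly a factor of a thousand, and the added descriptor rows are small by comparison.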

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
