Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-08-08 Thread Sohan Jain

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/
---

(Updated 2011-08-08 20:55:11.546253)


Review request for hive, Ning Zhang and Paul Yang.


Changes
---

added derby upgrade and revert-the-upgrade script


Summary
---

This patch tries to make minimal changes to the API while keeping migration 
short and somewhat easy to revert.

The new schema can be described as follows:
- CDS is a table corresponding to Column Descriptor objects.  Currently, it 
only stores a CD_ID.
- COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to the 
CD_ID to which it belongs.
- SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
key to a CD_ID which describes its columns.
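As a rough illustration of the three tables above, here is a minimal in-memory sketch. The class and field names (SchemaSketch, ColumnRow, StorageRow) are hypothetical and are not the JDO model classes from the patch; only the table names and the CD_ID foreign keys come from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory mirror of CDS, COLUMNS_V2 and SDS; not the patch's JDO model.
class SchemaSketch {
    // One row of COLUMNS_V2: a column that belongs to exactly one column descriptor.
    static final class ColumnRow {
        final long cdId;   // foreign key to CDS.CD_ID
        final String name;
        final String type;

        ColumnRow(long cdId, String name, String type) {
            this.cdId = cdId;
            this.name = name;
            this.type = type;
        }
    }

    // One row of SDS: a storage descriptor that points at a column descriptor.
    static final class StorageRow {
        final long sdId;
        final long cdId;   // foreign key to CDS.CD_ID

        StorageRow(long sdId, long cdId) {
            this.sdId = sdId;
            this.cdId = cdId;
        }
    }

    final List<Long> cds = new ArrayList<>();             // CDS: currently only a CD_ID
    final List<ColumnRow> columnsV2 = new ArrayList<>();  // COLUMNS_V2
    final List<StorageRow> sds = new ArrayList<>();       // SDS

    // The columns of a storage descriptor are found by following SDS.CD_ID.
    List<String> columnNamesOf(long sdId) {
        List<String> names = new ArrayList<>();
        for (StorageRow sd : sds) {
            if (sd.sdId != sdId) {
                continue;
            }
            for (ColumnRow column : columnsV2) {
                if (column.cdId == sd.cdId) {
                    names.add(column.name);
                }
            }
        }
        return names;
    }
}
```

Two storage descriptors that carry the same CD_ID resolve to the same column rows, which is where the deduplication comes from.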

During migration, we create Column Descriptors for tables in a straightforward 
manner: their columns are now just wrapped inside a column descriptor.  The SDS 
of partitions use their parent table's column descriptor, since currently a 
partition and its table share the same list of columns.

When altering or adding a partition, give it its parent table's column 
descriptor IF the columns they describe are the same.  Otherwise, create a new 
column descriptor for its columns.
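The partition rule can be sketched as below. This is only a model under assumptions: PartitionCdSketch, ColumnDescriptor and chooseFor are made-up names, and columns are represented as plain "name:type" strings rather than MFieldSchema objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that picks the column descriptor for a partition.
class PartitionCdSketch {
    // Stand-in for a column descriptor: an id plus ordered "name:type" entries.
    static final class ColumnDescriptor {
        final long id;
        final List<String> columns;

        ColumnDescriptor(long id, List<String> columns) {
            this.id = id;
            this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
        }
    }

    private long nextId;

    PartitionCdSketch(long firstFreeId) {
        this.nextId = firstFreeId;
    }

    // Share the table's descriptor when the column lists match (order matters);
    // otherwise allocate a fresh descriptor for the partition's own columns.
    ColumnDescriptor chooseFor(ColumnDescriptor tableCd, List<String> partitionColumns) {
        if (tableCd.columns.equals(partitionColumns)) {
            return tableCd;
        }
        return new ColumnDescriptor(nextId++, partitionColumns);
    }
}
```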

When adding or altering a table, create a new column descriptor every time.

Whenever you drop a storage descriptor (e.g., when dropping tables or 
partitions), check to see if the related column descriptor has any other 
references in the table.  That is, check to see if any other storage 
descriptors point to that column descriptor.  If none do, then delete that 
column descriptor.  This check is in place so we don't have unreferenced column 
descriptors and columns hanging around after schema evolution for tables.
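A minimal sketch of the drop-time check, assuming a simple map from storage descriptor ids to column descriptor ids. DropSdSketch and its methods are hypothetical names and do not reflect the ObjectStore code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: SDS rows as a map SD_ID -> CD_ID, CDS rows as a set of CD_IDs.
class DropSdSketch {
    final Map<Long, Long> sdToCd = new HashMap<>();
    final Set<Long> cds = new HashSet<>();

    void addStorageDescriptor(long sdId, long cdId) {
        cds.add(cdId);
        sdToCd.put(sdId, cdId);
    }

    // Drops the storage descriptor. Returns true only if its column descriptor
    // was deleted as well because no other storage descriptor still points at it.
    boolean dropStorageDescriptor(long sdId) {
        Long cdId = sdToCd.remove(sdId);
        if (cdId == null) {
            return false;   // unknown storage descriptor, nothing to do
        }
        if (sdToCd.containsValue(cdId)) {
            return false;   // still referenced, keep the column descriptor
        }
        cds.remove(cdId);   // last reference gone, remove the orphan
        return true;
    }
}
```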


This addresses bug HIVE-2246.
https://issues.apache.org/jira/browse/HIVE-2246


Diffs (updated)
---

  trunk/metastore/scripts/upgrade/derby/008-HIVE-2246.derby.sql PRE-CREATION
  trunk/metastore/scripts/upgrade/derby/008-REVERT-HIVE-2246.derby.sql PRE-CREATION
  trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
  trunk/metastore/src/model/package.jdo 1153927
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927

Diff: https://reviews.apache.org/r/1183/diff


Testing
---

Passes Facebook's regression testing and all existing test cases.  In one 
instance, before migration, the overhead involved with storage descriptors and 
columns was ~11 GB.  After migration, the overhead was ~1.5 GB.


Thanks,

Sohan



Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-08-08 Thread Sohan Jain

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/
---

(Updated 2011-08-08 21:19:06.999293)


Review request for hive, Ning Zhang and Paul Yang.


Changes
---

revised description for latest changes


Summary (updated)
---

This patch tries to make minimal changes to the API while keeping migration 
short and somewhat easy to revert.

The new schema can be described as follows:
- CDS is a table corresponding to Column Descriptor objects.  Currently, it 
only stores a CD_ID.
- COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to the 
CD_ID to which it belongs.
- SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
key to a CD_ID which describes its columns.

During migration, we create Column Descriptors for tables in a straightforward 
manner: their columns are now just wrapped inside a column descriptor.  The SDS 
of partitions use their parent table's column descriptor, since currently a 
partition and its table share the same list of columns.

When altering or adding a partition, give it its parent table's column 
descriptor IF the columns they describe are the same.  Otherwise, create a new 
column descriptor for its columns.

When creating a table, create a new column descriptor every time.  When 
altering a table, only construct a new column descriptor if the columns list 
has changed.
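To make the create/alter distinction concrete, here is a small sketch under assumed names (AlterTableSketch, Table, create, alter are all hypothetical). It only models which operations allocate a new column descriptor id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a table's metastore state; not the patch's code.
class AlterTableSketch {
    static final class Table {
        long cdId;                                          // current column descriptor
        List<String> columns = new ArrayList<>();           // "name:type" entries
        Map<String, String> properties = new HashMap<>();   // table properties
    }

    private long nextCdId = 1;

    // Creating a table always allocates a new column descriptor.
    Table create(List<String> columns) {
        Table table = new Table();
        table.cdId = nextCdId++;
        table.columns = new ArrayList<>(columns);
        return table;
    }

    // Altering a table allocates a new column descriptor only when the column
    // list differs; a property-only change keeps the existing descriptor.
    void alter(Table table, List<String> newColumns, Map<String, String> newProperties) {
        table.properties = new HashMap<>(newProperties);
        if (!table.columns.equals(newColumns)) {
            table.cdId = nextCdId++;
            table.columns = new ArrayList<>(newColumns);
        }
    }
}
```

Keeping the descriptor on property-only changes means partitions that share it stay deduplicated after such an alter.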

Whenever you drop a storage descriptor (e.g., when dropping tables or 
partitions), check to see if the related column descriptor has any other 
references in the table.  That is, check to see if any other storage 
descriptors point to that column descriptor.  If none do, then delete that 
column descriptor.  This check is in place so we don't have unreferenced column 
descriptors and columns hanging around after schema evolution for tables.


This addresses bug HIVE-2246.
https://issues.apache.org/jira/browse/HIVE-2246


Diffs
---

  trunk/metastore/scripts/upgrade/derby/008-HIVE-2246.derby.sql PRE-CREATION
  trunk/metastore/scripts/upgrade/derby/008-REVERT-HIVE-2246.derby.sql PRE-CREATION
  trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
  trunk/metastore/src/model/package.jdo 1153927
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927

Diff: https://reviews.apache.org/r/1183/diff


Testing
---

Passes Facebook's regression testing and all existing test cases.  In one 
instance, before migration, the overhead involved with storage descriptors and 
columns was ~11 GB.  After migration, the overhead was ~1.5 GB.


Thanks,

Sohan



Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-08-08 Thread Sohan Jain

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/
---

(Updated 2011-08-08 21:29:23.722825)


Review request for hive, Ning Zhang and Paul Yang.


Changes
---

Revert the changes to describe table T partition P, so that it always shows 
the table T's schema.  If a table's schema has changed, we do not support 
querying on the old partition's schema at the moment.


Summary
---

This patch tries to make minimal changes to the API while keeping migration 
short and somewhat easy to revert.

The new schema can be described as follows:
- CDS is a table corresponding to Column Descriptor objects.  Currently, it 
only stores a CD_ID.
- COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to the 
CD_ID to which it belongs.
- SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
key to a CD_ID which describes its columns.

During migration, we create Column Descriptors for tables in a straightforward 
manner: their columns are now just wrapped inside a column descriptor.  The SDS 
of partitions use their parent table's column descriptor, since currently a 
partition and its table share the same list of columns.

When altering or adding a partition, give it its parent table's column 
descriptor IF the columns they describe are the same.  Otherwise, create a new 
column descriptor for its columns.

When creating a table, create a new column descriptor every time.  When 
altering a table, only construct a new column descriptor if the columns list 
has changed.

Whenever you drop a storage descriptor (e.g., when dropping tables or 
partitions), check to see if the related column descriptor has any other 
references in the table.  That is, check to see if any other storage 
descriptors point to that column descriptor.  If none do, then delete that 
column descriptor.  This check is in place so we don't have unreferenced column 
descriptors and columns hanging around after schema evolution for tables.


This addresses bug HIVE-2246.
https://issues.apache.org/jira/browse/HIVE-2246


Diffs (updated)
---

  trunk/metastore/scripts/upgrade/derby/008-HIVE-2246.derby.sql PRE-CREATION
  trunk/metastore/scripts/upgrade/derby/008-REVERT-HIVE-2246.derby.sql PRE-CREATION
  trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
  trunk/metastore/src/model/package.jdo 1153927

Diff: https://reviews.apache.org/r/1183/diff


Testing
---

Passes Facebook's regression testing and all existing test cases.  In one 
instance, before migration, the overhead involved with storage descriptors and 
columns was ~11 GB.  After migration, the overhead was ~1.5 GB.


Thanks,

Sohan



Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-08-05 Thread Sohan Jain


 On 2011-07-25 06:46:04, Ning Zhang wrote:
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java, 
  line 1752
  https://reviews.apache.org/r/1183/diff/2/?file=26825#file26825line1752
 
  here do you check if the 'alter table' command changes the schema 
  (columns definition)? If it just sets a table property, then you don't need 
  to create a new ColumnDescriptor, right?
  
  Also if a table's schema got changed, a new CD will be created, but the 
  old partition will still have the old CDs. When we query the old partition, 
  do we use the old partition's CD or the table's CD?
  
  Also in the above case, when you run 'desc table partition 
  old_partition', do you return the old partition's CD or the table's CD?
 
 Sohan Jain wrote:
 Good point; I should check whether the table columns have changed; I do 
 this already when altering partitions.  I added that in the next diff.
 
 If a table's schema changes, it does not update existing partition CDs.  
 If we ever grab the partition object after the schema change, it will refer 
 to its old CD, not the table's CD.  However, when querying tables on the CLI, 
 we almost always use the table's set of columns.  E.g., if you did:
  create table test (a string) partitioned by (p1 string, p2 string);
  alter table test add partition(p1=1, p2=1);
  -- populate the p1=1, p2=1 partition with some data now
  alter table test add columns (b string);
  select * from test where p1 = 1 and p2 = 1;
 
 it'd use the table's latest schema; i.e., return the column 'a's values 
 and the column 'b' as all NULL.
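The NULL-filling effect described above can be modeled loosely as follows. This is only an analogy with made-up names (ReadWithTableSchemaSketch, project); it does not show how Hive's read path actually works. It just illustrates reading an old row through the table's newer column list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model: a stored row is a map from column name to value,
// and a query projects it onto the table's current column list.
class ReadWithTableSchemaSketch {
    static List<String> project(List<String> tableColumns, Map<String, String> storedRow) {
        List<String> result = new ArrayList<>();
        for (String column : tableColumns) {
            // Columns added after the row was written are absent, so they read as null.
            result.add(storedRow.get(column));
        }
        return result;
    }
}
```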

Also, I fixed the desc table partition to use the partition's column schema, 
not the table's.


- Sohan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/#review1176
---


On 2011-07-22 05:30:29, Sohan Jain wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/1183/
 ---
 
 (Updated 2011-07-22 05:30:29)
 
 
 Review request for hive, Ning Zhang and Paul Yang.
 
 
 Summary
 ---
 
 This patch tries to make minimal changes to the API while keeping migration 
 short and somewhat easy to revert.
 
 The new schema can be described as follows:
 - CDS is a table corresponding to Column Descriptor objects.  Currently, it 
 only stores a CD_ID.
 - COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
 Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to 
 the CD_ID to which it belongs.
 - SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
 key to a CD_ID which describes its columns.
 
 During migration, we create Column Descriptors for tables in a 
 straightforward manner: their columns are now just wrapped inside a column 
 descriptor.  The SDS of partitions use their parent table's column 
 descriptor, since currently a partition and its table share the same list of 
 columns.
 
 When altering or adding a partition, give it its parent table's column 
 descriptor IF the columns they describe are the same.  Otherwise, create a 
 new column descriptor for its columns.
 
 When adding or altering a table, create a new column descriptor every time.
 
 Whenever you drop a storage descriptor (e.g., when dropping tables or 
 partitions), check to see if the related column descriptor has any other 
 references in the table.  That is, check to see if any other storage 
 descriptors point to that column descriptor.  If none do, then delete that 
 column descriptor.  This check is in place so we don't have unreferenced 
 column descriptors and columns hanging around after schema evolution for 
 tables.
 
 
 This addresses bug HIVE-2246.
 https://issues.apache.org/jira/browse/HIVE-2246
 
 
 Diffs
 ---
 
   trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
   trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1148945
   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1148945
   trunk/metastore/src/model/package.jdo 1148945
 
 Diff: https://reviews.apache.org/r/1183/diff
 
 
 Testing
 ---
 
 Passes Facebook's regression testing and all existing test cases.  In one 
 instance, before migration, the overhead involved with storage descriptors 
 and columns was ~11 GB.  After migration, the overhead was ~1.5 GB.
 
 
 Thanks,
 
 Sohan
 




Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-08-05 Thread Sohan Jain

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/
---

(Updated 2011-08-05 20:48:05.144312)


Review request for hive, Ning Zhang and Paul Yang.


Changes
---

-On alter table, only change the column descriptor if the columns have changed.
-Fix desc table partition... to use the partition's column schema, not the 
table's


Summary
---

This patch tries to make minimal changes to the API while keeping migration 
short and somewhat easy to revert.

The new schema can be described as follows:
- CDS is a table corresponding to Column Descriptor objects.  Currently, it 
only stores a CD_ID.
- COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to the 
CD_ID to which it belongs.
- SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
key to a CD_ID which describes its columns.

During migration, we create Column Descriptors for tables in a straightforward 
manner: their columns are now just wrapped inside a column descriptor.  The SDS 
of partitions use their parent table's column descriptor, since currently a 
partition and its table share the same list of columns.

When altering or adding a partition, give it its parent table's column 
descriptor IF the columns they describe are the same.  Otherwise, create a new 
column descriptor for its columns.

When adding or altering a table, create a new column descriptor every time.

Whenever you drop a storage descriptor (e.g., when dropping tables or 
partitions), check to see if the related column descriptor has any other 
references in the table.  That is, check to see if any other storage 
descriptors point to that column descriptor.  If none do, then delete that 
column descriptor.  This check is in place so we don't have unreferenced column 
descriptors and columns hanging around after schema evolution for tables.


This addresses bug HIVE-2246.
https://issues.apache.org/jira/browse/HIVE-2246


Diffs (updated)
---

  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
  trunk/metastore/src/model/package.jdo 1153927
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927

Diff: https://reviews.apache.org/r/1183/diff


Testing
---

Passes Facebook's regression testing and all existing test cases.  In one 
instance, before migration, the overhead involved with storage descriptors and 
columns was ~11 GB.  After migration, the overhead was ~1.5 GB.


Thanks,

Sohan



Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-08-05 Thread Sohan Jain

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/
---

(Updated 2011-08-05 20:49:19.127572)


Review request for hive, Ning Zhang and Paul Yang.


Changes
---

-Forgot to add a few files.  NOTE: this is only a temporary diff; I need to add 
derby and postgres migration scripts.


Summary
---

This patch tries to make minimal changes to the API while keeping migration 
short and somewhat easy to revert.

The new schema can be described as follows:
- CDS is a table corresponding to Column Descriptor objects.  Currently, it 
only stores a CD_ID.
- COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to the 
CD_ID to which it belongs.
- SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
key to a CD_ID which describes its columns.

During migration, we create Column Descriptors for tables in a straightforward 
manner: their columns are now just wrapped inside a column descriptor.  The SDS 
of partitions use their parent table's column descriptor, since currently a 
partition and its table share the same list of columns.

When altering or adding a partition, give it its parent table's column 
descriptor IF the columns they describe are the same.  Otherwise, create a new 
column descriptor for its columns.

When adding or altering a table, create a new column descriptor every time.

Whenever you drop a storage descriptor (e.g., when dropping tables or 
partitions), check to see if the related column descriptor has any other 
references in the table.  That is, check to see if any other storage 
descriptors point to that column descriptor.  If none do, then delete that 
column descriptor.  This check is in place so we don't have unreferenced column 
descriptors and columns hanging around after schema evolution for tables.


This addresses bug HIVE-2246.
https://issues.apache.org/jira/browse/HIVE-2246


Diffs (updated)
---

  trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
  trunk/metastore/src/model/package.jdo 1153927
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927

Diff: https://reviews.apache.org/r/1183/diff


Testing
---

Passes Facebook's regression testing and all existing test cases.  In one 
instance, before migration, the overhead involved with storage descriptors and 
columns was ~11 GB.  After migration, the overhead was ~1.5 GB.


Thanks,

Sohan



Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-08-05 Thread Paul Yang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/#review1309
---


Also, can you add migration scripts for other DBs?


trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
https://reviews.apache.org/r/1183/#comment2982

Typo



trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
https://reviews.apache.org/r/1183/#comment2979

The check and the delete should be in the same transaction, as it's possible 
for a reference to a CD to be created after the check but before the delete.



trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
https://reviews.apache.org/r/1183/#comment2981

How does this drop the storage descriptor?



trunk/metastore/src/model/package.jdo
https://reviews.apache.org/r/1183/#comment2968

Fix indent
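On the ObjectStore.java comment about keeping the check and the delete together: the sketch below shows the idea with a JVM lock standing in for the database transaction. That substitution is an assumption made for illustration, and CdRefCountSketch with its methods is a hypothetical name, not code from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reference tracker: CD_ID -> number of storage descriptors using it.
class CdRefCountSketch {
    private final Map<Long, Integer> refCounts = new HashMap<>();
    private final Object lock = new Object();   // stand-in for one metastore transaction

    void addReference(long cdId) {
        synchronized (lock) {
            refCounts.merge(cdId, 1, Integer::sum);
        }
    }

    // The "is anyone else using it?" check and the delete happen under the same
    // lock, so no new reference can appear between the two steps.
    // Returns true when the column descriptor was deleted.
    boolean releaseReference(long cdId) {
        synchronized (lock) {
            Integer count = refCounts.get(cdId);
            if (count == null) {
                return false;
            }
            if (count > 1) {
                refCounts.put(cdId, count - 1);
                return false;
            }
            refCounts.remove(cdId);
            return true;
        }
    }

    boolean exists(long cdId) {
        synchronized (lock) {
            return refCounts.containsKey(cdId);
        }
    }
}
```

If the check and the delete ran under separate locks (or separate transactions), addReference could run in between and the delete would remove a descriptor that is in use again.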


- Paul


On 2011-08-05 20:49:19, Sohan Jain wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/1183/
 ---
 
 (Updated 2011-08-05 20:49:19)
 
 
 Review request for hive, Ning Zhang and Paul Yang.
 
 
 Summary
 ---
 
 This patch tries to make minimal changes to the API while keeping migration 
 short and somewhat easy to revert.
 
 The new schema can be described as follows:
 - CDS is a table corresponding to Column Descriptor objects.  Currently, it 
 only stores a CD_ID.
 - COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
 Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to 
 the CD_ID to which it belongs.
 - SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
 key to a CD_ID which describes its columns.
 
 During migration, we create Column Descriptors for tables in a 
 straightforward manner: their columns are now just wrapped inside a column 
 descriptor.  The SDS of partitions use their parent table's column 
 descriptor, since currently a partition and its table share the same list of 
 columns.
 
 When altering or adding a partition, give it its parent table's column 
 descriptor IF the columns they describe are the same.  Otherwise, create a 
 new column descriptor for its columns.
 
 When adding or altering a table, create a new column descriptor every time.
 
 Whenever you drop a storage descriptor (e.g., when dropping tables or 
 partitions), check to see if the related column descriptor has any other 
 references in the table.  That is, check to see if any other storage 
 descriptors point to that column descriptor.  If none do, then delete that 
 column descriptor.  This check is in place so we don't have unreferenced 
 column descriptors and columns hanging around after schema evolution for 
 tables.
 
 
 This addresses bug HIVE-2246.
 https://issues.apache.org/jira/browse/HIVE-2246
 
 
 Diffs
 ---
 
   trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
   trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
   trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
   trunk/metastore/src/model/package.jdo 1153927
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
   trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927
 
 Diff: https://reviews.apache.org/r/1183/diff
 
 
 Testing
 ---
 
 Passes Facebook's regression testing and all existing test cases.  In one 
 instance, before migration, the overhead involved with storage descriptors 
 and columns was ~11 GB.  After migration, the overhead was ~1.5 GB.
 
 
 Thanks,
 
 Sohan
 




Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-08-05 Thread Sohan Jain

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/
---

(Updated 2011-08-06 01:40:49.118616)


Review request for hive, Ning Zhang and Paul Yang.


Changes
---

-made listStorageDescriptors.. into one transaction
-renamed dropStorageDescriptorCleanly to make its functionality clearer
-indents and typo


Summary
---

This patch tries to make minimal changes to the API while keeping migration 
short and somewhat easy to revert.

The new schema can be described as follows:
- CDS is a table corresponding to Column Descriptor objects.  Currently, it 
only stores a CD_ID.
- COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to the 
CD_ID to which it belongs.
- SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
key to a CD_ID which describes its columns.

During migration, we create Column Descriptors for tables in a straightforward 
manner: their columns are now just wrapped inside a column descriptor.  The SDS 
of partitions use their parent table's column descriptor, since currently a 
partition and its table share the same list of columns.

When altering or adding a partition, give it its parent table's column 
descriptor IF the columns they describe are the same.  Otherwise, create a new 
column descriptor for its columns.

When adding or altering a table, create a new column descriptor every time.

Whenever you drop a storage descriptor (e.g., when dropping tables or 
partitions), check to see if the related column descriptor has any other 
references in the table.  That is, check to see if any other storage 
descriptors point to that column descriptor.  If none do, then delete that 
column descriptor.  This check is in place so we don't have unreferenced column 
descriptors and columns hanging around after schema evolution for tables.


This addresses bug HIVE-2246.
https://issues.apache.org/jira/browse/HIVE-2246


Diffs (updated)
---

  trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
  trunk/metastore/src/model/package.jdo 1153927
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927

Diff: https://reviews.apache.org/r/1183/diff


Testing
---

Passes Facebook's regression testing and all existing test cases.  In one 
instance, before migration, the overhead involved with storage descriptors and 
columns was ~11 GB.  After migration, the overhead was ~1.5 GB.


Thanks,

Sohan



Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-08-05 Thread Sohan Jain

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/#review1313
---



trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
https://reviews.apache.org/r/1183/#comment2984

should read 1-N actually


- Sohan


On 2011-08-06 01:40:49, Sohan Jain wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/1183/
 ---
 
 (Updated 2011-08-06 01:40:49)
 
 
 Review request for hive, Ning Zhang and Paul Yang.
 
 
 Summary
 ---
 
 This patch tries to make minimal changes to the API while keeping migration 
 short and somewhat easy to revert.
 
 The new schema can be described as follows:
 - CDS is a table corresponding to Column Descriptor objects.  Currently, it 
 only stores a CD_ID.
 - COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
 Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to 
 the CD_ID to which it belongs.
 - SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
 key to a CD_ID which describes its columns.
 
 During migration, we create Column Descriptors for tables in a 
 straightforward manner: their columns are now just wrapped inside a column 
 descriptor.  The SDS of partitions use their parent table's column 
 descriptor, since currently a partition and its table share the same list of 
 columns.
 
 When altering or adding a partition, give it its parent table's column 
 descriptor IF the columns they describe are the same.  Otherwise, create a 
 new column descriptor for its columns.
 
 When adding or altering a table, create a new column descriptor every time.
 
 Whenever you drop a storage descriptor (e.g., when dropping tables or 
 partitions), check to see if the related column descriptor has any other 
 references in the table.  That is, check to see if any other storage 
 descriptors point to that column descriptor.  If none do, then delete that 
 column descriptor.  This check is in place so we don't have unreferenced 
 column descriptors and columns hanging around after schema evolution for 
 tables.
 
 
 This addresses bug HIVE-2246.
 https://issues.apache.org/jira/browse/HIVE-2246
 
 
 Diffs
 ---
 
   trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
   trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
   trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
   trunk/metastore/src/model/package.jdo 1153927
   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
   trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927
 
 Diff: https://reviews.apache.org/r/1183/diff
 
 
 Testing
 ---
 
 Passes Facebook's regression testing and all existing test cases.  In one 
 instance, before migration, the overhead involved with storage descriptors 
 and columns was ~11 GB.  After migration, the overhead was ~1.5 GB.
 
 
 Thanks,
 
 Sohan
 




Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-07-25 Thread Ning Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/#review1176
---



trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
https://reviews.apache.org/r/1183/#comment2467

is the CHARSET (latin1) the same as SDS? This will require the user's 
comments to be in latin1 which prevents UTF chars.



trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
https://reviews.apache.org/r/1183/#comment2466

can you also add migration script for derby? we support derby as a default 
metastore RDBMS as well. 



trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
https://reviews.apache.org/r/1183/#comment2468

here do you check if the 'alter table' command changes the schema (columns 
definition)? If it just sets a table property, then you don't need to create a 
new ColumnDescriptor, right?

Also if a table's schema got changed, a new CD will be created, but the old 
partition will still have the old CDs. When we query the old partition, do we 
use the old partition's CD or the table's CD? 

Also in the above case, when you run 'desc table partition 
old_partition', do you return the old partition's CD or the table's CD? 


- Ning


On 2011-07-22 05:30:29, Sohan Jain wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/1183/
 ---
 
 (Updated 2011-07-22 05:30:29)
 
 
 Review request for hive, Ning Zhang and Paul Yang.
 
 
 Summary
 ---
 
 This patch tries to make minimal changes to the API while keeping migration 
 short and somewhat easy to revert.
 
 The new schema can be described as follows:
 - CDS is a table corresponding to Column Descriptor objects.  Currently, it 
 only stores a CD_ID.
 - COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
 Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to 
 the CD_ID to which it belongs.
 - SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
 key to a CD_ID which describes its columns.
 
 During migration, we create Column Descriptors for tables in a 
 straightforward manner: their columns are now just wrapped inside a column 
 descriptor.  The SDS of partitions use their parent table's column 
 descriptor, since currently a partition and its table share the same list of 
 columns.
 
 When altering or adding a partition, give it its parent table's column 
 descriptor IF the columns they describe are the same.  Otherwise, create a 
 new column descriptor for its columns.
 
 When adding or altering a table, create a new column descriptor every time.
 
 Whenever you drop a storage descriptor (e.g., when dropping tables or 
 partitions), check to see if the related column descriptor has any other 
 references in the table.  That is, check to see if any other storage 
 descriptors point to that column descriptor.  If none do, then delete that 
 column descriptor.  This check is in place so we don't have unreferenced 
 column descriptors and columns hanging around after schema evolution for 
 tables.
 
 
 This addresses bug HIVE-2246.
 https://issues.apache.org/jira/browse/HIVE-2246
 
 
 Diffs
 ---
 
   trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
   trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1148945
   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
   trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1148945
   trunk/metastore/src/model/package.jdo 1148945
 
 Diff: https://reviews.apache.org/r/1183/diff
 
 
 Testing
 ---
 
 Passes Facebook's regression testing and all existing test cases.  In one 
 instance, before migration, the overhead involved with storage descriptors 
 and columns was ~11 GB.  After migration, the overhead was ~1.5 GB.
 
 
 Thanks,
 
 Sohan
 




Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-07-21 Thread Sohan Jain

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/
---

Review request for hive, Ning Zhang and Paul Yang.


Summary
---

This patch tries to make minimal changes to the API while keeping migration 
short and somewhat easy to revert.

The new schema can be described as follows:
- CDS is a table corresponding to Column Descriptor objects.  Currently, it 
only stores a CD_ID.
- COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to the 
CD_ID to which it belongs.
- SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
key to a CD_ID which describes its columns.
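
A minimal sketch of the three tables described above, using Python's sqlite3 
module with an in-memory database. Only the table names CDS, COLUMNS_V2 and SDS 
and the CD_ID key come from this description; the remaining column names 
(SD_ID, COLUMN_NAME, TYPE_NAME, INTEGER_IDX, LOCATION) and all types are 
assumptions for illustration and are not copied from the upgrade script.

```python
import sqlite3

def create_schema(conn):
    """Create a simplified CDS / COLUMNS_V2 / SDS layout (illustrative only)."""
    conn.executescript("""
        CREATE TABLE CDS (
            CD_ID INTEGER PRIMARY KEY           -- currently the only field
        );
        CREATE TABLE COLUMNS_V2 (
            CD_ID       INTEGER NOT NULL REFERENCES CDS(CD_ID),
            COLUMN_NAME TEXT    NOT NULL,       -- assumed column name
            TYPE_NAME   TEXT    NOT NULL,       -- assumed column name
            INTEGER_IDX INTEGER NOT NULL,       -- assumed: position in the list
            PRIMARY KEY (CD_ID, COLUMN_NAME)
        );
        CREATE TABLE SDS (
            SD_ID    INTEGER PRIMARY KEY,       -- assumed column name
            CD_ID    INTEGER REFERENCES CDS(CD_ID),
            LOCATION TEXT                       -- assumed column name
        );
    """)

conn = sqlite3.connect(":memory:")
create_schema(conn)

# One column descriptor holding two columns.
conn.execute("INSERT INTO CDS (CD_ID) VALUES (1)")
conn.executemany(
    "INSERT INTO COLUMNS_V2 VALUES (1, ?, ?, ?)",
    [("id", "bigint", 0), ("name", "string", 1)],
)

# A table SD and two partition SDs all point at column descriptor 1.
conn.executemany(
    "INSERT INTO SDS (SD_ID, CD_ID, LOCATION) VALUES (?, 1, ?)",
    [(10, "/t"), (11, "/t/ds=1"), (12, "/t/ds=2")],
)

column_rows = conn.execute("SELECT COUNT(*) FROM COLUMNS_V2").fetchone()[0]
sd_rows = conn.execute("SELECT COUNT(*) FROM SDS").fetchone()[0]
```

In this toy data set, three storage descriptors share a single list of two 
columns, so the column rows are stored once rather than three times.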

During migration, we create Column Descriptors for tables in a straightforward 
manner: their columns are now just wrapped inside a column descriptor.  The SDS 
of partitions use their parent table's column descriptor, since currently a 
partition and its table share the same list of columns.
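
The migration step above could be sketched in plain Python as follows. This is 
not the upgrade script itself: the function name and the dictionary layout 
(table_sds, partition_parent, old_columns) are hypothetical stand-ins for the 
metastore tables, chosen only to show the wrapping and sharing logic.

```python
from itertools import count

def migrate(table_sds, partition_parent, old_columns):
    """Wrap each table's columns in one column descriptor and share it.

    table_sds:        list of SD ids that belong to tables
    partition_parent: dict mapping a partition's SD id to its table's SD id
    old_columns:      dict mapping every SD id to its list of (name, type)
    Returns (cds, sd_to_cd): the columns per CD id, and the CD id of each SD.
    """
    next_cd_id = count(1)
    cds = {}
    sd_to_cd = {}
    # Each table gets a fresh column descriptor wrapping its existing columns.
    for sd_id in table_sds:
        cd_id = next(next_cd_id)
        cds[cd_id] = list(old_columns[sd_id])
        sd_to_cd[sd_id] = cd_id
    # Partitions reuse the parent table's descriptor instead of their own copy.
    for part_sd, table_sd in partition_parent.items():
        sd_to_cd[part_sd] = sd_to_cd[table_sd]
    return cds, sd_to_cd

cols = [("id", "bigint"), ("name", "string")]
cds, sd_to_cd = migrate(
    table_sds=[10],
    partition_parent={11: 10, 12: 10},
    old_columns={10: cols, 11: cols, 12: cols},
)
```

The sketch relies on the same assumption as the text: at migration time a 
partition's column list is identical to its table's, so the partition copies 
can be discarded.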

When altering or adding a partition, give it its parent table's column 
descriptor IF the columns they describe are the same.  Otherwise, create a new 
column descriptor for its columns.

When adding or altering a table, create a new column descriptor every time.

Whenever you drop a storage descriptor (e.g., when dropping tables or 
partitions), check to see if the related column descriptor has any other 
references in the table.  That is, check to see if any other storage 
descriptors point to that column descriptor.  If none do, then delete that 
column descriptor.  This check is in place so we don't have unreferenced column 
descriptors and columns hanging around after schema evolution for tables.


This addresses bug HIVE-2246.
https://issues.apache.org/jira/browse/HIVE-2246


Diffs
---

  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1148945
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1148945
  trunk/metastore/src/model/package.jdo 1148945

Diff: https://reviews.apache.org/r/1183/diff


Testing
---

Passes Facebook's regression testing and all existing test cases.  In one 
instance, before migration, the overhead involved with storage descriptors and 
columns was ~11 GB.  After migration, the overhead was ~1.5 GB.


Thanks,

Sohan



Re: Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-07-21 Thread Sohan Jain

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1183/
---

(Updated 2011-07-22 05:30:29.026246)


Review request for hive, Ning Zhang and Paul Yang.


Changes
---

Adding some files I missed in the last diff.


Summary
---

This patch tries to make minimal changes to the API while keeping migration 
short and somewhat easy to revert.

The new schema can be described as follows:
- CDS is a table corresponding to Column Descriptor objects.  Currently, it 
only stores a CD_ID.
- COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns.  A 
Column Descriptor holds a list of columns.  COLUMNS_V2 has a foreign key to the 
CD_ID to which it belongs.
- SDS was modified to reference a Column Descriptor. So SDS now has a foreign 
key to a CD_ID which describes its columns.

During migration, we create Column Descriptors for tables in a straightforward 
manner: their columns are now just wrapped inside a column descriptor.  The SDS 
of partitions use their parent table's column descriptor, since currently a 
partition and its table share the same list of columns.

When altering or adding a partition, give it its parent table's column 
descriptor IF the columns they describe are the same.  Otherwise, create a new 
column descriptor for its columns.
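
As a rough illustration of the partition rule above, the following Python 
sketch picks a column descriptor for a partition. The function name, the id 
generator and the in-memory dict are hypothetical, and "the same" is assumed 
here to mean the same (name, type) pairs in the same order; the patch may 
compare columns differently.

```python
from itertools import count

_next_cd_id = count(100)  # hypothetical id generator for new descriptors

def cd_for_partition(partition_cols, table_cd_id, cds):
    """Return the CD id that a partition's storage descriptor should reference.

    cds maps a CD id to its ordered list of (name, type) columns.
    """
    if list(partition_cols) == cds[table_cd_id]:
        # Identical columns: share the parent table's descriptor.
        return table_cd_id
    # Different columns: create a new descriptor just for this partition.
    new_id = next(_next_cd_id)
    cds[new_id] = list(partition_cols)
    return new_id

cds = {1: [("id", "bigint"), ("name", "string")]}
same = cd_for_partition([("id", "bigint"), ("name", "string")], 1, cds)
different = cd_for_partition([("id", "bigint")], 1, cds)
```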

When adding or altering a table, create a new column descriptor every time.

Whenever you drop a storage descriptor (e.g., when dropping tables or 
partitions), check to see if the related column descriptor has any other 
references in the table.  That is, check to see if any other storage 
descriptors point to that column descriptor.  If none do, then delete that 
column descriptor.  This check is in place so we don't have unreferenced column 
descriptors and columns hanging around after schema evolution for tables.
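
The reference check on drop might look roughly like this in Python. The two 
dicts stand in for the SDS and CDS tables and the function name is made up for 
this example; the actual check lives in the metastore code and is not 
reproduced here.

```python
def drop_storage_descriptor(sd_id, sd_to_cd, cds):
    """Remove an SD; delete its CD too if no other SD still references it.

    sd_to_cd maps each SD id to the CD id it points at.
    cds maps each CD id to its list of columns.
    Returns True if the column descriptor was deleted as well.
    """
    cd_id = sd_to_cd.pop(sd_id)
    still_referenced = cd_id in sd_to_cd.values()
    if not still_referenced:
        # Last reference is gone, so the descriptor and its columns go too.
        del cds[cd_id]
    return not still_referenced

# CD 1 is an old schema kept alive only by partition SD 11;
# CD 2 is the current schema used by table SD 10 and partition SD 12.
cds = {1: [("id", "bigint")], 2: [("id", "bigint"), ("name", "string")]}
sd_to_cd = {10: 2, 11: 1, 12: 2}

removed_after_11 = drop_storage_descriptor(11, sd_to_cd, cds)
removed_after_12 = drop_storage_descriptor(12, sd_to_cd, cds)
```

Dropping SD 11 removes the last reference to the old descriptor, so it is 
deleted; dropping SD 12 leaves the current descriptor in place because the 
table's SD still points at it.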


This addresses bug HIVE-2246.
https://issues.apache.org/jira/browse/HIVE-2246


Diffs (updated)
---

  trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1148945
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1148945
  trunk/metastore/src/model/package.jdo 1148945

Diff: https://reviews.apache.org/r/1183/diff


Testing
---

Passes Facebook's regression testing and all existing test cases.  In one 
instance, before migration, the overhead involved with storage descriptors and 
columns was ~11 GB.  After migration, the overhead was ~1.5 GB.


Thanks,

Sohan



Review Request: HIVE-2246: Dedupe tables' column schemas from partitions in the metastore db

2011-06-30 Thread Sohan Jain

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/985/
---

Review request for hive.


Summary
---

We can re-organize the JDO models to reduce space usage to keep the metastore 
scalable for the future. Currently, partitions are the fastest growing objects 
in the metastore, and the metastore keeps a separate copy of the columns list 
for each partition. We can normalize the metastore db by decoupling Columns 
from Storage Descriptors and not storing duplicate lists of the columns for 
each partition.

One idea is to create an additional level of indirection with a Column 
Descriptor that has a list of columns. A table has a reference to its latest 
Column Descriptor (note: a table may have more than one Column Descriptor in 
the case of schema evolution). Partitions and Indexes can reference the same 
Column Descriptors as their parent table.

Currently, the COLUMNS table in the metastore has roughly (number of partitions 
+ number of tables) * (average number of columns per table) rows. We can reduce 
this to (number of tables) * (average number of columns per table) rows, while 
incurring a small cost proportional to the number of tables to store the Column 
Descriptors.
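
To make the row-count estimate concrete, here is a small Python calculation. 
The input numbers (1,000 tables, 1,000,000 partitions, 20 columns per table) 
are invented for illustration and are not measurements from any real 
metastore; the sketch also assumes one Column Descriptor per table, i.e. no 
schema evolution.

```python
def columns_rows_before(num_tables, num_partitions, avg_columns):
    """Rows in COLUMNS when every table and partition keeps its own copy."""
    return (num_partitions + num_tables) * avg_columns

def columns_rows_after(num_tables, avg_columns):
    """Rows needed when partitions share their table's column list."""
    return num_tables * avg_columns

def column_descriptor_rows(num_tables):
    """Extra rows for the descriptors themselves (one per table, assumed)."""
    return num_tables

before = columns_rows_before(1_000, 1_000_000, 20)
after = columns_rows_after(1_000, 20)
extra = column_descriptor_rows(1_000)
```

With these made-up inputs the column rows drop from 20,020,000 to 20,000, at 
the cost of 1,000 descriptor rows.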


This addresses bug HIVE-2246.
https://issues.apache.org/jira/browse/HIVE-2246


Diffs
---

  trunk/metastore/if/hive_metastore.thrift 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MDatabase.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MFieldSchema.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MIndex.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartition.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MTable.java 1140399
  trunk/metastore/src/model/package.jdo 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/TableBasedIndexHandler.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/bitmap/BitmapIndexHandler.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java 1140399

Diff: https://reviews.apache.org/r/985/diff


Testing
---

Haven't run any unit tests yet, just qualitative testing so far.


Thanks,

Sohan