[jira] [Updated] (HIVE-2246) Dedupe tables' column schemas from partitions in the metastore db

2011-08-08 Thread Sohan Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sohan Jain updated HIVE-2246:
-

Attachment: HIVE-2246.8.patch

 Dedupe tables' column schemas from partitions in the metastore db
 -

 Key: HIVE-2246
 URL: https://issues.apache.org/jira/browse/HIVE-2246
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Reporter: Sohan Jain
Assignee: Sohan Jain
 Attachments: HIVE-2246.2.patch, HIVE-2246.3.patch, HIVE-2246.4.patch, 
 HIVE-2246.8.patch


 Note: this patch proposes a schema change, and is therefore incompatible with 
 the current metastore.
 We can re-organize the JDO models to reduce space usage to keep the metastore 
 scalable for the future.  Currently, partitions are the fastest growing 
 objects in the metastore, and the metastore keeps a separate copy of the 
 columns list for each partition.  We can normalize the metastore db by 
 decoupling Columns from Storage Descriptors and not storing duplicate lists 
 of the columns for each partition. 
 An idea is to create an additional level of indirection with a Column 
 Descriptor that has a list of columns.  A table has a reference to its 
 latest Column Descriptor (note: a table may have more than one Column 
 Descriptor in the case of schema evolution).  Partitions and Indexes can 
 reference the same Column Descriptors as their parent table.
 Currently, the COLUMNS table in the metastore has roughly (number of 
 partitions + number of tables) * (average number of columns pertable) rows.  
 We can reduce this to (number of tables) * (average number of columns per 
 table) rows, while incurring a small cost proportional to the number of 
 tables to store the Column Descriptors.
 Please see the latest review board for additional implementation details.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HIVE-2246) Dedupe tables' column schemas from partitions in the metastore db

2011-08-05 Thread Sohan Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sohan Jain updated HIVE-2246:
-

Attachment: HIVE-2246.4.patch

 Dedupe tables' column schemas from partitions in the metastore db
 -

 Key: HIVE-2246
 URL: https://issues.apache.org/jira/browse/HIVE-2246
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Reporter: Sohan Jain
Assignee: Sohan Jain
 Attachments: HIVE-2246.2.patch, HIVE-2246.3.patch, HIVE-2246.4.patch


 Note: this patch proposes a schema change, and is therefore incompatible with 
 the current metastore.
 We can re-organize the JDO models to reduce space usage to keep the metastore 
 scalable for the future.  Currently, partitions are the fastest growing 
 objects in the metastore, and the metastore keeps a separate copy of the 
 columns list for each partition.  We can normalize the metastore db by 
 decoupling Columns from Storage Descriptors and not storing duplicate lists 
 of the columns for each partition. 
 An idea is to create an additional level of indirection with a Column 
 Descriptor that has a list of columns.  A table has a reference to its 
 latest Column Descriptor (note: a table may have more than one Column 
 Descriptor in the case of schema evolution).  Partitions and Indexes can 
 reference the same Column Descriptors as their parent table.
 Currently, the COLUMNS table in the metastore has roughly (number of 
 partitions + number of tables) * (average number of columns pertable) rows.  
 We can reduce this to (number of tables) * (average number of columns per 
 table) rows, while incurring a small cost proportional to the number of 
 tables to store the Column Descriptors.
 Please see the latest review board for additional implementation details.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HIVE-2246) Dedupe tables' column schemas from partitions in the metastore db

2011-07-21 Thread Sohan Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sohan Jain updated HIVE-2246:
-

Description: 
Note: this patch proposes a schema change, and is therefore incompatible with 
the current metastore.

We can re-organize the JDO models to reduce space usage to keep the metastore 
scalable for the future.  Currently, partitions are the fastest growing objects 
in the metastore, and the metastore keeps a separate copy of the columns list 
for each partition.  We can normalize the metastore db by decoupling Columns 
from Storage Descriptors and not storing duplicate lists of the columns for 
each partition. 

An idea is to create an additional level of indirection with a Column 
Descriptor that has a list of columns.  A table has a reference to its latest 
Column Descriptor (note: a table may have more than one Column Descriptor in 
the case of schema evolution).  Partitions and Indexes can reference the same 
Column Descriptors as their parent table.

Currently, the COLUMNS table in the metastore has roughly (number of partitions 
+ number of tables) * (average number of columns pertable) rows.  We can reduce 
this to (number of tables) * (average number of columns per table) rows, while 
incurring a small cost proportional to the number of tables to store the Column 
Descriptors.

Please see the latest review board for additional implementation details.

  was:
We can re-organize the JDO models to reduce space usage to keep the metastore 
scalable for the future.  Currently, partitions are the fastest growing objects 
in the metastore, and the metastore keeps a separate copy of the columns list 
for each partition.  We can normalize the metastore db by decoupling Columns 
from Storage Descriptors and not storing duplicate lists of the columns for 
each partition. 

An idea is to create an additional level of indirection with a Column 
Descriptor that has a list of columns.  A table has a reference to its latest 
Column Descriptor (note: a table may have more than one Column Descriptor in 
the case of schema evolution).  Partitions and Indexes can reference the same 
Column Descriptors as their parent table.

Currently, the COLUMNS table in the metastore has roughly (number of partitions 
+ number of tables) * (average number of columns pertable) rows.  We can reduce 
this to (number of tables) * (average number of columns per table) rows, while 
incurring a small cost proportional to the number of tables to store the Column 
Descriptors.

   Tags: metastore, schema, JDO

 Dedupe tables' column schemas from partitions in the metastore db
 -

 Key: HIVE-2246
 URL: https://issues.apache.org/jira/browse/HIVE-2246
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Reporter: Sohan Jain
Assignee: Sohan Jain
 Attachments: HIVE-2246.2.patch


 Note: this patch proposes a schema change, and is therefore incompatible with 
 the current metastore.
 We can re-organize the JDO models to reduce space usage to keep the metastore 
 scalable for the future.  Currently, partitions are the fastest growing 
 objects in the metastore, and the metastore keeps a separate copy of the 
 columns list for each partition.  We can normalize the metastore db by 
 decoupling Columns from Storage Descriptors and not storing duplicate lists 
 of the columns for each partition. 
 An idea is to create an additional level of indirection with a Column 
 Descriptor that has a list of columns.  A table has a reference to its 
 latest Column Descriptor (note: a table may have more than one Column 
 Descriptor in the case of schema evolution).  Partitions and Indexes can 
 reference the same Column Descriptors as their parent table.
 Currently, the COLUMNS table in the metastore has roughly (number of 
 partitions + number of tables) * (average number of columns pertable) rows.  
 We can reduce this to (number of tables) * (average number of columns per 
 table) rows, while incurring a small cost proportional to the number of 
 tables to store the Column Descriptors.
 Please see the latest review board for additional implementation details.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HIVE-2246) Dedupe tables' column schemas from partitions in the metastore db

2011-07-21 Thread Sohan Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sohan Jain updated HIVE-2246:
-

Attachment: HIVE-2246.3.patch

Adding some missing files that I forgot to svn add

 Dedupe tables' column schemas from partitions in the metastore db
 -

 Key: HIVE-2246
 URL: https://issues.apache.org/jira/browse/HIVE-2246
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Reporter: Sohan Jain
Assignee: Sohan Jain
 Attachments: HIVE-2246.2.patch, HIVE-2246.3.patch


 Note: this patch proposes a schema change, and is therefore incompatible with 
 the current metastore.
 We can re-organize the JDO models to reduce space usage to keep the metastore 
 scalable for the future.  Currently, partitions are the fastest growing 
 objects in the metastore, and the metastore keeps a separate copy of the 
 columns list for each partition.  We can normalize the metastore db by 
 decoupling Columns from Storage Descriptors and not storing duplicate lists 
 of the columns for each partition. 
 An idea is to create an additional level of indirection with a Column 
 Descriptor that has a list of columns.  A table has a reference to its 
 latest Column Descriptor (note: a table may have more than one Column 
 Descriptor in the case of schema evolution).  Partitions and Indexes can 
 reference the same Column Descriptors as their parent table.
 Currently, the COLUMNS table in the metastore has roughly (number of 
 partitions + number of tables) * (average number of columns pertable) rows.  
 We can reduce this to (number of tables) * (average number of columns per 
 table) rows, while incurring a small cost proportional to the number of 
 tables to store the Column Descriptors.
 Please see the latest review board for additional implementation details.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira