[jira] [Commented] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781438#comment-13781438 ]

Gang Tim Liu commented on HIVE-3959:

Yes, assign it to Dilip.

Update Partition Statistics in Metastore Layer
--
Key: HIVE-3959
URL: https://issues.apache.org/jira/browse/HIVE-3959
Project: Hive
Issue Type: Improvement
Components: Metastore, Statistics
Reporter: Bhushan Mandhani
Assignee: Dilip Joseph
Priority: Minor
Attachments: HIVE-3959.patch.1, HIVE-3959.patch.11.txt, HIVE-3959.patch.12.txt, HIVE-3959.patch.2

When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops.

--
This message was sent by Atlassian JIRA (v6.1#6144)
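To illustrate why the "fast stats" in the description above are cheap, here is a minimal sketch that derives numFiles and totalSize from file metadata alone, without opening any data file. It is not the patch attached to the issue: the function name `fast_stats` is hypothetical, a local directory stands in for a partition location on HDFS, and only files directly under the directory are counted (no recursion), which is an assumption.

```python
import os
import tempfile


def fast_stats(partition_dir):
    """Return numFiles and totalSize for one partition directory.

    Only directory entries and file sizes are read, so the cost depends on
    the number of files, not on the amount of data. rowCount and rawDataSize
    are deliberately left out because they would need a scan of the data.
    """
    num_files = 0
    total_size = 0
    for entry in os.scandir(partition_dir):
        # Assumption: only plain files directly under the directory count.
        if entry.is_file(follow_symlinks=False):
            num_files += 1
            total_size += entry.stat(follow_symlinks=False).st_size
    return {"numFiles": num_files, "totalSize": total_size}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as part_dir:
        # Two small files standing in for the data files of a partition.
        for name, payload in (("000000_0", b"a\tb\n"), ("000001_0", b"c\td\ne\tf\n")):
            with open(os.path.join(part_dir, name), "wb") as handle:
                handle.write(payload)
        print(fast_stats(part_dir))  # {'numFiles': 2, 'totalSize': 12}
```

Because nothing here reads file contents, the same computation could run as part of a metadata-only "add partition" call without turning it into an expensive operation, which is the trade-off the description argues for.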
[jira] [Assigned] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu reassigned HIVE-3959:
--
Assignee: Dilip Joseph  (was: Gang Tim Liu)

Update Partition Statistics in Metastore Layer
--
Key: HIVE-3959
URL: https://issues.apache.org/jira/browse/HIVE-3959
Project: Hive
Issue Type: Improvement
Components: Metastore, Statistics
Reporter: Bhushan Mandhani
Assignee: Dilip Joseph
Priority: Minor
Attachments: HIVE-3959.patch.1, HIVE-3959.patch.11.txt, HIVE-3959.patch.12.txt, HIVE-3959.patch.2

When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops.
[jira] [Assigned] (HIVE-3745) Hive does improper = based string comparisons for strings with trailing whitespaces
[ https://issues.apache.org/jira/browse/HIVE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu reassigned HIVE-3745:
--
Assignee: Kevin Wilfong  (was: Gang Tim Liu)

Hive does improper = based string comparisons for strings with trailing whitespaces
-
Key: HIVE-3745
URL: https://issues.apache.org/jira/browse/HIVE-3745
Project: Hive
Issue Type: Bug
Components: SQL
Affects Versions: 0.9.0
Reporter: Harsh J
Assignee: Kevin Wilfong

Compared to other systems such as DB2, MySQL, etc., which disregard trailing whitespaces in a string used when comparing two strings with the {{=}} relational operator, Hive does not do this.

For example, note the following line from the MySQL manual: http://dev.mysql.com/doc/refman/5.1/en/char.html
{quote}
All MySQL collations are of type PADSPACE. This means that all CHAR and VARCHAR values in MySQL are compared without regard to any trailing spaces.
{quote}

Hive still is whitespace sensitive and regards trailing spaces of a string as worthy elements when comparing. Ideally {{LIKE}} should consider this strongly, but {{=}} should not.

Is there a specific reason behind this difference of implementation in Hive's SQL?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
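The difference the report describes can be shown with a small sketch. It compares a PADSPACE-style equality, which ignores trailing spaces as in the MySQL quote above, with the exact comparison the report attributes to Hive. The function names are hypothetical and this is only an illustration of the two semantics, not Hive or MySQL code.

```python
def padspace_equals(left, right):
    """Equality that ignores trailing spaces (PADSPACE-style).

    Only trailing U+0020 characters are dropped; leading spaces and other
    whitespace remain significant.
    """
    return left.rstrip(" ") == right.rstrip(" ")


def exact_equals(left, right):
    """Equality that treats every character, including trailing spaces, as significant."""
    return left == right


if __name__ == "__main__":
    print(padspace_equals("hive", "hive  "))  # True
    print(exact_equals("hive", "hive  "))     # False
    print(padspace_equals(" hive", "hive"))   # False
```

Under the proposal in the report, `=` would follow the first function while `LIKE` would keep treating trailing spaces as significant.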
[jira] [Commented] (HIVE-3949) Some test failures in hadoop 23
[ https://issues.apache.org/jira/browse/HIVE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680052#comment-13680052 ]

Gang Tim Liu commented on HIVE-3949:

Sure, please feel free to work on it. Thanks.

Some test failures in hadoop 23
---
Key: HIVE-3949
URL: https://issues.apache.org/jira/browse/HIVE-3949
Project: Hive
Issue Type: Bug
Reporter: Gang Tim Liu
Assignee: Gang Tim Liu

This is a follow-up on HIVE-3873. We have fixed some test failures in HIVE-3873 and a few other JIRA issues. We will use this JIRA to track the remaining failures:
https://builds.apache.org/job/Hive-trunk-hadoop2/
[jira] [Assigned] (HIVE-3949) Some test failures in hadoop 23
[ https://issues.apache.org/jira/browse/HIVE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu reassigned HIVE-3949:
--
Assignee: Brock Noland  (was: Gang Tim Liu)

Some test failures in hadoop 23
---
Key: HIVE-3949
URL: https://issues.apache.org/jira/browse/HIVE-3949
Project: Hive
Issue Type: Bug
Reporter: Gang Tim Liu
Assignee: Brock Noland

This is a follow-up on HIVE-3873. We have fixed some test failures in HIVE-3873 and a few other JIRA issues. We will use this JIRA to track the remaining failures:
https://builds.apache.org/job/Hive-trunk-hadoop2/
[jira] [Commented] (HIVE-4474) Column access not tracked properly for partitioned tables
[ https://issues.apache.org/jira/browse/HIVE-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648537#comment-13648537 ]

Gang Tim Liu commented on HIVE-4474:

Committed. Thanks, Samuel Yuan.

Column access not tracked properly for partitioned tables
-
Key: HIVE-4474
URL: https://issues.apache.org/jira/browse/HIVE-4474
Project: Hive
Issue Type: Bug
Components: Query Processor
Reporter: Samuel Yuan
Assignee: Samuel Yuan
Attachments: HIVE-4474.1.patch.txt

The columns recorded as being accessed are incorrect for partitioned tables. The index of accessed columns is a position in the list of non-partition columns, but a list of all columns is being used right now to do the lookup.
[jira] [Updated] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu updated HIVE-3959:
---
Attachment:     (was: HIVE-3959.patch.9.txt)

Update Partition Statistics in Metastore Layer
--
Key: HIVE-3959
URL: https://issues.apache.org/jira/browse/HIVE-3959
Project: Hive
Issue Type: Improvement
Components: Metastore, Statistics
Reporter: Bhushan Mandhani
Assignee: Gang Tim Liu
Priority: Minor
Attachments: HIVE-3959.patch.1, HIVE-3959.patch.11.txt, HIVE-3959.patch.12.txt, HIVE-3959.patch.2

When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops.
[jira] [Updated] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu updated HIVE-3959:
---
Attachment: HIVE-3959.patch.12.txt

Update Partition Statistics in Metastore Layer
--
Key: HIVE-3959
URL: https://issues.apache.org/jira/browse/HIVE-3959
Project: Hive
Issue Type: Improvement
Components: Metastore, Statistics
Reporter: Bhushan Mandhani
Assignee: Gang Tim Liu
Priority: Minor
Attachments: HIVE-3959.patch.1, HIVE-3959.patch.11.txt, HIVE-3959.patch.12.txt, HIVE-3959.patch.2

When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops.
[jira] [Work started] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HIVE-3959 started by Gang Tim Liu.

Update Partition Statistics in Metastore Layer
--
Key: HIVE-3959
URL: https://issues.apache.org/jira/browse/HIVE-3959
Project: Hive
Issue Type: Improvement
Components: Metastore, Statistics
Reporter: Bhushan Mandhani
Assignee: Gang Tim Liu
Priority: Minor
Attachments: HIVE-3959.patch.1, HIVE-3959.patch.11.txt, HIVE-3959.patch.12.txt, HIVE-3959.patch.2

When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops.
[jira] [Updated] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu updated HIVE-3959:
---
Status: Patch Available  (was: In Progress)

Update Partition Statistics in Metastore Layer
--
Key: HIVE-3959
URL: https://issues.apache.org/jira/browse/HIVE-3959
Project: Hive
Issue Type: Improvement
Components: Metastore, Statistics
Reporter: Bhushan Mandhani
Assignee: Gang Tim Liu
Priority: Minor
Attachments: HIVE-3959.patch.1, HIVE-3959.patch.11.txt, HIVE-3959.patch.12.txt, HIVE-3959.patch.2

When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops.
[jira] [Commented] (HIVE-4474) Column access not tracked properly for partitioned tables
[ https://issues.apache.org/jira/browse/HIVE-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647763#comment-13647763 ]

Gang Tim Liu commented on HIVE-4474:

Running tests.

Column access not tracked properly for partitioned tables
-
Key: HIVE-4474
URL: https://issues.apache.org/jira/browse/HIVE-4474
Project: Hive
Issue Type: Bug
Components: Query Processor
Reporter: Samuel Yuan
Assignee: Samuel Yuan
Attachments: HIVE-4474.1.patch.txt

The columns recorded as being accessed are incorrect for partitioned tables. The index of accessed columns is a position in the list of non-partition columns, but a list of all columns is being used right now to do the lookup.
[jira] [Updated] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu updated HIVE-3959:
---
Attachment: HIVE-3959.patch.11.txt

Update Partition Statistics in Metastore Layer
--
Key: HIVE-3959
URL: https://issues.apache.org/jira/browse/HIVE-3959
Project: Hive
Issue Type: Improvement
Components: Metastore, Statistics
Reporter: Bhushan Mandhani
Assignee: Gang Tim Liu
Priority: Minor
Attachments: HIVE-3959.patch.1, HIVE-3959.patch.11.txt, HIVE-3959.patch.2, HIVE-3959.patch.9.txt

When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops.
[jira] [Updated] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu updated HIVE-3959:
---
Attachment:     (was: HIVE-3959.patch.2.nohcat)

Update Partition Statistics in Metastore Layer
--
Key: HIVE-3959
URL: https://issues.apache.org/jira/browse/HIVE-3959
Project: Hive
Issue Type: Improvement
Components: Metastore, Statistics
Reporter: Bhushan Mandhani
Assignee: Gang Tim Liu
Priority: Minor
Attachments: HIVE-3959.patch.1, HIVE-3959.patch.2, HIVE-3959.patch.9.txt

When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops.
[jira] [Updated] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu updated HIVE-3959:
---
Attachment: HIVE-3959.patch.9.txt

Update Partition Statistics in Metastore Layer
--
Key: HIVE-3959
URL: https://issues.apache.org/jira/browse/HIVE-3959
Project: Hive
Issue Type: Improvement
Components: Metastore, Statistics
Reporter: Bhushan Mandhani
Assignee: Gang Tim Liu
Priority: Minor
Attachments: HIVE-3959.patch.1, HIVE-3959.patch.2, HIVE-3959.patch.9.txt

When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops.
[jira] [Commented] (HIVE-4474) Column access not tracked properly for partitioned tables
[ https://issues.apache.org/jira/browse/HIVE-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647206#comment-13647206 ]

Gang Tim Liu commented on HIVE-4474:

+1

Column access not tracked properly for partitioned tables
-
Key: HIVE-4474
URL: https://issues.apache.org/jira/browse/HIVE-4474
Project: Hive
Issue Type: Bug
Components: Query Processor
Reporter: Samuel Yuan
Assignee: Samuel Yuan
Attachments: HIVE-4474.1.patch.txt

The columns recorded as being accessed are incorrect for partitioned tables. The index of accessed columns is a position in the list of non-partition columns, but a list of all columns is being used right now to do the lookup.
[jira] [Created] (HIVE-4456) Datanucleus throws NPE after passing a config from test file (.q) to hive metastore
Gang Tim Liu created HIVE-4456:
--
Summary: Datanucleus throws NPE after passing a config from test file (.q) to hive metastore
Key: HIVE-4456
URL: https://issues.apache.org/jira/browse/HIVE-4456
Project: Hive
Issue Type: Bug
Components: Configuration, Metastore
Reporter: Gang Tim Liu
Priority: Critical

Create a configuration file with the following:
set hive.metastore.ds.retry.interval=2000;
create table analyze_srcpart like srcpart;

Run:
ant test -Dtestcase=TestCliDriver -Dqfile=file

An NPE is thrown. See attached files.

Is there anything special about hive.metastore.ds.retry.interval? It is a config listed under HiveConf.metaVars. Then, HiveConf.get(HiveConf c) will recreate a new conf while detecting a difference.
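The last sentence of the report describes a "recreate when a metastore variable differs" check. The following is a rough sketch of that idea only, not Hive's actual code: the tuple `META_VARS` is an illustrative subset rather than the real contents of HiveConf.metaVars, and both function names are hypothetical.

```python
# Illustrative subset only; the real HiveConf.metaVars list is longer.
META_VARS = (
    "hive.metastore.ds.retry.interval",
    "hive.metastore.uris",
)


def changed_meta_vars(old_conf, new_conf, meta_vars=META_VARS):
    """Return the metastore-related keys whose values differ between two configs."""
    return [key for key in meta_vars if old_conf.get(key) != new_conf.get(key)]


def needs_new_client(old_conf, new_conf):
    """A cached metastore client would be discarded when any tracked key changed."""
    return bool(changed_meta_vars(old_conf, new_conf))


if __name__ == "__main__":
    before = {"hive.metastore.ds.retry.interval": "1000"}
    after = dict(before)
    # Same effect as the "set ..." line in the .q file from the report.
    after["hive.metastore.ds.retry.interval"] = "2000"
    print(changed_meta_vars(before, after))  # ['hive.metastore.ds.retry.interval']
    print(needs_new_client(before, after))   # True
```

If the real check works along these lines, setting a tracked variable in a .q file would trigger a fresh client and configuration mid-test, which may be where the reported NPE comes from; the report itself does not confirm the cause.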
[jira] [Updated] (HIVE-4456) Datanucleus throws NPE after passing a config from test file (.q) to hive metastore
[ https://issues.apache.org/jira/browse/HIVE-4456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu updated HIVE-4456:
---
Attachment: err.txt

Datanucleus throws NPE after passing a config from test file (.q) to hive metastore
---
Key: HIVE-4456
URL: https://issues.apache.org/jira/browse/HIVE-4456
Project: Hive
Issue Type: Bug
Components: Configuration, Metastore
Reporter: Gang Tim Liu
Priority: Critical
Attachments: err.txt

Create a configuration file with the following:
set hive.metastore.ds.retry.interval=2000;
create table analyze_srcpart like srcpart;

Run:
ant test -Dtestcase=TestCliDriver -Dqfile=file

An NPE is thrown. See attached files.

Is there anything special about hive.metastore.ds.retry.interval? It is a config listed under HiveConf.metaVars. Then, HiveConf.get(HiveConf c) will recreate a new conf while detecting a difference.
[jira] [Commented] (HIVE-4389) thrift files are re-generated by compiling
[ https://issues.apache.org/jira/browse/HIVE-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644631#comment-13644631 ]

Gang Tim Liu commented on HIVE-4389:

+1

thrift files are re-generated by compiling
--
Key: HIVE-4389
URL: https://issues.apache.org/jira/browse/HIVE-4389
Project: Hive
Issue Type: Bug
Reporter: Namit Jain
Assignee: Namit Jain
Attachments: hive.4389.1.patch

I am not sure what is going on, but there seems to be a bunch of thrift changes if I perform ant thriftif.
How to pass config from qfile to Hive Metastore
Hi Dear all,

I want to set a configuration in a file and pass it to the Hive Metastore, for example to the logic in HiveAlterHandler.java. In order to do that, this configuration should be in HiveConf.metaVars. But a simple test got an NPE. Does anyone have experience passing a config from a qfile to the Hive metastore?

Attached is status.q. It sets hive.metastore.ds.retry.interval=2000, which is part of HiveConf.metaVars.

Attached is error.txt.

If we remove the config line from status.q, it works.

Thanks
Tim

2013-04-26 14:34:41,603 ERROR exec.Task (SessionState.java:printError(388)) - FAILED: Error in metadata: Unable to fetch table srcpart
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table srcpart
	at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:957)
	at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:891)
	at org.apache.hadoop.hive.ql.exec.DDLTask.createTableLike(DDLTask.java:3803)
	at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:279)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:145)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1355)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1139)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:945)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:348)
	at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:790)
	at org.apache.hadoop.hive.cli.TestCliDriver.runTest(TestCliDriver.java:124)
	at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats60(TestCliDriver.java:108)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at junit.framework.TestCase.runTest(TestCase.java:154)
	at junit.framework.TestCase.runBare(TestCase.java:127)
	at junit.framework.TestResult$1.protect(TestResult.java:106)
	at junit.framework.TestResult.runProtected(TestResult.java:124)
	at junit.framework.TestResult.run(TestResult.java:109)
	at junit.framework.TestCase.run(TestCase.java:118)
	at junit.framework.TestSuite.runTest(TestSuite.java:208)
	at junit.framework.TestSuite.run(TestSuite.java:203)
	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422)
	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931)
	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785)
Caused by: java.lang.NullPointerException
	at org.datanucleus.sco.simple.Set.<init>(Set.java:68)
	at org.datanucleus.sco.backed.Set.<init>(Set.java:94)
	at org.datanucleus.sco.backed.Map.entrySet(Map.java:418)
	at org.apache.hadoop.hive.metastore.api.SerDeInfo.<init>(SerDeInfo.java:157)
	at org.apache.hadoop.hive.metastore.api.StorageDescriptor.<init>(StorageDescriptor.java:256)
	at org.apache.hadoop.hive.metastore.api.Table.<init>(Table.java:260)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.deepCopy(HiveMetaStoreClient.java:1177)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
	at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:74)
	at $Proxy7.getTable(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:949)
	... 30 more
2013-04-26 14:34:41,603 DEBUG exec.DDLTask (DDLTask.java:execute(459)) - org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table srcpart
	at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:957)
	at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:891)
	at org.apache.hadoop.hive.ql.exec.DDLTask.createTableLike(DDLTask.java:3803)
	at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:279)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:145)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
	at
[jira] [Assigned] (HIVE-3682) when output hive table to file,users should could have a separator of their own choice
[ https://issues.apache.org/jira/browse/HIVE-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu reassigned HIVE-3682:
--
Assignee:     (was: Gang Tim Liu)

when output hive table to file,users should could have a separator of their own choice
--
Key: HIVE-3682
URL: https://issues.apache.org/jira/browse/HIVE-3682
Project: Hive
Issue Type: New Feature
Components: CLI
Affects Versions: 0.8.1
Environment: Linux 3.0.0-14-generic #23-Ubuntu SMP Mon Nov 21 20:34:47 UTC 2011 i686 i686 i386 GNU/Linux
java version "1.6.0_25"
hadoop-0.20.2-cdh3u0
hive-0.8.1
Reporter: caofangkun
Attachments: HIVE-3682-1.patch, HIVE-3682.D10275.1.patch, HIVE-3682.with.serde.patch

By default, when a Hive table is output to a file, the columns of the table are separated by the ^A character (that is, \001). But users should have the right to set a separator of their own choice.

Usage Example:
create table for_test (key string, value string);
load data local inpath './in1.txt' into table for_test
select * from for_test;

UT-01: default separator is \001, line separator is \n
insert overwrite local directory './test-01'
select * from src ;

create table array_table (a array<string>, b array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

load data local inpath "../hive/examples/files/arraytest.txt" overwrite into table table2;

CREATE TABLE map_table (foo STRING , bar MAP<STRING, STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

UT-02: defined field separator as ':'
insert overwrite local directory './test-02'
row format delimited
FIELDS TERMINATED BY ':'
select * from src ;

UT-03: the line separator is NOT ALLOWED to be defined as another separator
insert overwrite local directory './test-03'
row format delimited
FIELDS TERMINATED BY ':'
select * from src ;

UT-04: define map separators
insert overwrite local directory './test-04'
row format delimited
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
select * from src;
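To make the three separator levels in the usage examples above concrete, here is a small sketch of how one row might be written out as delimited text. It is an illustration only, not the code from the attached patches: the function `serialize_row` and its parameter names are hypothetical. The default values mirror Hive's default delimiters (\001 for fields, \002 for collection items, \003 for map keys).

```python
def serialize_row(row, field_sep="\x01", item_sep="\x02", key_sep="\x03"):
    """Join one row into a delimited line.

    field_sep separates columns (FIELDS TERMINATED BY),
    item_sep separates array or map entries (COLLECTION ITEMS TERMINATED BY),
    key_sep separates a map key from its value (MAP KEYS TERMINATED BY).
    """
    columns = []
    for value in row:
        if isinstance(value, dict):
            entries = (f"{key}{key_sep}{val}" for key, val in value.items())
            columns.append(item_sep.join(entries))
        elif isinstance(value, (list, tuple)):
            columns.append(item_sep.join(str(item) for item in value))
        else:
            columns.append(str(value))
    return field_sep.join(columns)


if __name__ == "__main__":
    # Same idea as UT-02: the field separator is defined as ':'.
    print(serialize_row(["k", "v"], field_sep=":"))  # k:v
    # Same idea as UT-04: tab between fields, ',' between items, ':' for map keys.
    print(serialize_row(["k", ["a", "b"], {"x": "1"}], "\t", ",", ":"))
```

The line separator is intentionally not a parameter, matching UT-03, which says it may not be redefined.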
[jira] [Assigned] (HIVE-3682) when output hive table to file,users should could have a separator of their own choice
[ https://issues.apache.org/jira/browse/HIVE-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Tim Liu reassigned HIVE-3682:
--
Assignee: Sushanth Sowmyan

when output hive table to file,users should could have a separator of their own choice
--
Key: HIVE-3682
URL: https://issues.apache.org/jira/browse/HIVE-3682
Project: Hive
Issue Type: New Feature
Components: CLI
Affects Versions: 0.8.1
Environment: Linux 3.0.0-14-generic #23-Ubuntu SMP Mon Nov 21 20:34:47 UTC 2011 i686 i686 i386 GNU/Linux
java version "1.6.0_25"
hadoop-0.20.2-cdh3u0
hive-0.8.1
Reporter: caofangkun
Assignee: Sushanth Sowmyan
Attachments: HIVE-3682-1.patch, HIVE-3682.D10275.1.patch, HIVE-3682.with.serde.patch

By default, when a Hive table is output to a file, the columns of the table are separated by the ^A character (that is, \001). But users should have the right to set a separator of their own choice.

Usage Example:
create table for_test (key string, value string);
load data local inpath './in1.txt' into table for_test
select * from for_test;

UT-01: default separator is \001, line separator is \n
insert overwrite local directory './test-01'
select * from src ;

create table array_table (a array<string>, b array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

load data local inpath "../hive/examples/files/arraytest.txt" overwrite into table table2;

CREATE TABLE map_table (foo STRING , bar MAP<STRING, STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

UT-02: defined field separator as ':'
insert overwrite local directory './test-02'
row format delimited
FIELDS TERMINATED BY ':'
select * from src ;

UT-03: the line separator is NOT ALLOWED to be defined as another separator
insert overwrite local directory './test-03'
row format delimited
FIELDS TERMINATED BY ':'
select * from src ;

UT-04: define map separators
insert overwrite local directory './test-04'
row format delimited
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
select * from src;
[jira] [Commented] (HIVE-4310) optimize count(distinct) with hive.map.groupby.sorted
[ https://issues.apache.org/jira/browse/HIVE-4310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13637097#comment-13637097 ] Gang Tim Liu commented on HIVE-4310: +1 optimize count(distinct) with hive.map.groupby.sorted - Key: HIVE-4310 URL: https://issues.apache.org/jira/browse/HIVE-4310 Project: Hive Issue Type: Improvement Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4310.1.patch, hive.4310.1.patch-nohcat, hive.4310.2.patch-nohcat, hive.4310.3.patch-nohcat, hive.4310.4.patch
Re: hi
Super like it.

On 4/18/13 5:31 AM, Namit Jain nj...@fb.com wrote:

Hi,

Since we are developing at a very fast pace, it would be really useful to think about maintainability and testing of the large codebase. Historically, we have not focussed on a few things, and they might soon bite us. I wanted to propose the following for all checkins:

1. Javadoc for all public/private functions, except for setters/getters. For any complex function, clear examples (input/output) would really help.
2. Convention for variable/function names: do we have any?
3. If possible, the test name (.q file) where the function is being invoked, or the query which would potentially test that scenario, if it is a query processor change.
4. Especially for query optimizations, it might be a good idea to have a simple working query at the top, and the expected changes. For example, the operator tree for that query at each step, or a detailed explanation at the top.
5. Comments in each test (.q file) that include the jira number, what it is trying to test, and the assumptions about each query.
6. Reduce the output for each test: whenever a query outputs more than 10 results, it should have a reason. Otherwise, each query result should be bounded by 10 rows.

In general, focussing on a lot of comments in the code will go a long way for everyone to follow along.

Thanks,
-namit
[jira] [Created] (HIVE-4377) Add more comment to https://reviews.facebook.net/D1209 (HIVE-2340)
Gang Tim Liu created HIVE-4377: -- Summary: Add more comment to https://reviews.facebook.net/D1209 (HIVE-2340) Key: HIVE-4377 URL: https://issues.apache.org/jira/browse/HIVE-4377 Project: Hive Issue Type: Bug Components: Query Processor Reporter: Gang Tim Liu Assignee: Navis

Thanks a lot for addressing the optimization in HIVE-2340. Awesome! Since we are developing at a very fast pace, it would be really useful to think about maintainability and testing of the large codebase. Highlights which are applicable to D1209:

1. Javadoc for all public/private functions, except for setters/getters. For any complex function, clear examples (input/output) would really help.
2. Especially for query optimizations, it might be a good idea to have a simple working query at the top, and the expected changes. For example, the operator tree for that query at each step, or a detailed explanation at the top.
3. If possible, the test name (.q file) where the function is being invoked, or the query which would potentially test that scenario, if it is a query processor change.
4. Comments in each test (.q file) that include the jira number, what it is trying to test, and the assumptions about each query.
5. Reduce the output for each test: whenever a query outputs more than 10 results, it should have a reason. Otherwise, each query result should be bounded by 10 rows.

Thanks a lot
[jira] [Commented] (HIVE-446) Implement TRUNCATE
[ https://issues.apache.org/jira/browse/HIVE-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630258#comment-13630258 ] Gang Tim Liu commented on HIVE-446: --- An external table is used in a context where the data is not fully managed. If it turns out that the data behind an external table needs to be removed, one can ask why the table was defined as external in the first place. That said, the proposed syntax and semantics may not be consistent with the external table use case. thanks
Implement TRUNCATE -- Key: HIVE-446 URL: https://issues.apache.org/jira/browse/HIVE-446 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Prasad Chakka Assignee: Navis Fix For: 0.11.0 Attachments: HIVE-446.D7371.1.patch, HIVE-446.D7371.2.patch, HIVE-446.D7371.3.patch, HIVE-446.D7371.4.patch
truncate the data but leave the table and metadata intact.
[jira] [Commented] (HIVE-4322) SkewedInfo in Metastore Thrift API cannot be deserialized in Python
[ https://issues.apache.org/jira/browse/HIVE-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630546#comment-13630546 ] Gang Tim Liu commented on HIVE-4322: +1 after test passes
SkewedInfo in Metastore Thrift API cannot be deserialized in Python --- Key: HIVE-4322 URL: https://issues.apache.org/jira/browse/HIVE-4322 Project: Hive Issue Type: Bug Components: Metastore, Thrift API Affects Versions: 0.11.0 Reporter: Samuel Yuan Assignee: Samuel Yuan Priority: Minor Attachments: HIVE-4322.HIVE-4322.HIVE-4322.HIVE-4322.D10203.1.patch
The Thrift-generated Python code that deserializes Thrift objects fails whenever a complex type is used as a map key, because by default mutable Python objects such as lists do not have a hash function. See https://issues.apache.org/jira/browse/THRIFT-162 for related discussion. The SkewedInfo struct contains a map which uses a list as a key, breaking the Python Thrift interface. It is not possible to specify the mapping from Thrift types to Python types, or otherwise we could map Thrift lists to Python tuples.
Instead, the proposed workaround wraps the list inside a new struct. This alone does not accomplish anything, but allows Python clients to define a hash function for the struct class, e.g.:
    def f(object):
        return hash(tuple(object.skewedValueList))
    SkewedValueList.__hash__ = f
In practice a more efficient hash might be defined that does not involve copying the list. The advantage of wrapping the list inside a struct is that the client does not have to define the hash on the list itself, which would change the behaviour of lists everywhere else in the code.
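The workaround described in HIVE-4322 can be tried out with a small self-contained Python sketch. The `SkewedValueList` class below is a simplified stand-in written for this example, not the actual Thrift-generated class, and the location string is made up; only the `__hash__` assignment follows the snippet in the issue text.

```python
class SkewedValueList:
    """Simplified stand-in for the Thrift-generated wrapper struct (assumption)."""

    def __init__(self, skewedValueList=None):
        self.skewedValueList = skewedValueList

    def __eq__(self, other):
        # Equality by value, so two wrappers around equal lists match as keys.
        return (isinstance(other, SkewedValueList)
                and self.skewedValueList == other.skewedValueList)


def _hash_skewed_value_list(self):
    # Hash by value; copying into a tuple is the simple variant from the issue.
    return hash(tuple(self.skewedValueList))


# Client-side patch: only the wrapper becomes hashable, plain lists stay as they are.
SkewedValueList.__hash__ = _hash_skewed_value_list

# A plain list still cannot be used as a dict key ...
try:
    {["a", "b"]: "/some/location"}
    list_key_ok = True
except TypeError:
    list_key_ok = False

# ... but the wrapped list can.
locations = {SkewedValueList(["a", "b"]): "/warehouse/t/x=a/y=b"}
lookup = locations[SkewedValueList(["a", "b"])]
print(list_key_ok, lookup)  # prints False /warehouse/t/x=a/y=b
```

Defining `__eq__` alongside `__hash__` matters here: a dict lookup needs both, and in Python 3 a class that defines `__eq__` without `__hash__` is unhashable by default.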
[jira] [Created] (HIVE-4351) Thrift code generation fails due to hcatalog
Gang Tim Liu created HIVE-4351: -- Summary: Thrift code generation fails due to hcatalog Key: HIVE-4351 URL: https://issues.apache.org/jira/browse/HIVE-4351 Project: Hive Issue Type: Bug Components: Thrift API Affects Versions: 0.11.0 Reporter: Gang Tim Liu Assignee: Ashutosh Chauhan
Thrift code generation fails because hcatalog does not have the target thriftif:
  ant thriftif -Dthrift.home=/usr/local
  .
  BUILD FAILED
  Target "thriftif" does not exist in the project "hcatalog".
[jira] [Commented] (HIVE-4322) SkewedInfo in Metastore Thrift API cannot be deserialized in Python
[ https://issues.apache.org/jira/browse/HIVE-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630689#comment-13630689 ] Gang Tim Liu commented on HIVE-4322: Committed. thank Samuel Yuan. SkewedInfo in Metastore Thrift API cannot be deserialized in Python --- Key: HIVE-4322 URL: https://issues.apache.org/jira/browse/HIVE-4322 Project: Hive Issue Type: Bug Components: Metastore, Thrift API Affects Versions: 0.11.0 Reporter: Samuel Yuan Assignee: Samuel Yuan Priority: Minor Attachments: HIVE-4322.HIVE-4322.HIVE-4322.HIVE-4322.D10203.1.patch
[jira] [Updated] (HIVE-4322) SkewedInfo in Metastore Thrift API cannot be deserialized in Python
[ https://issues.apache.org/jira/browse/HIVE-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4322: --- Resolution: Fixed Fix Version/s: 0.11.0 Status: Resolved (was: Patch Available) SkewedInfo in Metastore Thrift API cannot be deserialized in Python --- Key: HIVE-4322 URL: https://issues.apache.org/jira/browse/HIVE-4322 Project: Hive Issue Type: Bug Components: Metastore, Thrift API Affects Versions: 0.11.0 Reporter: Samuel Yuan Assignee: Samuel Yuan Priority: Minor Fix For: 0.11.0 Attachments: HIVE-4322.HIVE-4322.HIVE-4322.HIVE-4322.D10203.1.patch
[jira] [Commented] (HIVE-4351) Thrift code generation fails due to hcatalog
[ https://issues.apache.org/jira/browse/HIVE-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630691#comment-13630691 ] Gang Tim Liu commented on HIVE-4351: thank [~ashutoshc] very much Thrift code generation fails due to hcatalog Key: HIVE-4351 URL: https://issues.apache.org/jira/browse/HIVE-4351 Project: Hive Issue Type: Bug Components: Thrift API Affects Versions: 0.11.0 Reporter: Gang Tim Liu Assignee: Ashutosh Chauhan Fix For: 0.12.0 It fails to generate thrift code since hcatalog doesn't have Target thriftif ant thriftif -Dthrift.home=/usr/local . BUILD FAILED Target thriftif does not exist in the project hcatalog.
[jira] [Commented] (HIVE-4241) optimize hive.enforce.sorting and hive.enforce bucketing join
[ https://issues.apache.org/jira/browse/HIVE-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13629237#comment-13629237 ] Gang Tim Liu commented on HIVE-4241: +1
optimize hive.enforce.sorting and hive.enforce bucketing join - Key: HIVE-4241 URL: https://issues.apache.org/jira/browse/HIVE-4241 Project: Hive Issue Type: Improvement Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4241.1.patch, hive.4241.1.patch-nohcat, hive.4241.2.patch-nohcat
Consider the following scenario:
  T1: sorted and bucketed by key into 2 buckets
  T2: sorted and bucketed by key into 2 buckets
  T3: sorted and bucketed by key into 2 buckets
  set hive.enforce.sorting=true;
  set hive.enforce.bucketing=true;
  insert overwrite table T3 select .. from T1 join T2 on T1.key = T2.key;
Since T1, T2 and T3 are sorted/bucketed by the join key, and the above join is being performed as a sort-merge join, T3 should be bucketed/sorted without the need for an extra reducer.
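The reasoning in HIVE-4241 rests on the fact that a sort-merge join over inputs sorted by the join key emits its output in key order. The following Python sketch illustrates that property for a single pair of buckets. It is a toy model written for this note, not Hive's implementation; `sort_merge_join` and the sample rows are made up.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) pairs, each already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left-side row of this key with that run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


# One bucket of T1 and the matching bucket of T2, both sorted by key.
t1_bucket = [(1, "a"), (3, "b"), (3, "c"), (5, "d")]
t2_bucket = [(3, "x"), (5, "y"), (5, "z"), (7, "w")]

joined = sort_merge_join(t1_bucket, t2_bucket)
keys = [row[0] for row in joined]
print(joined)
print(keys == sorted(keys))  # prints True
```

Because each pair of matching buckets produces rows already ordered by key, writing them out per bucket keeps the target table sorted and bucketed, which is why the issue argues that no extra reducer should be needed.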
[jira] [Commented] (HIVE-4337) Update list bucketing test results
[ https://issues.apache.org/jira/browse/HIVE-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13628419#comment-13628419 ] Gang Tim Liu commented on HIVE-4337: +1 Update list bucketing test results -- Key: HIVE-4337 URL: https://issues.apache.org/jira/browse/HIVE-4337 Project: Hive Issue Type: Test Components: Tests Affects Versions: 0.11.0 Reporter: Samuel Yuan Assignee: Samuel Yuan Priority: Trivial Attachments: HIVE-4337.HIVE-4337.HIVE-4337.D10131.1.patch A recent change resulted in different output for the list bucketing tests, which run for Hadoop23. The output files were not updated to reflect this.
[jira] [Updated] (HIVE-4337) Update list bucketing test results
[ https://issues.apache.org/jira/browse/HIVE-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4337: --- Status: Patch Available (was: Open) Update list bucketing test results -- Key: HIVE-4337 URL: https://issues.apache.org/jira/browse/HIVE-4337 Project: Hive Issue Type: Test Components: Tests Affects Versions: 0.11.0 Reporter: Samuel Yuan Assignee: Samuel Yuan Priority: Trivial Attachments: HIVE-4337.HIVE-4337.HIVE-4337.D10131.1.patch A recent change resulted in different output for the list bucketing tests, which run for Hadoop23. The output files were not updated to reflect this.
[jira] [Updated] (HIVE-4337) Update list bucketing test results
[ https://issues.apache.org/jira/browse/HIVE-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4337: --- Resolution: Fixed Fix Version/s: 0.11.0 Status: Resolved (was: Patch Available) Update list bucketing test results -- Key: HIVE-4337 URL: https://issues.apache.org/jira/browse/HIVE-4337 Project: Hive Issue Type: Test Components: Tests Affects Versions: 0.11.0 Reporter: Samuel Yuan Assignee: Samuel Yuan Priority: Trivial Fix For: 0.11.0 Attachments: HIVE-4337.HIVE-4337.HIVE-4337.D10131.1.patch A recent change resulted in different output for the list bucketing tests, which run for Hadoop23. The output files were not updated to reflect this.
[jira] [Commented] (HIVE-4322) SkewedInfo in Metastore Thrift API cannot be deserialized in Python
[ https://issues.apache.org/jira/browse/HIVE-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13627275#comment-13627275 ] Gang Tim Liu commented on HIVE-4322: [~sxyuan] Good write up. thank you for working on it. SkewedInfo in Metastore Thrift API cannot be deserialized in Python --- Key: HIVE-4322 URL: https://issues.apache.org/jira/browse/HIVE-4322 Project: Hive Issue Type: Bug Components: Metastore, Thrift API Affects Versions: 0.11.0 Reporter: Samuel Yuan Assignee: Samuel Yuan Priority: Minor
[jira] [Commented] (HIVE-4298) add tests for distincts for hive.map.groutp.sorted
[ https://issues.apache.org/jira/browse/HIVE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13624261#comment-13624261 ] Gang Tim Liu commented on HIVE-4298: +1 add tests for distincts for hive.map.groutp.sorted -- Key: HIVE-4298 URL: https://issues.apache.org/jira/browse/HIVE-4298 Project: Hive Issue Type: Test Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4298.1.patch, hive.4298.2.patch
[jira] [Commented] (HIVE-4298) add tests for distincts for hive.map.groutp.sorted
[ https://issues.apache.org/jira/browse/HIVE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13624275#comment-13624275 ] Gang Tim Liu commented on HIVE-4298: Committed. thank Namit. add tests for distincts for hive.map.groutp.sorted -- Key: HIVE-4298 URL: https://issues.apache.org/jira/browse/HIVE-4298 Project: Hive Issue Type: Test Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4298.1.patch, hive.4298.2.patch
[jira] [Commented] (HIVE-4298) add tests for distincts for hive.map.groutp.sorted
[ https://issues.apache.org/jira/browse/HIVE-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13624316#comment-13624316 ] Gang Tim Liu commented on HIVE-4298: Woo, thank Ashutosh add tests for distincts for hive.map.groutp.sorted -- Key: HIVE-4298 URL: https://issues.apache.org/jira/browse/HIVE-4298 Project: Hive Issue Type: Test Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Fix For: 0.11.0 Attachments: hive.4298.1.patch, hive.4298.2.patch
[jira] [Assigned] (HIVE-4213) List bucketing error too restrictive
[ https://issues.apache.org/jira/browse/HIVE-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu reassigned HIVE-4213: -- Assignee: Gang Tim Liu
List bucketing error too restrictive Key: HIVE-4213 URL: https://issues.apache.org/jira/browse/HIVE-4213 Project: Hive Issue Type: Bug Affects Versions: 0.10.0 Reporter: Mark Grover Assignee: Gang Tim Liu Fix For: 0.11.0
With the introduction of List bucketing, we introduced a config validation step where we say:
{code}
SUPPORT_DIR_MUST_TRUE_FOR_LIST_BUCKETING(
    10199,
    "hive.mapred.supports.subdirectories must be true"
        + " if any one of following is true: hive.internal.ddl.list.bucketing.enable,"
        + " hive.optimize.listbucketing and mapred.input.dir.recursive"),
{code}
This seems overly restrictive, because there are use cases where people may want to set {{mapred.input.dir.recursive}} to {{true}} even when they don't care about list bucketing. Is that not true?
For example, here is the unit test code for {{clientpositive/recursive_dir.q}}
{code}
CREATE TABLE fact_daily(x int) PARTITIONED BY (ds STRING);
CREATE TABLE fact_tz(x int) PARTITIONED BY (ds STRING, hr STRING) LOCATION 'pfile:${system:test.tmp.dir}/fact_tz';
INSERT OVERWRITE TABLE fact_tz PARTITION (ds='1', hr='1') SELECT key+11 FROM src WHERE key=484;
ALTER TABLE fact_daily SET TBLPROPERTIES('EXTERNAL'='TRUE');
ALTER TABLE fact_daily ADD PARTITION (ds='1') LOCATION 'pfile:${system:test.tmp.dir}/fact_tz/ds=1';
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SELECT * FROM fact_daily WHERE ds='1';
SELECT count(1) FROM fact_daily WHERE ds='1';
{code}
The unit test doesn't seem to be concerned about list bucketing but wants to set {{mapred.input.dir.recursive}} to {{true}}.
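The validation rule quoted in HIVE-4213 can be restated as a small predicate. The Python sketch below is only a model of that rule for discussion: `check_subdirectory_support` is a hypothetical function, not Hive code, and it assumes that every unset property defaults to false.

```python
# Properties that, per the quoted error message, require subdirectory support.
DEPENDENT_KEYS = (
    "hive.internal.ddl.list.bucketing.enable",
    "hive.optimize.listbucketing",
    "mapred.input.dir.recursive",
)
REQUIRED_KEY = "hive.mapred.supports.subdirectories"


def check_subdirectory_support(conf):
    """Return an error string if the quoted rule is violated, else None."""

    def is_true(key):
        # Assumption for this sketch: unset properties count as false.
        return str(conf.get(key, "false")).lower() == "true"

    enabled = [key for key in DEPENDENT_KEYS if is_true(key)]
    if enabled and not is_true(REQUIRED_KEY):
        return REQUIRED_KEY + " must be true because these are set: " + ", ".join(enabled)
    return None


# Settings used by recursive_dir.q: the rule is satisfied.
ok_conf = {
    "hive.mapred.supports.subdirectories": "true",
    "mapred.input.dir.recursive": "true",
}
# The case raised in the issue: recursive input without subdirectory support.
bad_conf = {"mapred.input.dir.recursive": "true"}

print(check_subdirectory_support(ok_conf))   # prints None
print(check_subdirectory_support(bad_conf))
```

The second call is the configuration the reporter would like to allow; under the rule as quoted, it is rejected even though no list bucketing property is set.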
[jira] [Resolved] (HIVE-4213) List bucketing error too restrictive
[ https://issues.apache.org/jira/browse/HIVE-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu resolved HIVE-4213. Resolution: Not A Problem List bucketing error too restrictive Key: HIVE-4213 URL: https://issues.apache.org/jira/browse/HIVE-4213 Project: Hive Issue Type: Bug Affects Versions: 0.10.0 Reporter: Mark Grover Assignee: Gang Tim Liu Fix For: 0.11.0
[jira] [Commented] (HIVE-4213) List bucketing error too restrictive
[ https://issues.apache.org/jira/browse/HIVE-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621169#comment-13621169 ] Gang Tim Liu commented on HIVE-4213: Hi [~mgrover] No problem. Not sure it is valid if mapred.input.dir.recursive is true but hive.mapred.supports.subdirectories is false. cc [~namitjain] would you please confirm? thanks List bucketing error too restrictive Key: HIVE-4213 URL: https://issues.apache.org/jira/browse/HIVE-4213 Project: Hive Issue Type: Bug Affects Versions: 0.10.0 Reporter: Mark Grover Assignee: Gang Tim Liu Fix For: 0.11.0
[jira] [Updated] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3959: --- Attachment: HIVE-3959.patch.1 Update Partition Statistics in Metastore Layer -- Key: HIVE-3959 URL: https://issues.apache.org/jira/browse/HIVE-3959 Project: Hive Issue Type: Improvement Components: Metastore, Statistics Reporter: Bhushan Mandhani Assignee: Gang Tim Liu Priority: Minor Attachments: HIVE-3959.patch.1 When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops.
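The distinction HIVE-3959 draws between fast stats and scan-based stats can be illustrated with a short Python sketch. It uses a local temporary directory as a stand-in for a partition location and is not taken from the attached patches; `fast_stats`, the file names, and the payloads are assumptions made for the example. It counts only the files directly inside the directory.

```python
import os
import tempfile


def fast_stats(partition_dir):
    """Compute numFiles and totalSize from file metadata only; no file is opened."""
    num_files = 0
    total_size = 0
    with os.scandir(partition_dir) as entries:
        for entry in entries:
            if entry.is_file():
                num_files += 1
                total_size += entry.stat().st_size
    return {"numFiles": num_files, "totalSize": total_size}


# Build a throwaway "partition" with two small data files (4 and 8 bytes).
with tempfile.TemporaryDirectory() as part:
    for name, payload in (("000000_0", b"a\x01b\n"), ("000001_0", b"c\x01d\ne\x01f\n")):
        with open(os.path.join(part, name), "wb") as f:
            f.write(payload)
    stats = fast_stats(part)

print(stats)  # prints {'numFiles': 2, 'totalSize': 12}
```

Computing rowCount or rawDataSize, by contrast, would require reading and parsing every file, which is why the issue leaves them out of the metadata-only operations.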
[jira] [Updated] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3959: --- Attachment: HIVE-3959.patch.2 Update Partition Statistics in Metastore Layer -- Key: HIVE-3959 URL: https://issues.apache.org/jira/browse/HIVE-3959 Project: Hive Issue Type: Improvement Components: Metastore, Statistics Reporter: Bhushan Mandhani Assignee: Gang Tim Liu Priority: Minor Attachments: HIVE-3959.patch.1, HIVE-3959.patch.2
[jira] [Updated] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3959: --- Attachment: HIVE-3959.patch.2.nohcat Update Partition Statistics in Metastore Layer -- Key: HIVE-3959 URL: https://issues.apache.org/jira/browse/HIVE-3959 Project: Hive Issue Type: Improvement Components: Metastore, Statistics Reporter: Bhushan Mandhani Assignee: Gang Tim Liu Priority: Minor Attachments: HIVE-3959.patch.1, HIVE-3959.patch.2, HIVE-3959.patch.2.nohcat
[jira] [Commented] (HIVE-4281) add hive.map.groupby.sorted.testmode
[ https://issues.apache.org/jira/browse/HIVE-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621408#comment-13621408 ] Gang Tim Liu commented on HIVE-4281: +1 add hive.map.groupby.sorted.testmode Key: HIVE-4281 URL: https://issues.apache.org/jira/browse/HIVE-4281 Project: Hive Issue Type: Improvement Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4281.1.patch, hive.4281.2.patch, hive.4281.2.patch-nohcat, hive.4281.3.patch The idea behind this would be to test hive.map.groupby.sorted. Since this is a new feature, it might be a good idea to run it in test mode, where a query property would denote that this query plan would have changed. If a customer wants, they can run those queries offline, compare the results for correctness, and set hive.map.groupby.sorted only if all the results are the same. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4272) partition wise metadata does not work for text files
[ https://issues.apache.org/jira/browse/HIVE-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13619856#comment-13619856 ] Gang Tim Liu commented on HIVE-4272: +1 partition wise metadata does not work for text files Key: HIVE-4272 URL: https://issues.apache.org/jira/browse/HIVE-4272 Project: Hive Issue Type: Bug Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4272.1.patch, hive.4272.2.patch, hive.4272.2.patch-nohcat
The following test fails:

set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- This tests that the schema can be changed for binary serde data
create table partition_test_partitioned(key string, value string) partitioned by (dt string) stored as textfile;
insert overwrite table partition_test_partitioned partition(dt='1') select * from src where key = 238;
select * from partition_test_partitioned where dt is not null;
select key+key, value from partition_test_partitioned where dt is not null;
alter table partition_test_partitioned change key key int;
select key+key, value from partition_test_partitioned where dt is not null;
select * from partition_test_partitioned where dt is not null;

It works fine for an RCFile.
[jira] [Commented] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13620364#comment-13620364 ] Gang Tim Liu commented on HIVE-3959: rebase https://reviews.facebook.net/D9885 Update Partition Statistics in Metastore Layer -- Key: HIVE-3959 URL: https://issues.apache.org/jira/browse/HIVE-3959 Project: Hive Issue Type: Improvement Components: Metastore, Statistics Reporter: Bhushan Mandhani Assignee: Gang Tim Liu Priority: Minor When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4240) optimize hive.enforce.bucketing and hive.enforce sorting insert
[ https://issues.apache.org/jira/browse/HIVE-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13620604#comment-13620604 ] Gang Tim Liu commented on HIVE-4240: +1 optimize hive.enforce.bucketing and hive.enforce sorting insert --- Key: HIVE-4240 URL: https://issues.apache.org/jira/browse/HIVE-4240 Project: Hive Issue Type: Improvement Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4240.1.patch, hive.4240.2.patch, hive.4240.3.patch, hive.4240.4.patch, hive.4240.5.patch
Consider the following scenario:

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.exec.reducers.max = 1;
set hive.merge.mapfiles=false;
set hive.merge.mapredfiles=false;

-- Create two bucketed and sorted tables
CREATE TABLE test_table1 (key INT, value STRING) PARTITIONED BY (ds STRING) CLUSTERED BY (key) SORTED BY (key) INTO 2 BUCKETS;
CREATE TABLE test_table2 (key INT, value STRING) PARTITIONED BY (ds STRING) CLUSTERED BY (key) SORTED BY (key) INTO 2 BUCKETS;

FROM src INSERT OVERWRITE TABLE test_table1 PARTITION (ds = '1') SELECT *;

-- Insert data into the bucketed table by selecting from another bucketed table
-- This should be a map-only operation
INSERT OVERWRITE TABLE test_table2 PARTITION (ds = '1') SELECT a.key, a.value FROM test_table1 a WHERE a.ds = '1';

We should not need a reducer to perform the above operation.
[jira] [Commented] (HIVE-4270) bug in hive.map.groupby.sorted in the presence of multiple input partitions
[ https://issues.apache.org/jira/browse/HIVE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618918#comment-13618918 ] Gang Tim Liu commented on HIVE-4270: +1 bug in hive.map.groupby.sorted in the presence of multiple input partitions --- Key: HIVE-4270 URL: https://issues.apache.org/jira/browse/HIVE-4270 Project: Hive Issue Type: Bug Components: Query Processor Affects Versions: 0.11.0 Reporter: Namit Jain Assignee: Namit Jain Fix For: 0.11.0 Attachments: hive.4270.1.patch This can lead to wrong results. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HIVE-3959) Update Partition Statistics in Metastore Layer
[ https://issues.apache.org/jira/browse/HIVE-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu reassigned HIVE-3959: -- Assignee: Gang Tim Liu (was: Bhushan Mandhani) Update Partition Statistics in Metastore Layer -- Key: HIVE-3959 URL: https://issues.apache.org/jira/browse/HIVE-3959 Project: Hive Issue Type: Improvement Components: Metastore, Statistics Reporter: Bhushan Mandhani Assignee: Gang Tim Liu Priority: Minor When partitions are created using queries (insert overwrite and insert into) then the StatsTask updates all stats. However, when partitions are added directly through metadata-only partitions (either CLI or direct calls to Thrift Metastore) no stats are populated even if hive.stats.reliable is set to true. This puts us in a situation where we can't decide if stats are truly reliable or not. We propose that the fast stats (numFiles and totalSize) which don't require a scan of the data should always be populated and be completely reliable. For now we are still excluding rowCount and rawDataSize because that will make these operations very expensive. Currently they are quick metadata-only ops. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
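The "fast stats" proposed in the HIVE-3959 description (numFiles and totalSize) can be derived from a directory listing alone, without opening any data file. The following is a minimal sketch of that idea, not the patch attached to the issue. It assumes a partition maps to a single directory on a local filesystem and uses java.nio.file rather than Hadoop's FileSystem API; the names FastStatsSketch and fastStats are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class FastStatsSketch {
    /**
     * Returns {numFiles, totalSize} for the regular files directly inside
     * partitionDir. Only file metadata is consulted; no file content is read.
     * Subdirectories are not descended into (an assumption of this sketch).
     */
    public static long[] fastStats(Path partitionDir) throws IOException {
        long numFiles = 0;
        long totalSize = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(partitionDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    numFiles++;
                    totalSize += Files.size(entry);
                }
            }
        }
        return new long[] {numFiles, totalSize};
    }
}
```

Because the cost depends only on the number of directory entries, an operation that computes these two values can stay metadata-only, which is why the description treats rowCount and rawDataSize separately.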
[jira] [Commented] (HIVE-4159) RetryingHMSHandler doesn't retry in enough cases
[ https://issues.apache.org/jira/browse/HIVE-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616502#comment-13616502 ] Gang Tim Liu commented on HIVE-4159: +1 RetryingHMSHandler doesn't retry in enough cases Key: HIVE-4159 URL: https://issues.apache.org/jira/browse/HIVE-4159 Project: Hive Issue Type: Bug Components: Metastore Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Attachments: HIVE-4159.1.patch.txt HIVE-3524 introduced a change which caused JDOExceptions to be wrapped in MetaExceptions. This caused the RetryingHMSHandler to not retry on these exceptions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
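The HIVE-4159 description says that wrapping one exception type in another stopped the retry logic from recognizing it. One general way to make a retry check robust to wrapping is to walk the cause chain. The sketch below illustrates that technique only; it is not necessarily how HIVE-4159 was fixed. StoreException and WrapperException are hypothetical stand-ins for JDOException and MetaException, and the sketch assumes the wrapper keeps the original exception as its cause.

```java
public class RetrySketch {
    /** Hypothetical stand-in for a datastore-level exception. */
    public static class StoreException extends RuntimeException {
        public StoreException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for a wrapper that keeps the original as its cause. */
    public static class WrapperException extends Exception {
        public WrapperException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Returns true if t, or any exception in its cause chain, is a
     * StoreException. The depth limit guards against a cyclic cause chain.
     */
    public static boolean isRetriable(Throwable t) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < 16; c = c.getCause(), depth++) {
            if (c instanceof StoreException) {
                return true;
            }
        }
        return false;
    }
}
```

A check that only tests the outermost exception type would return false for a wrapped StoreException, which matches the symptom described in the issue.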
[jira] [Commented] (HIVE-4155) Expose ORC's FileDump as a service
[ https://issues.apache.org/jira/browse/HIVE-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616529#comment-13616529 ] Gang Tim Liu commented on HIVE-4155: +1 Expose ORC's FileDump as a service -- Key: HIVE-4155 URL: https://issues.apache.org/jira/browse/HIVE-4155 Project: Hive Issue Type: New Feature Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Attachments: HIVE-4155.1.patch.txt Expose ORC's FileDump class as a service similar to RC File Cat e.g. hive --orcfiledump path_to_file Should run FileDump on the file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4157) ORC runs out of heap when writing
[ https://issues.apache.org/jira/browse/HIVE-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616557#comment-13616557 ] Gang Tim Liu commented on HIVE-4157: +1 ORC runs out of heap when writing - Key: HIVE-4157 URL: https://issues.apache.org/jira/browse/HIVE-4157 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Attachments: HIVE-4157.1.patch.txt The OutStream class used by the ORC file format seems to aggressively allocate memory for ByteBuffers and doesn't seem too eager to give it back. This causes issues with heap space, particularly when wide tables or dynamic partitions are involved. As a first step to resolving this problem, the OutStream class can be modified to lazily allocate memory, and more actively make it available for garbage collection. Follow-ups could include checking the amount of free memory as part of determining if a spill is needed.
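The first step proposed in the HIVE-4157 description is lazy allocation plus early release of buffers. A minimal sketch of that pattern follows. It is not ORC's OutStream: LazyBufferSketch and its methods are hypothetical, and an in-memory ByteArrayOutputStream stands in for whatever the real stream spills to.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

public class LazyBufferSketch {
    private final int capacity;
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
    private ByteBuffer current; // stays null until the first write

    public LazyBufferSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Allocates the buffer only when a byte actually needs to be stored. */
    public void write(byte b) {
        if (current == null) {
            current = ByteBuffer.allocate(capacity);
        }
        current.put(b);
        if (!current.hasRemaining()) {
            flush();
        }
    }

    /** Copies buffered bytes to the sink and drops the buffer reference. */
    public void flush() {
        if (current != null) {
            sink.write(current.array(), 0, current.position());
            current = null; // the buffer is now eligible for garbage collection
        }
    }

    public boolean isAllocated() {
        return current != null;
    }

    public int bytesFlushed() {
        return sink.size();
    }
}
```

With many streams open at once, as with wide tables or dynamic partitions, streams that have not yet received data hold no buffer at all under this scheme.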
[jira] [Commented] (HIVE-4159) RetryingHMSHandler doesn't retry in enough cases
[ https://issues.apache.org/jira/browse/HIVE-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616821#comment-13616821 ] Gang Tim Liu commented on HIVE-4159: Committed. thanks Kevin. RetryingHMSHandler doesn't retry in enough cases Key: HIVE-4159 URL: https://issues.apache.org/jira/browse/HIVE-4159 Project: Hive Issue Type: Bug Components: Metastore Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Attachments: HIVE-4159.1.patch.txt HIVE-3524 introduced a change which caused JDOExceptions to be wrapped in MetaExceptions. This caused the RetryingHMSHandler to not retry on these exceptions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HIVE-4159) RetryingHMSHandler doesn't retry in enough cases
[ https://issues.apache.org/jira/browse/HIVE-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4159: --- Fix Version/s: 0.11.0 RetryingHMSHandler doesn't retry in enough cases Key: HIVE-4159 URL: https://issues.apache.org/jira/browse/HIVE-4159 Project: Hive Issue Type: Bug Components: Metastore Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Fix For: 0.11.0 Attachments: HIVE-4159.1.patch.txt HIVE-3524 introduced a change which caused JDOExceptions to be wrapped in MetaExceptions. This caused the RetryingHMSHandler to not retry on these exceptions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HIVE-4159) RetryingHMSHandler doesn't retry in enough cases
[ https://issues.apache.org/jira/browse/HIVE-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4159: --- Resolution: Fixed Status: Resolved (was: Patch Available) RetryingHMSHandler doesn't retry in enough cases Key: HIVE-4159 URL: https://issues.apache.org/jira/browse/HIVE-4159 Project: Hive Issue Type: Bug Components: Metastore Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Fix For: 0.11.0 Attachments: HIVE-4159.1.patch.txt HIVE-3524 introduced a change which caused JDOExceptions to be wrapped in MetaExceptions. This caused the RetryingHMSHandler to not retry on these exceptions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4155) Expose ORC's FileDump as a service
[ https://issues.apache.org/jira/browse/HIVE-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616828#comment-13616828 ] Gang Tim Liu commented on HIVE-4155: Committed. thanks Kevin Expose ORC's FileDump as a service -- Key: HIVE-4155 URL: https://issues.apache.org/jira/browse/HIVE-4155 Project: Hive Issue Type: New Feature Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Attachments: HIVE-4155.1.patch.txt Expose ORC's FileDump class as a service similar to RC File Cat e.g. hive --orcfiledump path_to_file Should run FileDump on the file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HIVE-4155) Expose ORC's FileDump as a service
[ https://issues.apache.org/jira/browse/HIVE-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4155: --- Resolution: Fixed Status: Resolved (was: Patch Available) Expose ORC's FileDump as a service -- Key: HIVE-4155 URL: https://issues.apache.org/jira/browse/HIVE-4155 Project: Hive Issue Type: New Feature Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Fix For: 0.11.0 Attachments: HIVE-4155.1.patch.txt Expose ORC's FileDump class as a service similar to RC File Cat e.g. hive --orcfiledump path_to_file Should run FileDump on the file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HIVE-4155) Expose ORC's FileDump as a service
[ https://issues.apache.org/jira/browse/HIVE-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4155: --- Fix Version/s: 0.11.0 Expose ORC's FileDump as a service -- Key: HIVE-4155 URL: https://issues.apache.org/jira/browse/HIVE-4155 Project: Hive Issue Type: New Feature Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Fix For: 0.11.0 Attachments: HIVE-4155.1.patch.txt Expose ORC's FileDump class as a service similar to RC File Cat e.g. hive --orcfiledump path_to_file Should run FileDump on the file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4157) ORC runs out of heap when writing
[ https://issues.apache.org/jira/browse/HIVE-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616898#comment-13616898 ] Gang Tim Liu commented on HIVE-4157: Committed. Thanks, Kevin. ORC runs out of heap when writing - Key: HIVE-4157 URL: https://issues.apache.org/jira/browse/HIVE-4157 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Attachments: HIVE-4157.1.patch.txt The OutStream class used by the ORC file format seems to aggressively allocate memory for ByteBuffers and doesn't seem too eager to give it back. This causes issues with heap space, particularly when wide tables or dynamic partitions are involved. As a first step to resolving this problem, the OutStream class can be modified to lazily allocate memory, and more actively make it available for garbage collection. Follow-ups could include checking the amount of free memory as part of determining if a spill is needed.
[jira] [Updated] (HIVE-4157) ORC runs out of heap when writing
[ https://issues.apache.org/jira/browse/HIVE-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4157: --- Resolution: Fixed Status: Resolved (was: Patch Available) ORC runs out of heap when writing - Key: HIVE-4157 URL: https://issues.apache.org/jira/browse/HIVE-4157 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Fix For: 0.11.0 Attachments: HIVE-4157.1.patch.txt The OutStream class used by the ORC file format seems to aggressively allocate memory for ByteBuffers and doesn't seem too eager to give it back. This causes issues with heap space, particularly when wide tables or dynamic partitions are involved. As a first step to resolving this problem, the OutStream class can be modified to lazily allocate memory, and more actively make it available for garbage collection. Follow-ups could include checking the amount of free memory as part of determining if a spill is needed.
[jira] [Updated] (HIVE-4157) ORC runs out of heap when writing
[ https://issues.apache.org/jira/browse/HIVE-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4157: --- Fix Version/s: 0.11.0 ORC runs out of heap when writing - Key: HIVE-4157 URL: https://issues.apache.org/jira/browse/HIVE-4157 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Fix For: 0.11.0 Attachments: HIVE-4157.1.patch.txt The OutStream class used by the ORC file format seems to aggressively allocate memory for ByteBuffers and doesn't seem too eager to give it back. This causes issues with heap space, particularly when wide tables or dynamic partitions are involved. As a first step to resolving this problem, the OutStream class can be modified to lazily allocate memory, and more actively make it available for garbage collection. Follow-ups could include checking the amount of free memory as part of determining if a spill is needed.
[jira] [Commented] (HIVE-4157) ORC runs out of heap when writing
[ https://issues.apache.org/jira/browse/HIVE-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616901#comment-13616901 ] Gang Tim Liu commented on HIVE-4157: Forgot to mention: tests passed. Sorry. ORC runs out of heap when writing - Key: HIVE-4157 URL: https://issues.apache.org/jira/browse/HIVE-4157 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Fix For: 0.11.0 Attachments: HIVE-4157.1.patch.txt The OutStream class used by the ORC file format seems to aggressively allocate memory for ByteBuffers and doesn't seem too eager to give it back. This causes issues with heap space, particularly when wide tables or dynamic partitions are involved. As a first step to resolving this problem, the OutStream class can be modified to lazily allocate memory, and more actively make it available for garbage collection. Follow-ups could include checking the amount of free memory as part of determining if a spill is needed.
[jira] [Commented] (HIVE-4159) RetryingHMSHandler doesn't retry in enough cases
[ https://issues.apache.org/jira/browse/HIVE-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616902#comment-13616902 ] Gang Tim Liu commented on HIVE-4159: Forgot to mention: tests passed. sorry RetryingHMSHandler doesn't retry in enough cases Key: HIVE-4159 URL: https://issues.apache.org/jira/browse/HIVE-4159 Project: Hive Issue Type: Bug Components: Metastore Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Fix For: 0.11.0 Attachments: HIVE-4159.1.patch.txt HIVE-3524 introduced a change which caused JDOExceptions to be wrapped in MetaExceptions. This caused the RetryingHMSHandler to not retry on these exceptions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4155) Expose ORC's FileDump as a service
[ https://issues.apache.org/jira/browse/HIVE-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616903#comment-13616903 ] Gang Tim Liu commented on HIVE-4155: Forgot to mention: tests passed. sorry Expose ORC's FileDump as a service -- Key: HIVE-4155 URL: https://issues.apache.org/jira/browse/HIVE-4155 Project: Hive Issue Type: New Feature Affects Versions: 0.11.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Fix For: 0.11.0 Attachments: HIVE-4155.1.patch.txt Expose ORC's FileDump class as a service similar to RC File Cat e.g. hive --orcfiledump path_to_file Should run FileDump on the file. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4235) CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists
[ https://issues.apache.org/jira/browse/HIVE-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616964#comment-13616964 ] Gang Tim Liu commented on HIVE-4235: Kevin, thank you very much. Tim CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists Key: HIVE-4235 URL: https://issues.apache.org/jira/browse/HIVE-4235 Project: Hive Issue Type: Bug Components: JDBC, Query Processor, SQL Reporter: Gang Tim Liu Assignee: Gang Tim Liu Fix For: 0.11.0 Attachments: HIVE-4235.patch.1 CREATE TABLE IF NOT EXISTS uses an inefficient way to check whether the table exists. It calls Hive.java's getTablesByPattern(...), which involves a regular expression and eventually a database join. This is very inefficient: it can increase database lock time and hurt DB performance if many such commands hit the database. The suggested approach is to use getTable(...), since we already know the table name.
[jira] [Commented] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13615329#comment-13615329 ] Gang Tim Liu commented on HIVE-3958: Namit, thank you very much. support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Fix For: 0.11.0 Attachments: HIVE-3958.patch.1, HIVE-3958.patch.2, HIVE-3958.patch.3, HIVE-3958.patch.4, HIVE-3958.patch.5, HIVE-3958.patch.6 The analyze command allows us to collect statistics on existing tables/partitions. It works great but might be slow since it scans all files. There are two ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case; HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This reads only part of each file to get the file metadata rather than scanning all of its content. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HBase's HFile. This JIRA targets #2, specifically the RCFile format.
[jira] [Updated] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3958: --- Attachment: HIVE-3958.patch.5 support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1, HIVE-3958.patch.2, HIVE-3958.patch.3, HIVE-3958.patch.4, HIVE-3958.patch.5 The analyze command allows us to collect statistics on existing tables/partitions. It works great but might be slow since it scans all files. There are two ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case; HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This reads only part of each file to get the file metadata rather than scanning all of its content. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HBase's HFile. This JIRA targets #2, specifically the RCFile format.
[jira] [Work stopped] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HIVE-3958 stopped by Gang Tim Liu. support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1, HIVE-3958.patch.2, HIVE-3958.patch.3, HIVE-3958.patch.4, HIVE-3958.patch.5 The analyze command allows us to collect statistics on existing tables/partitions. It works great but might be slow since it scans all files. There are two ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case; HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This reads only part of each file to get the file metadata rather than scanning all of its content. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HBase's HFile. This JIRA targets #2, specifically the RCFile format.
[jira] [Updated] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3958: --- Status: Patch Available (was: In Progress) Another diff is ready. Thanks. support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1, HIVE-3958.patch.2, HIVE-3958.patch.3, HIVE-3958.patch.4, HIVE-3958.patch.5 The analyze command allows us to collect statistics on existing tables/partitions. It works great but might be slow since it scans all files. There are two ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case; HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This reads only part of each file to get the file metadata rather than scanning all of its content. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HBase's HFile. This JIRA targets #2, specifically the RCFile format.
[jira] [Work started] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HIVE-3958 started by Gang Tim Liu. support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1, HIVE-3958.patch.2, HIVE-3958.patch.3, HIVE-3958.patch.4, HIVE-3958.patch.5 The analyze command allows us to collect statistics on existing tables/partitions. It works great but might be slow since it scans all files. There are two ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case; HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This reads only part of each file to get the file metadata rather than scanning all of its content. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HBase's HFile. This JIRA targets #2, specifically the RCFile format.
[jira] [Updated] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3958: --- Attachment: HIVE-3958.patch.6 support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1, HIVE-3958.patch.2, HIVE-3958.patch.3, HIVE-3958.patch.4, HIVE-3958.patch.5, HIVE-3958.patch.6 The analyze command allows us to collect statistics on existing tables/partitions. It works well but can be slow since it scans all files. There are 2 ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case. HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This doesn't scan the full content of the files, only the part needed to get file metadata. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HFile of HBase. This jira targets #2, specifically the RCFile format.
[jira] [Created] (HIVE-4235) CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists
Gang Tim Liu created HIVE-4235: -- Summary: CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists Key: HIVE-4235 URL: https://issues.apache.org/jira/browse/HIVE-4235 Project: Hive Issue Type: Bug Components: JDBC, Query Processor, SQL Reporter: Gang Tim Liu Assignee: Gang Tim Liu CREATE TABLE IF NOT EXISTS uses an inefficient way to check whether the table exists. It uses Hive.java's getTablesByPattern(...) for the check, which involves a regular expression and eventually a database join. This is very inefficient. It may increase database lock time and hurt db performance if a lot of such commands hit the database. The suggested approach is to use getTable(...), since we already know the table name.
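The difference between the two existence checks can be illustrated with a toy model. This is only a sketch in Python against a hypothetical in-memory catalog: `ToyCatalog`, its methods, and the `rows_examined` counter are made up for illustration and are not Hive's metastore API. It shows why matching every table name against a pattern costs more than a single keyed lookup when the name is already known.

```python
import re


class ToyCatalog:
    """Hypothetical in-memory stand-in for a metastore; not Hive's API."""

    def __init__(self, table_names):
        self._tables = {name: {"name": name} for name in table_names}
        self.rows_examined = 0  # crude cost counter for the comparison

    def get_tables_by_pattern(self, pattern):
        # Pattern lookup: every table name is tested against the regex.
        regex = re.compile(pattern)
        matches = []
        for name in self._tables:
            self.rows_examined += 1
            if regex.fullmatch(name):
                matches.append(name)
        return matches

    def get_table(self, name):
        # Direct lookup: a single keyed access, no scan.
        self.rows_examined += 1
        return self._tables.get(name)


def exists_via_pattern(catalog, name):
    return name in catalog.get_tables_by_pattern(re.escape(name))


def exists_via_get_table(catalog, name):
    return catalog.get_table(name) is not None


catalog = ToyCatalog(f"t{i}" for i in range(1000))

exists_via_pattern(catalog, "t42")
pattern_cost = catalog.rows_examined

catalog.rows_examined = 0
exists_via_get_table(catalog, "t42")
lookup_cost = catalog.rows_examined

print(pattern_cost, lookup_cost)  # 1000 1
```

In a real metastore the cost shows up as regular-expression filtering and joins on the backing database rather than a Python loop, but the shape of the trade-off is the same.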
[jira] [Updated] (HIVE-4235) CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists
[ https://issues.apache.org/jira/browse/HIVE-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4235: --- Description: CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists. It uses Hive.java's getTablesByPattern(...) to check if table exists. It involves regular expression and eventually database join. Very inefficient. It can cause database lock time increase and hurt db performance if a lot of such commands hit database. The suggested approach is to use getTable(...) since we know tablename already was: CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists. It uses Hive.java's getTablesByPattern(...) to check if table exists. It involves regular expression and eventually database join. Very inefficient. May cause database lock time increases and hurt db performance if a lot of such commands hit database. The suggested approach is to use getTable(...) since we know tablename already CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists Key: HIVE-4235 URL: https://issues.apache.org/jira/browse/HIVE-4235 Project: Hive Issue Type: Bug Components: JDBC, Query Processor, SQL Reporter: Gang Tim Liu Assignee: Gang Tim Liu CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists. It uses Hive.java's getTablesByPattern(...) to check if table exists. It involves regular expression and eventually database join. Very inefficient. It can cause database lock time increase and hurt db performance if a lot of such commands hit database. The suggested approach is to use getTable(...) since we know tablename already
[jira] [Work started] (HIVE-4235) CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists
[ https://issues.apache.org/jira/browse/HIVE-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HIVE-4235 started by Gang Tim Liu. CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists Key: HIVE-4235 URL: https://issues.apache.org/jira/browse/HIVE-4235 Project: Hive Issue Type: Bug Components: JDBC, Query Processor, SQL Reporter: Gang Tim Liu Assignee: Gang Tim Liu CREATE TABLE IF NOT EXISTS uses an inefficient way to check whether the table exists. It uses Hive.java's getTablesByPattern(...) for the check, which involves a regular expression and eventually a database join. This is very inefficient. It can increase database lock time and hurt db performance if a lot of such commands hit the database. The suggested approach is to use getTable(...), since we already know the table name.
[jira] [Commented] (HIVE-4235) CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists
[ https://issues.apache.org/jira/browse/HIVE-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614649#comment-13614649 ] Gang Tim Liu commented on HIVE-4235: https://reviews.facebook.net/D9729 CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists Key: HIVE-4235 URL: https://issues.apache.org/jira/browse/HIVE-4235 Project: Hive Issue Type: Bug Components: JDBC, Query Processor, SQL Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-4235.patch.1 CREATE TABLE IF NOT EXISTS uses an inefficient way to check whether the table exists. It uses Hive.java's getTablesByPattern(...) for the check, which involves a regular expression and eventually a database join. This is very inefficient. It can increase database lock time and hurt db performance if a lot of such commands hit the database. The suggested approach is to use getTable(...), since we already know the table name.
[jira] [Updated] (HIVE-4235) CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists
[ https://issues.apache.org/jira/browse/HIVE-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4235: --- Attachment: HIVE-4235.patch.1 CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists Key: HIVE-4235 URL: https://issues.apache.org/jira/browse/HIVE-4235 Project: Hive Issue Type: Bug Components: JDBC, Query Processor, SQL Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-4235.patch.1 CREATE TABLE IF NOT EXISTS uses an inefficient way to check whether the table exists. It uses Hive.java's getTablesByPattern(...) for the check, which involves a regular expression and eventually a database join. This is very inefficient. It can increase database lock time and hurt db performance if a lot of such commands hit the database. The suggested approach is to use getTable(...), since we already know the table name.
[jira] [Updated] (HIVE-4235) CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists
[ https://issues.apache.org/jira/browse/HIVE-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4235: --- Status: Patch Available (was: In Progress) Diff is ready. CREATE TABLE IF NOT EXISTS uses inefficient way to check if table exists Key: HIVE-4235 URL: https://issues.apache.org/jira/browse/HIVE-4235 Project: Hive Issue Type: Bug Components: JDBC, Query Processor, SQL Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-4235.patch.1 CREATE TABLE IF NOT EXISTS uses an inefficient way to check whether the table exists. It uses Hive.java's getTablesByPattern(...) for the check, which involves a regular expression and eventually a database join. This is very inefficient. It can increase database lock time and hurt db performance if a lot of such commands hit the database. The suggested approach is to use getTable(...), since we already know the table name.
[jira] [Commented] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612824#comment-13612824 ] Gang Tim Liu commented on HIVE-3958: New diff is ready. Thanks. support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1, HIVE-3958.patch.2, HIVE-3958.patch.3, HIVE-3958.patch.4 The analyze command allows us to collect statistics on existing tables/partitions. It works well but can be slow since it scans all files. There are 2 ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case. HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This doesn't scan the full content of the files, only the part needed to get file metadata. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HFile of HBase. This jira targets #2, specifically the RCFile format.
[jira] [Updated] (HIVE-4219) explain dependency does not capture the input table
[ https://issues.apache.org/jira/browse/HIVE-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4219: --- Attachment: hive.4219.3.patch explain dependency does not capture the input table --- Key: HIVE-4219 URL: https://issues.apache.org/jira/browse/HIVE-4219 Project: Hive Issue Type: Bug Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4219.1.patch, hive.4219.2.patch, hive.4219.3.patch
hive> explain dependency select * from srcpart where ds is not null;
OK
{"input_partitions":[{"partitionName":"default@srcpart@ds=2008-04-08/hr=11"},{"partitionName":"default@srcpart@ds=2008-04-08/hr=12"},{"partitionName":"default@srcpart@ds=2008-04-09/hr=11"},{"partitionName":"default@srcpart@ds=2008-04-09/hr=12"}],"input_tables":[]}
input_tables should contain srcpart
[jira] [Work started] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HIVE-3958 started by Gang Tim Liu. support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1, HIVE-3958.patch.2, HIVE-3958.patch.3 The analyze command allows us to collect statistics on existing tables/partitions. It works well but can be slow since it scans all files. There are 2 ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case. HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This doesn't scan the full content of the files, only the part needed to get file metadata. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HFile of HBase. This jira targets #2, specifically the RCFile format.
[jira] [Updated] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3958: --- Attachment: HIVE-3958.patch.3 support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1, HIVE-3958.patch.2, HIVE-3958.patch.3 The analyze command allows us to collect statistics on existing tables/partitions. It works well but can be slow since it scans all files. There are 2 ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case. HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This doesn't scan the full content of the files, only the part needed to get file metadata. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HFile of HBase. This jira targets #2, specifically the RCFile format.
[jira] [Updated] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3958: --- Status: Patch Available (was: In Progress) Another diff is ready for review. support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1, HIVE-3958.patch.2, HIVE-3958.patch.3 The analyze command allows us to collect statistics on existing tables/partitions. It works well but can be slow since it scans all files. There are 2 ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case. HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This doesn't scan the full content of the files, only the part needed to get file metadata. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HFile of HBase. This jira targets #2, specifically the RCFile format.
[jira] [Commented] (HIVE-4219) explain dependency does not capture the input table
[ https://issues.apache.org/jira/browse/HIVE-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611047#comment-13611047 ] Gang Tim Liu commented on HIVE-4219: +1 explain dependency does not capture the input table --- Key: HIVE-4219 URL: https://issues.apache.org/jira/browse/HIVE-4219 Project: Hive Issue Type: Bug Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4219.1.patch, hive.4219.2.patch
hive> explain dependency select * from srcpart where ds is not null;
OK
{"input_partitions":[{"partitionName":"default@srcpart@ds=2008-04-08/hr=11"},{"partitionName":"default@srcpart@ds=2008-04-08/hr=12"},{"partitionName":"default@srcpart@ds=2008-04-09/hr=11"},{"partitionName":"default@srcpart@ds=2008-04-09/hr=12"}],"input_tables":[]}
input_tables should contain srcpart
[jira] [Commented] (HIVE-4206) Sort merge join does not work for outer joins for 7 inputs
[ https://issues.apache.org/jira/browse/HIVE-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13609121#comment-13609121 ] Gang Tim Liu commented on HIVE-4206: +1 Sort merge join does not work for outer joins for 7 inputs -- Key: HIVE-4206 URL: https://issues.apache.org/jira/browse/HIVE-4206 Project: Hive Issue Type: Improvement Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4206.1.patch, hive.4206.2.patch
[jira] [Commented] (HIVE-4213) List bucketing error too restrictive
[ https://issues.apache.org/jira/browse/HIVE-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13609586#comment-13609586 ] Gang Tim Liu commented on HIVE-4213: [~mgrover] I am a little confused; please correct me if I am wrong. The current logic is not restrictive. For example, the following case is legal:
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
set hive.optimize.listbucketing=false;
List bucketing error too restrictive Key: HIVE-4213 URL: https://issues.apache.org/jira/browse/HIVE-4213 Project: Hive Issue Type: Bug Affects Versions: 0.10.0 Reporter: Mark Grover Fix For: 0.11.0
With the introduction of List bucketing, we introduced a config validation step where we say:
{code}
SUPPORT_DIR_MUST_TRUE_FOR_LIST_BUCKETING(
    10199,
    "hive.mapred.supports.subdirectories must be true"
    + " if any one of following is true: hive.internal.ddl.list.bucketing.enable,"
    + " hive.optimize.listbucketing and mapred.input.dir.recursive"),
{code}
This seems overly restrictive to me because there are use cases where people may want to set {{mapred.input.dir.recursive}} to {{true}} even when they don't care about list bucketing. Is that not true?
For example, here is the unit test code for {{clientpositive/recursive_dir.q}}
{code}
CREATE TABLE fact_daily(x int) PARTITIONED BY (ds STRING);
CREATE TABLE fact_tz(x int) PARTITIONED BY (ds STRING, hr STRING) LOCATION 'pfile:${system:test.tmp.dir}/fact_tz';
INSERT OVERWRITE TABLE fact_tz PARTITION (ds='1', hr='1') SELECT key+11 FROM src WHERE key=484;
ALTER TABLE fact_daily SET TBLPROPERTIES('EXTERNAL'='TRUE');
ALTER TABLE fact_daily ADD PARTITION (ds='1') LOCATION 'pfile:${system:test.tmp.dir}/fact_tz/ds=1';
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SELECT * FROM fact_daily WHERE ds='1';
SELECT count(1) FROM fact_daily WHERE ds='1';
{code}
The unit test doesn't seem to be concerned about list bucketing but wants to set {{mapred.input.dir.recursive}} to {{true}}.
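The rule in the quoted error message can be written out as a small predicate. The Python below is a sketch of the rule as the message states it, not Hive's actual validation code; the function name and the dict-based configuration are assumptions made for illustration.

```python
def validate_list_bucketing_conf(conf):
    """Return an error string if the quoted rule is violated, else None.

    `conf` is a plain dict of property name -> bool (hypothetical
    representation); a missing key is treated as False.
    """
    supports_subdirs = conf.get("hive.mapred.supports.subdirectories", False)
    dependents = (
        "hive.internal.ddl.list.bucketing.enable",
        "hive.optimize.listbucketing",
        "mapred.input.dir.recursive",
    )
    enabled = [name for name in dependents if conf.get(name, False)]
    if enabled and not supports_subdirs:
        return (
            "hive.mapred.supports.subdirectories must be true because "
            + ", ".join(enabled)
            + " is set"
        )
    return None


# The combination from the comment: recursion on, list bucketing off.
legal = {
    "hive.mapred.supports.subdirectories": True,
    "mapred.input.dir.recursive": True,
    "hive.optimize.listbucketing": False,
}
print(validate_list_bucketing_conf(legal))  # None

# Recursion on without subdirectory support trips the check.
illegal = {"mapred.input.dir.recursive": True}
print(validate_list_bucketing_conf(illegal) is not None)  # True
```

Read this way, the check never requires list bucketing to be on; it only requires subdirectory support whenever recursion or list bucketing is on, which matches the point made in the comment.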
[jira] [Commented] (HIVE-4146) bug with hive.auto.convert.join.noconditionaltask with outer joins
[ https://issues.apache.org/jira/browse/HIVE-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607672#comment-13607672 ] Gang Tim Liu commented on HIVE-4146: +1 bug with hive.auto.convert.join.noconditionaltask with outer joins -- Key: HIVE-4146 URL: https://issues.apache.org/jira/browse/HIVE-4146 Project: Hive Issue Type: Bug Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4146.1.patch, hive.4146.2.patch, hive.4146.3.patch, hive.4146.4.patch, hive.4146.5.patch, hive.4146.6.patch Consider the following scenario:
create table s1 as select * from src where key = 0;
set hive.auto.convert.join.noconditionaltask=false;
SELECT * FROM s1 src1 LEFT OUTER JOIN s1 src2 ON (src1.key = src2.key AND src2.key > 10);
gives correct results
0 val_0 NULL NULL
0 val_0 NULL NULL
0 val_0 NULL NULL
whereas it gives no results with hive.auto.convert.join.noconditionaltask set to true
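Why three NULL-padded rows are the correct answer can be checked with a tiny nested-loop join. This is a sketch in Python, not Hive's join operator. It takes the second ON condition to be `src2.key > 10` (an assumption about the comparison operator) and assumes `s1` holds three rows with key 0, which is what the three result rows suggest.

```python
def left_outer_join(left, right, on, right_width):
    """Nested-loop LEFT OUTER JOIN over lists of tuples."""
    result = []
    for l_row in left:
        matches = [r_row for r_row in right if on(l_row, r_row)]
        if matches:
            result.extend(l_row + r_row for r_row in matches)
        else:
            # No right row satisfied the ON clause: keep the left row
            # and pad the right-hand columns with NULLs.
            result.append(l_row + (None,) * right_width)
    return result


# Assumed contents of s1: three (key, value) rows with key 0.
s1 = [(0, "val_0")] * 3


def on_clause(src1, src2):
    # ON (src1.key = src2.key AND src2.key > 10), operator assumed.
    return src1[0] == src2[0] and src2[0] > 10


rows = left_outer_join(s1, s1, on_clause, right_width=2)
for row in rows:
    print(row)
```

Because the filter on `src2` sits in the ON clause rather than in a WHERE clause, it can only prevent matches; it can never drop rows of the preserved left side. An empty result therefore indicates a bug.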
[jira] [Commented] (HIVE-4146) bug with hive.auto.convert.join.noconditionaltask with outer joins
[ https://issues.apache.org/jira/browse/HIVE-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607670#comment-13607670 ] Gang Tim Liu commented on HIVE-4146: The comment is a false positive. bug with hive.auto.convert.join.noconditionaltask with outer joins -- Key: HIVE-4146 URL: https://issues.apache.org/jira/browse/HIVE-4146 Project: Hive Issue Type: Bug Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4146.1.patch, hive.4146.2.patch, hive.4146.3.patch, hive.4146.4.patch, hive.4146.5.patch, hive.4146.6.patch Consider the following scenario:
create table s1 as select * from src where key = 0;
set hive.auto.convert.join.noconditionaltask=false;
SELECT * FROM s1 src1 LEFT OUTER JOIN s1 src2 ON (src1.key = src2.key AND src2.key > 10);
gives correct results
0 val_0 NULL NULL
0 val_0 NULL NULL
0 val_0 NULL NULL
whereas it gives no results with hive.auto.convert.join.noconditionaltask set to true
[jira] [Commented] (HIVE-4146) bug with hive.auto.convert.join.noconditionaltask with outer joins
[ https://issues.apache.org/jira/browse/HIVE-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607307#comment-13607307 ] Gang Tim Liu commented on HIVE-4146: A very small comment in D9327. bug with hive.auto.convert.join.noconditionaltask with outer joins -- Key: HIVE-4146 URL: https://issues.apache.org/jira/browse/HIVE-4146 Project: Hive Issue Type: Bug Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.4146.1.patch, hive.4146.2.patch, hive.4146.3.patch, hive.4146.4.patch, hive.4146.5.patch, hive.4146.6.patch Consider the following scenario:
create table s1 as select * from src where key = 0;
set hive.auto.convert.join.noconditionaltask=false;
SELECT * FROM s1 src1 LEFT OUTER JOIN s1 src2 ON (src1.key = src2.key AND src2.key > 10);
gives correct results
0 val_0 NULL NULL
0 val_0 NULL NULL
0 val_0 NULL NULL
whereas it gives no results with hive.auto.convert.join.noconditionaltask set to true
[jira] [Updated] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3958: --- Attachment: HIVE-3958.patch.2 support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1, HIVE-3958.patch.2 The analyze command allows us to collect statistics on existing tables/partitions. It works well but can be slow since it scans all files. There are 2 ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case. HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This doesn't scan the full content of the files, only the part needed to get file metadata. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HFile of HBase. This jira targets #2, specifically the RCFile format.
[jira] [Commented] (HIVE-4145) Create hcatalog stub directory and add it to the build
[ https://issues.apache.org/jira/browse/HIVE-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603604#comment-13603604 ] Gang Tim Liu commented on HIVE-4145: +1 Create hcatalog stub directory and add it to the build -- Key: HIVE-4145 URL: https://issues.apache.org/jira/browse/HIVE-4145 Project: Hive Issue Type: Task Components: Build Infrastructure Reporter: Carl Steinbach Assignee: Carl Steinbach Attachments: HIVE-4145.1.patch.txt Alan has requested that we create a directory for hcatalog and give the HCatalog submodule committers karma on it.
[jira] [Updated] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3958: --- Attachment: HIVE-3958.patch.1 support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1 The analyze command allows us to collect statistics on existing tables/partitions. It works well but can be slow since it scans all files. There are 2 ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case. HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This doesn't scan the full content of the files, only the part needed to get file metadata. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HFile of HBase. This jira targets #2, specifically the RCFile format.
[jira] [Updated] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3958: --- Status: Patch Available (was: In Progress) support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3958.patch.1 The analyze command allows us to collect statistics on existing tables/partitions. It works well but can be slow since it scans all files. There are 2 ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case. HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This doesn't scan the full content of the files, only the part needed to get file metadata. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HFile of HBase. This jira targets #2, specifically the RCFile format.
[jira] [Updated] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3958: --- Summary: support partial scan for analyze command - RCFile (was: support partial scan for analyze command) support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu The analyze command allows us to collect statistics on existing tables/partitions. It works well but can be slow since it scans all files. There are 2 ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case. HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This doesn't scan the full content of the files, only the part needed to get file metadata. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HFile of HBase. This jira targets #2, specifically the RCFile format.
[jira] [Created] (HIVE-4177) support partial scan for analyze command - ORC
Gang Tim Liu created HIVE-4177: -- Summary: support partial scan for analyze command - ORC Key: HIVE-4177 URL: https://issues.apache.org/jira/browse/HIVE-4177 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu This is follow up on hive 3958. This jira will focus on ORC format
[jira] [Commented] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602827#comment-13602827 ] Gang Tim Liu commented on HIVE-3958: Submitted a follow-up, HIVE-4177, which focuses on ORC. support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu The analyze command allows us to collect statistics on existing tables/partitions. It works well but can be slow since it scans all files. There are 2 ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case. HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This doesn't scan the full content of the files, only the part needed to get file metadata. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HFile of HBase. This jira targets #2, specifically the RCFile format.
[jira] [Commented] (HIVE-3958) support partial scan for analyze command - RCFile
[ https://issues.apache.org/jira/browse/HIVE-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602828#comment-13602828 ] Gang Tim Liu commented on HIVE-3958: Initial draft: https://reviews.facebook.net/D9417 support partial scan for analyze command - RCFile - Key: HIVE-3958 URL: https://issues.apache.org/jira/browse/HIVE-3958 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu The analyze command allows us to collect statistics on existing tables/partitions. It works well but can be slow since it scans all files. There are 2 ways to speed it up: 1. Collect stats without a file scan. This may not collect all stats, but it is good and fast enough for the use case. HIVE-3917 addresses it. 2. Collect stats via a partial file scan. This doesn't scan the full content of the files, only the part needed to get file metadata. Some examples are https://cwiki.apache.org/Hive/rcfilecat.html for RCFile, ORC (HIVE-3874), and HFile of HBase. This jira targets #2, specifically the RCFile format.
[jira] [Updated] (HIVE-4177) support partial scan for analyze command - ORC
[ https://issues.apache.org/jira/browse/HIVE-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4177: --- Description: This is follow up on Hive-3958. This jira will focus on ORC format HIVE-3874 was: This is follow up on hive 3958. This jira will focus on ORC format support partial scan for analyze command - ORC -- Key: HIVE-4177 URL: https://issues.apache.org/jira/browse/HIVE-4177 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu This is follow up on Hive-3958. This jira will focus on ORC format HIVE-3874
[jira] [Updated] (HIVE-4177) support partial scan for analyze command - ORC
[ https://issues.apache.org/jira/browse/HIVE-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-4177: --- Description: This is follow up on HIVE-3958. This jira will focus on ORC format HIVE-3874 was: This is follow up on Hive-3958. This jira will focus on ORC format HIVE-3874 support partial scan for analyze command - ORC -- Key: HIVE-4177 URL: https://issues.apache.org/jira/browse/HIVE-4177 Project: Hive Issue Type: Improvement Reporter: Gang Tim Liu Assignee: Gang Tim Liu This is follow up on HIVE-3958. This jira will focus on ORC format HIVE-3874
[jira] [Assigned] (HIVE-4150) optimize queries like 'select count(1) from T where conditions on partition columns'
[ https://issues.apache.org/jira/browse/HIVE-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu reassigned HIVE-4150: -- Assignee: Gang Tim Liu optimize queries like 'select count(1) from T where conditions on partition columns' -- Key: HIVE-4150 URL: https://issues.apache.org/jira/browse/HIVE-4150 Project: Hive Issue Type: Improvement Components: Query Processor Reporter: Namit Jain Assignee: Gang Tim Liu If accurate stats are available in the metastore, they should be used to optimize the above query.
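The idea in HIVE-4150 can be sketched roughly as follows: if every partition that satisfies the partition-column conditions has an accurate row count, the count can be summed from stats instead of scanning data. This is only an illustration under stated assumptions. The function `count_from_stats` and the fields `spec`, `row_count`, and `accurate` are hypothetical names, not the metastore's schema or Hive's optimizer code.

```python
def count_from_stats(partitions, predicate):
    """Answer count(1) from per-partition stats.

    Returns None when any matching partition lacks accurate stats,
    which signals that the query must actually be executed.
    """
    total = 0
    for part in partitions:
        if not predicate(part["spec"]):
            continue  # pruned by the conditions on partition columns
        if not part.get("accurate") or part.get("row_count") is None:
            return None  # stats missing or unreliable, fall back to a scan
        total += part["row_count"]
    return total


# Hypothetical metastore contents for a table partitioned by ds.
partitions = [
    {"spec": {"ds": "2013-03-01"}, "row_count": 100, "accurate": True},
    {"spec": {"ds": "2013-03-02"}, "row_count": 250, "accurate": True},
    {"spec": {"ds": "2013-03-03"}, "row_count": None, "accurate": False},
]

# Only partitions with accurate stats match, so stats alone suffice.
answered = count_from_stats(partitions, lambda s: s["ds"] <= "2013-03-02")

# A matching partition has no reliable stats, so a scan is required.
needs_scan = count_from_stats(partitions, lambda s: True)
```

Returning None rather than a partial sum keeps the result exact: the shortcut applies only when stats for all matching partitions are known to be accurate.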