[jira] [Created] (HIVE-16158) Correct mistake in documentation for ALTER TABLE … ADD/REPLACE COLUMNS CASCADE
Illya Yalovyy created HIVE-16158:
------------------------------------

Summary: Correct mistake in documentation for ALTER TABLE … ADD/REPLACE COLUMNS CASCADE
Key: HIVE-16158
URL: https://issues.apache.org/jira/browse/HIVE-16158
Project: Hive
Issue Type: Bug
Components: Documentation
Affects Versions: 1.0.0
Reporter: Illya Yalovyy

The current documentation says that the CASCADE keyword was introduced in the Hive 0.15 release. That information is incorrect and confuses users: the feature was actually released in Hive 1.1.0 (HIVE-8839).

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Created] (HIVE-15076) Improve scalability of LDAP authentication provider group filter
Illya Yalovyy created HIVE-15076:
------------------------------------

Summary: Improve scalability of LDAP authentication provider group filter
Key: HIVE-15076
URL: https://issues.apache.org/jira/browse/HIVE-15076
Project: Hive
Issue Type: Improvement
Components: Authentication
Affects Versions: 2.1.0
Reporter: Illya Yalovyy
Assignee: Illya Yalovyy

The current implementation uses the following algorithm:
# For a given user, find all groups that user is a member of (a list of LDAP groups is constructed as a result of that request).
# Match this list of groups against the provided group filter.

The time/memory complexity of this approach is O(N) on the client side, where N is the number of groups the user has membership in. On a large directory (800+ groups per user) we can observe up to 2x performance degradation, as well as failures caused by the size of the LDAP response (LDAP: error code 4 - Sizelimit Exceeded).

Some directory services (Microsoft Active Directory, for instance) provide a virtual attribute on the User object that contains the list of groups the user belongs to. This attribute can be used to quickly determine whether the user passes or fails the group filter.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
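The difference between the two strategies can be sketched outside of Hive. The sketch below is purely illustrative (the directory is an in-memory dict, and "memberOf" stands in for Active Directory's virtual attribute); it is not the LdapAuthenticationProviderImpl code:

```python
# Sketch only -- a fake in-memory "directory", not a real LDAP server.
# In real LDAP the second strategy would be expressed as a server-side
# search filter such as (&(uid=user1)(memberOf=cn=HiveUsers,...)), so the
# full group list never crosses the wire.

DIRECTORY = {
    "uid=user1,ou=People,dc=example,dc=com": {
        "memberOf": [
            "cn=HiveUsers,ou=Groups,dc=example,dc=com",
            "cn=Admins,ou=Groups,dc=example,dc=com",
        ],
    },
}

def passes_filter_via_group_search(user_dn, group_filter):
    # Current approach: fetch ALL groups the user belongs to (response size
    # grows with N and can hit "Sizelimit Exceeded"), then intersect
    # client-side with the configured filter.
    groups = DIRECTORY[user_dn]["memberOf"]
    names = {g.split(",")[0].split("=")[1] for g in groups}
    return bool(names & set(group_filter))

def passes_filter_via_member_attribute(user_dn, group_filter):
    # Proposed approach: check membership per filter entry against the
    # user's group-membership attribute -- one comparison per filter group,
    # no full group listing on the client.
    wanted = {"cn=" + name for name in group_filter}
    return any(g.split(",")[0] in wanted
               for g in DIRECTORY[user_dn]["memberOf"])
```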
[jira] [Created] (HIVE-14927) Remove code duplication from tests in TestLdapAtnProviderWithMiniDS
Illya Yalovyy created HIVE-14927:
------------------------------------

Summary: Remove code duplication from tests in TestLdapAtnProviderWithMiniDS
Key: HIVE-14927
URL: https://issues.apache.org/jira/browse/HIVE-14927
Project: Hive
Issue Type: Improvement
Components: Test
Reporter: Illya Yalovyy
Assignee: Illya Yalovyy

* Extract inner class User and implement a proper builder for it.
* Extract all common code to an LdapAuthenticationTestCase class:
** setting up the test case
** executing the test case
** result validation

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
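The "proper builder" idea above can be sketched as follows. The field names (dn, id, password) are hypothetical, chosen only to illustrate the pattern; the actual fixture class lives in the Java test code:

```python
class User:
    """Illustrative test fixture for an LDAP user (hypothetical shape)."""

    def __init__(self, dn, uid, password):
        self.dn = dn
        self.id = uid
        self.password = password

    class Builder:
        """Fluent builder so each test states only the fields it cares
        about, and incomplete fixtures fail fast at build() time rather
        than deep inside an authentication assertion."""

        def __init__(self):
            self._dn = None
            self._id = None
            self._password = None

        def dn(self, dn):
            self._dn = dn
            return self

        def id(self, uid):
            self._id = uid
            return self

        def password(self, password):
            self._password = password
            return self

        def build(self):
            if not all([self._dn, self._id, self._password]):
                raise ValueError("dn, id and password are all required")
            return User(self._dn, self._id, self._password)
```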
[jira] [Created] (HIVE-14875) Enhancement and refactoring of TestLdapAtnProviderWithMiniDS
Illya Yalovyy created HIVE-14875:
------------------------------------

Summary: Enhancement and refactoring of TestLdapAtnProviderWithMiniDS
Key: HIVE-14875
URL: https://issues.apache.org/jira/browse/HIVE-14875
Project: Hive
Issue Type: Test
Components: Authentication, Tests
Reporter: Illya Yalovyy
Assignee: Illya Yalovyy

This makes the following enhancements to TestLdapAtnProviderWithMiniDS:
* Extract defined ldifs to a resource file.
* Remove unneeded attributes defined in each ldif entry, such as:
** sn (Surname) and givenName from group entries
** distinguishedName from all entries, as this attribute serves more as a parent type of many other attributes
* Remove setting ExtensibleObject as an objectClass for all ldap entries, as it is not needed; this objectClass would allow adding any attribute to an entry.
* Add the missing uid attribute to group entries whose dn refers to a uid attribute.
* Add the missing uidObject objectClass to entries that have the uid attribute.
* Explicitly set the organizationalPerson objectClass on user entries, as they use the inetOrgPerson objectClass, which is a subclass of organizationalPerson.
* Create indexes on the cn and uid attributes, as they are commonly queried.
* Remove unused variables and imports.
* Fix givenName for user3.
* Other minor code cleanup.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
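Taken together, a cleaned-up group entry along the lines described might look like the fragment below. This is a hypothetical illustration (entry names and the groupOfNames choice are invented), not the actual resource file:

```
# Hypothetical cleaned-up LDIF group entry: the dn refers to a uid, so the
# entry carries the uid attribute and the uidObject objectClass; sn,
# givenName, distinguishedName and extensibleObject are gone.
dn: uid=group1,ou=Groups,dc=example,dc=com
objectClass: groupOfNames
objectClass: uidObject
cn: group1
uid: group1
member: uid=user1,ou=People,dc=example,dc=com
```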
[jira] [Created] (HIVE-14713) LDAP Authentication Provider should be covered with unit tests
Illya Yalovyy created HIVE-14713:
------------------------------------

Summary: LDAP Authentication Provider should be covered with unit tests
Key: HIVE-14713
URL: https://issues.apache.org/jira/browse/HIVE-14713
Project: Hive
Issue Type: Test
Components: Authentication, Tests
Affects Versions: 2.1.0
Reporter: Illya Yalovyy
Assignee: Illya Yalovyy

Currently the LdapAuthenticationProviderImpl class is not covered with unit tests. To make this class testable, some minor refactoring will be required.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (HIVE-13510) Dynamic partitioning doesn’t work when remote metastore is used
Illya Yalovyy created HIVE-13510:
------------------------------------

Summary: Dynamic partitioning doesn’t work when remote metastore is used
Key: HIVE-13510
URL: https://issues.apache.org/jira/browse/HIVE-13510
Project: Hive
Issue Type: Bug
Components: Metastore
Affects Versions: 2.1.0
Environment: Hadoop 2.7.1
Reporter: Illya Yalovyy
Assignee: Illya Yalovyy
Priority: Critical

*Steps to reproduce:*
# Configure a remote metastore (hive.metastore.uris)
# create table t1 (a string);
# create table t2 (a string) partitioned by (b string);
# set hive.exec.dynamic.partition.mode=nonstrict;
# insert overwrite table t2 partition (b) select a, a from t1;

*Result:*
{noformat}
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: getMetaConf failed: unknown result
16/04/13 15:04:51 [c679e424-2501-4347-8146-cf1b1cae217c main]: ERROR ql.Driver: FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: getMetaConf failed: unknown result
org.apache.hadoop.hive.ql.parse.SemanticException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: getMetaConf failed: unknown result
	at org.apache.hadoop.hive.ql.plan.DynamicPartitionCtx.<init>(DynamicPartitionCtx.java:84)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFileSinkPlan(SemanticAnalyzer.java:6550)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:9315)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:9204)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:10071)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9949)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:10607)
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:358)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10618)
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:233)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:245)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:476)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:318)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1192)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1287)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1118)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1106)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:236)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:187)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:339)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:748)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:721)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:648)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: getMetaConf failed: unknown result
	at org.apache.hadoop.hive.ql.metadata.Hive.getMetaConf(Hive.java:3493)
	at org.apache.hadoop.hive.ql.plan.DynamicPartitionCtx.<init>(DynamicPartitionCtx.java:82)
	... 29 more
Caused by: org.apache.thrift.TApplicationException: getMetaConf failed: unknown result
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_getMetaConf(ThriftHiveMetastore.java:666)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.getMetaConf(ThriftHiveMetastore.java:646)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getMetaConf(HiveMetaStoreClient.java:550)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at
{noformat}
[jira] [Created] (HIVE-13185) orc.ReaderImpl.ensureOrcFooter() method fails on small text files with IndexOutOfBoundsException
Illya Yalovyy created HIVE-13185:
------------------------------------

Summary: orc.ReaderImpl.ensureOrcFooter() method fails on small text files with IndexOutOfBoundsException
Key: HIVE-13185
URL: https://issues.apache.org/jira/browse/HIVE-13185
Project: Hive
Issue Type: Bug
Components: ORC
Affects Versions: 2.1.0
Reporter: Illya Yalovyy

Steps to reproduce:
# Create a text source table with one line of data:
{code}
create table src (id int);
insert overwrite table src values (1);
{code}
# Create a target table:
{code}
create table trg (id int);
{code}
# Try to load the small text file into the target table:
{code}
load data inpath 'user/hive/warehouse/src/00_0' into table trg;
{code}

*Error message:*
{quote}
FAILED: SemanticException Unable to load data to destination table. Error: java.lang.IndexOutOfBoundsException
{quote}

*Stack trace:*
{noformat}
org.apache.hadoop.hive.ql.parse.SemanticException: Unable to load data to destination table. Error: java.lang.IndexOutOfBoundsException
	at org.apache.hadoop.hive.ql.parse.LoadSemanticAnalyzer.ensureFileFormatsMatch(LoadSemanticAnalyzer.java:340)
	at org.apache.hadoop.hive.ql.parse.LoadSemanticAnalyzer.analyzeInternal(LoadSemanticAnalyzer.java:224)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:242)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:481)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:317)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1190)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1285)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1116)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1104)
	...
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
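The failure mode suggests the footer check indexes into the file tail without first verifying that the file is long enough to contain the ORC magic. A minimal defensive sketch of such a guard (Python, illustrative only; the real check lives in the Java ReaderImpl and also validates the postscript):

```python
# Sketch of a length-guarded footer check, not the actual ORC reader code.
# A too-short file (such as a tiny text file mistakenly passed to the ORC
# reader) should produce a clean "malformed file" error rather than an
# out-of-bounds index.

ORC_MAGIC = b"ORC"

def ensure_orc_footer(data: bytes) -> None:
    magic_len = len(ORC_MAGIC)
    # Guard first: the file must be able to hold the magic plus at least
    # one postscript-length byte before we index into its tail.
    if len(data) < magic_len + 1:
        raise ValueError("Malformed ORC file: file is too small")
    if data[-magic_len:] != ORC_MAGIC:
        raise ValueError("Malformed ORC file: missing ORC magic in footer")
```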
[jira] [Created] (HIVE-12715) Unit test for HIVE-10685 fix
Illya Yalovyy created HIVE-12715:
------------------------------------

Summary: Unit test for HIVE-10685 fix
Key: HIVE-12715
URL: https://issues.apache.org/jira/browse/HIVE-12715
Project: Hive
Issue Type: Test
Reporter: Illya Yalovyy

It seems the bugfix provided for HIVE-10685 is not covered with tests. This tricky scenario could happen not only when a table gets concatenated but also in some other use cases. I'm going to implement a unit test for it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (HIVE-11882) Fetch optimizer should stop source files traversal once it exceeds the hive.fetch.task.conversion.threshold
Illya Yalovyy created HIVE-11882:
------------------------------------

Summary: Fetch optimizer should stop source files traversal once it exceeds the hive.fetch.task.conversion.threshold
Key: HIVE-11882
URL: https://issues.apache.org/jira/browse/HIVE-11882
Project: Hive
Issue Type: Improvement
Components: Physical Optimizer
Affects Versions: 1.0.0
Reporter: Illya Yalovyy

Hive 1.0's fetch optimizer tries to convert queries of the form "select … from … where … limit …" to a fetch task (see the hive.fetch.task.conversion property). This optimization gets the lengths of all the files in the specified partition and compares the total against a threshold value to determine whether it should use a fetch task or not (see the hive.fetch.task.conversion.threshold property). The main problem with this optimization is that the fetch optimizer doesn't stop traversing source files once the accumulated length exceeds hive.fetch.task.conversion.threshold. It works fine on HDFS, but it can cause significant performance degradation on other supported file systems.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
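The proposed early exit can be sketched as follows (illustrative only, not Hive's SimpleFetchOptimizer code; the function and parameter names are invented):

```python
def within_threshold(file_lengths, threshold):
    """Sum file lengths, but bail out as soon as the running total exceeds
    the threshold (standing in for hive.fetch.task.conversion.threshold),
    instead of statting every remaining file first.

    file_lengths may be a lazy iterator, so stopping early also stops
    issuing further file-system calls -- the point of the improvement on
    non-HDFS file systems where each stat is expensive."""
    total = 0
    for length in file_lengths:
        total += length
        if total > threshold:
            return False    # too big for fetch-task conversion; stop here
    return True
```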
[jira] [Created] (HIVE-11791) Add test for HIVE-10122
Illya Yalovyy created HIVE-11791:
------------------------------------

Summary: Add test for HIVE-10122
Key: HIVE-11791
URL: https://issues.apache.org/jira/browse/HIVE-11791
Project: Hive
Issue Type: Test
Components: Metastore
Affects Versions: 1.1.0
Reporter: Illya Yalovyy
Priority: Minor

Unit tests for PartitionPruner.compactExpr()

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (HIVE-11583) When PTF is used over large partitions the result could be corrupted
Illya Yalovyy created HIVE-11583:
------------------------------------

Summary: When PTF is used over large partitions the result could be corrupted
Key: HIVE-11583
URL: https://issues.apache.org/jira/browse/HIVE-11583
Project: Hive
Issue Type: Bug
Components: PTF-Windowing
Affects Versions: 1.2.1, 1.2.0, 1.0.0, 0.13.1, 0.14.0, 0.14.1
Environment: Hadoop 2.6 + Apache Hive built from trunk
Reporter: Illya Yalovyy
Priority: Critical

Dataset:
* The window has 50001 records (2 blocks on disk and 1 block in memory)
* The size of the second block is 32 MB (2 splits)

Result: when the last block is read from disk, only the first split is actually loaded; the second split gets missed. The total count of the result dataset is correct, but some records are missing and others are duplicated.

Example:
{code:sql}
CREATE TABLE ptf_big_src (
  id INT,
  key STRING,
  grp STRING,
  value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '../../data/files/ptf_3blocks.txt.gz' OVERWRITE INTO TABLE ptf_big_src;

SELECT grp, COUNT(1) cnt FROM ptf_big_src GROUP BY grp ORDER BY cnt desc;
---
-- A	25000
-- B	20000
-- C	5001
---

CREATE TABLE ptf_big_trg AS
SELECT *, row_number() OVER (PARTITION BY key ORDER BY grp) grp_num FROM ptf_big_src;

SELECT grp, COUNT(1) cnt FROM ptf_big_trg GROUP BY grp ORDER BY cnt desc;
---
-- A	34296
-- B	15704
-- C	1
---
{code}

Counts by 'grp' are incorrect!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (HIVE-10980) Merge of dynamic partitions loads all data to default partition
Illya Yalovyy created HIVE-10980:
------------------------------------

Summary: Merge of dynamic partitions loads all data to default partition
Key: HIVE-10980
URL: https://issues.apache.org/jira/browse/HIVE-10980
Project: Hive
Issue Type: Bug
Components: Hive
Affects Versions: 0.14.0
Environment: HDP 2.2.4 (also reproduced on Apache Hive built from trunk)
Reporter: Illya Yalovyy

Conditions that lead to the issue:
# Partition columns have different types
# Both static and dynamic partitions are used in the query
# Dynamically generated partitions require merge

Result: final data is loaded into __HIVE_DEFAULT_PARTITION__.

Steps to reproduce:
{code:sql}
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=strict;
set hive.optimize.sort.dynamic.partition=false;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

create external table sdp (
  dataint bigint,
  hour int,
  req string,
  cid string,
  caid string
) row format delimited fields terminated by ',';

load data local inpath '../../data/files/dynpartdata1.txt' into table sdp;
load data local inpath '../../data/files/dynpartdata2.txt' into table sdp;
...
load data local inpath '../../data/files/dynpartdataN.txt' into table sdp;

create table tdp (cid string, caid string)
partitioned by (dataint bigint, hour int, req string);

insert overwrite table tdp partition (dataint=20150316, hour=16, req)
select cid, caid, req from sdp where dataint=20150316 and hour=16;

select * from tdp order by caid;
show partitions tdp;
{code}

Example of the input file:
{noformat}
20150316,16,reqA,clusterIdA,cacheId1
20150316,16,reqB,clusterIdB,cacheId2
20150316,16,reqA,clusterIdC,cacheId3
20150316,16,reqD,clusterIdD,cacheId4
20150316,16,reqA,clusterIdA,cacheId5
{noformat}

Actual result:
{noformat}
clusterIdA	cacheId1	20150316	16	__HIVE_DEFAULT_PARTITION__
clusterIdA	cacheId1	20150316	16	__HIVE_DEFAULT_PARTITION__
clusterIdB	cacheId2	20150316	16	__HIVE_DEFAULT_PARTITION__
clusterIdC	cacheId3	20150316	16	__HIVE_DEFAULT_PARTITION__
clusterIdD	cacheId4	20150316	16	__HIVE_DEFAULT_PARTITION__
clusterIdA	cacheId5	20150316	16	__HIVE_DEFAULT_PARTITION__
clusterIdD	cacheId8	20150316	16	__HIVE_DEFAULT_PARTITION__
clusterIdB	cacheId9	20150316	16	__HIVE_DEFAULT_PARTITION__

dataint=20150316/hour=16/req=__HIVE_DEFAULT_PARTITION__
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (HIVE-9225) Windowing functions are not executing efficiently when the window is identical
Illya Yalovyy created HIVE-9225:
------------------------------------

Summary: Windowing functions are not executing efficiently when the window is identical
Key: HIVE-9225
URL: https://issues.apache.org/jira/browse/HIVE-9225
Project: Hive
Issue Type: Improvement
Components: PTF-Windowing
Affects Versions: 0.13.0
Environment: Linux
Reporter: Illya Yalovyy

The Hive optimizer and runtime are not smart enough to recognize when the windowing is the same. Even when the window is identical, the windowing is re-executed for every function, which causes the runtime to increase proportionally to the number of windows.

Example:
{code:sql}
select code,
       min(emp) over (partition by code order by emp range between current row and 3 following)
from sample_big limit 10;
{code}
*Time taken: 1h:36m:12s*

{code:sql}
select code,
       min(emp) over (partition by code order by emp range between current row and 3 following),
       max(emp) over (partition by code order by emp range between current row and 3 following),
       min(salary) over (partition by code order by emp range between current row and 3 following),
       max(salary) over (partition by code order by emp range between current row and 3 following)
from sample_big limit 10;
{code}
*Time taken: 4h:0m:37s*

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
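The proposed sharing can be sketched as follows. This is an illustrative model, not Hive's PTF translator: the frame here is row-based (current row plus the next N rows) as a simplified stand-in for the value-based RANGE frame in the queries above, and all names are invented:

```python
# Sketch: when several window functions share an identical
# PARTITION BY / ORDER BY / frame spec, compute the frame boundaries once
# per row and reuse them for every aggregate, instead of re-running the
# whole windowing once per function.

def windowed_aggregates(rows, value_keys, frame_size):
    """rows: list of dicts, already sorted within one partition.
    For each row the frame is the current row plus the next `frame_size`
    rows. Returns per-row min/max for every requested value column,
    derived from a single frame computation per row."""
    out = []
    for i in range(len(rows)):
        frame = rows[i : i + frame_size + 1]   # boundaries computed once...
        result = {}
        for key in value_keys:                 # ...shared by all aggregates
            vals = [r[key] for r in frame]
            result["min_" + key] = min(vals)
            result["max_" + key] = max(vals)
        out.append(result)
    return out
```

With this shape, adding a fourth or fifth aggregate over the same window costs only one more min/max over an already-materialized frame, rather than another full windowing pass.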