[jira] [Created] (HIVE-19805) TableScanDesc Use Commons Library
BELUGA BEHR created HIVE-19805:
--
Summary: TableScanDesc Use Commons Library
Key: HIVE-19805
URL: https://issues.apache.org/jira/browse/HIVE-19805
Project: Hive
Issue Type: Improvement
Components: Query Planning
Affects Versions: 4.0.0
Reporter: BELUGA BEHR

Use the Apache Commons library and remove some hand-rolled code.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19807) Add Useful Error Message To Table Header/Footer Parsing
BELUGA BEHR created HIVE-19807:
--
Summary: Add Useful Error Message To Table Header/Footer Parsing
Key: HIVE-19807
URL: https://issues.apache.org/jira/browse/HIVE-19807
Project: Hive
Issue Type: Improvement
Affects Versions: 2.3.2, 3.0.0
Reporter: BELUGA BEHR

Add useful logging messages when parsing invalid values of {{skip.header.line.count}} and {{skip.footer.line.count}}, for better troubleshooting.
[jira] [Created] (HIVE-19809) Remove Deprecated Code From Utilities Class
BELUGA BEHR created HIVE-19809:
--
Summary: Remove Deprecated Code From Utilities Class
Key: HIVE-19809
URL: https://issues.apache.org/jira/browse/HIVE-19809
Project: Hive
Issue Type: Improvement
Reporter: BELUGA BEHR

{quote}
This can go away once hive moves to support only JDK 7 and can use Files.createTempDirectory
{quote}

Remove the {{createTempDir}} method from the {{Utilities}} class.
[jira] [Created] (HIVE-19834) Clear Context Map of Paths to ContentSummary
BELUGA BEHR created HIVE-19834:
--
Summary: Clear Context Map of Paths to ContentSummary
Key: HIVE-19834
URL: https://issues.apache.org/jira/browse/HIVE-19834
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR
Attachments: HIVE-19834.1.patch

The {{Context}} class has a {{clear}} method; when it is called, various files are deleted and in-memory maps are cleared. I would like to propose that we also clear an additional in-memory map, which maps "File Path" -> "Content Summary", so that it can be garbage collected as soon as possible. For a query with a large file set, this map can be quite large.
[jira] [Created] (HIVE-19841) Upgrade commons-collections to commons-collections4
BELUGA BEHR created HIVE-19841:
--
Summary: Upgrade commons-collections to commons-collections4
Key: HIVE-19841
URL: https://issues.apache.org/jira/browse/HIVE-19841
Project: Hive
Issue Type: Task
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

Perhaps it is time to drink the Apache champagne (eat the Apache dog food) and upgrade the commons-collections library from 3.x to 4.x.

{code}
3.2.2
{code}
[jira] [Created] (HIVE-19842) Improve Hive UDF Usability
BELUGA BEHR created HIVE-19842:
--
Summary: Improve Hive UDF Usability
Key: HIVE-19842
URL: https://issues.apache.org/jira/browse/HIVE-19842
Project: Hive
Issue Type: Task
Components: HiveServer2, UDF
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR

Deploying and managing Hive UDFs is very cumbersome. Please improve.
[jira] [Created] (HIVE-19843) Add Function Calls To Standardized Authorization Model
BELUGA BEHR created HIVE-19843:
--
Summary: Add Function Calls To Standardized Authorization Model
Key: HIVE-19843
URL: https://issues.apache.org/jira/browse/HIVE-19843
Project: Hive
Issue Type: Improvement
Components: Authorization, HiveServer2
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR

As far as I know, the CREATE FUNCTION statement is protected by the standard Hive authorization model; however, actually using the function is open to anyone once it has been created by the Hive administrator. Please also make each call to a function subject to an authorization check, to protect functions from unauthorized users.
[jira] [Created] (HIVE-19844) Make CSV SerDe First-Class SerDe
BELUGA BEHR created HIVE-19844:
--
Summary: Make CSV SerDe First-Class SerDe
Key: HIVE-19844
URL: https://issues.apache.org/jira/browse/HIVE-19844
Project: Hive
Issue Type: Improvement
Components: HiveServer2, Serializers/Deserializers
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR

According to the [Hive SerDe Docs|https://cwiki.apache.org/confluence/display/Hive/CSV+Serde], there are some extra steps involved in getting the CSV SerDe working with Hive.

{code}
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "\t",
  "quoteChar" = "'",
  "escapeChar" = "\\"
)
STORED AS TEXTFILE;
{code}

I would like to propose that we promote this SerDe to first-class status:

{{STORED AS TEXT_CSV}}
{{STORED AS TEXT_TSV}}

The user should have to perform no additional steps to use this SerDe.
[jira] [Created] (HIVE-19845) Create Table Stored By MIME Type
BELUGA BEHR created HIVE-19845:
--
Summary: Create Table Stored By MIME Type
Key: HIVE-19845
URL: https://issues.apache.org/jira/browse/HIVE-19845
Project: Hive
Issue Type: Improvement
Components: HiveServer2, Serializers/Deserializers
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR

Instead of doing 'STORED BY XXX', perhaps we can standardize on [MIME|https://en.wikipedia.org/wiki/MIME] types.

||MIME||Type||
|"text/csv"|Comma Separated Values|
|"text/tab-separated-values"|Tab Separated Values|
|"application/json"|JSON|
|"text/xml"|XML|
|"application/avro"|Apache Avro|
|"application/parquet"|Apache Parquet|
|"application/orc"|Apache Hive ORC|
[jira] [Created] (HIVE-19846) Removed Deprecated Calls From FileUtils-getJarFilesByPath
BELUGA BEHR created HIVE-19846:
--
Summary: Removed Deprecated Calls From FileUtils-getJarFilesByPath
Key: HIVE-19846
URL: https://issues.apache.org/jira/browse/HIVE-19846
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR
Attachments: HIVE-19846.1.patch
[jira] [Created] (HIVE-19847) Create Separate getInputSummary Service
BELUGA BEHR created HIVE-19847:
--
Summary: Create Separate getInputSummary Service
Key: HIVE-19847
URL: https://issues.apache.org/jira/browse/HIVE-19847
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR
Attachments: HIVE-19847.1.patch

The Hive {{org.apache.hadoop.hive.ql.exec.Utilities.java}} file has taken on a life of its own. We should consider separating the various components into their own classes. For this ticket, I propose separating the {{getInputSummary}} functionality into its own class.

There are several issues with the current implementation:
# It is [synchronized|https://github.com/apache/hive/blob/f27c38ff55902827499192a4f8cf8ed37d6fd967/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2383]. Only one query can get a file input summary at a time. For a query that deals with a large data set with a large number of files, this can block other queries for a long period of time. This is especially painful when most queries use small data sets but a large data set is submitted on occasion.
# For each query, time is spent setting up and tearing down a ThreadPool.
# It uses deprecated code.

I propose breaking it out into its own class and creating a single thread pool that all queries pull from. In this way, the bottleneck will be on the number of available threads, not on a single query; if a big query is running and a small query is also submitted, the smaller query will still be able to proceed.
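A minimal sketch of the shared-pool idea described above, outside of Hive: every class and method name below is illustrative (this is not the Hive API), and the per-path tasks stand in for the real ContentSummary lookups.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: one process-wide pool shared by all queries, instead
// of a synchronized method that builds and tears down a pool per query.
public class InputSummaryService {

  // Shared, bounded pool: the bottleneck becomes the thread count, not a
  // global lock, so a small query is not stuck behind a large one.
  // Daemon threads so the shared pool never blocks JVM shutdown.
  private static final ExecutorService POOL = Executors.newFixedThreadPool(8, r -> {
    Thread t = new Thread(r, "input-summary-worker");
    t.setDaemon(true);
    return t;
  });

  // Runs all per-path summary tasks on the shared pool and sums the results.
  public static long totalSize(List<Callable<Long>> perPathTasks) {
    long total = 0;
    try {
      // invokeAll blocks until every submitted task has completed.
      for (Future<Long> f : POOL.invokeAll(perPathTasks)) {
        total += f.get();
      }
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    }
    return total;
  }
}
```

In this shape, concurrency is capped once for the whole process rather than serialized per query, which is the behavior change the ticket asks for.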
[jira] [Created] (HIVE-19900) HiveCLI HoS Performs Invalid Impersonation If User Name Truncated
BELUGA BEHR created HIVE-19900:
--
Summary: HiveCLI HoS Performs Invalid Impersonation If User Name Truncated
Key: HIVE-19900
URL: https://issues.apache.org/jira/browse/HIVE-19900
Project: Hive
Issue Type: Improvement
Components: CLI, Spark
Affects Versions: 2.3.2, 3.0.0, 1.2.2, 4.0.0
Reporter: BELUGA BEHR

The HiveCLI HoS code relies on the system property {{user.name}} when performing impersonation. The code decides to do an impersonation if the {{user.name}} system property does not match the current user who is launching the HiveCLI client. However, when confronted with a long user name, some shells and Linux distributions may truncate the user name to a certain size to conserve screen space. In these scenarios, the current user name does not match the {{user.name}} system property and never will, so impersonation will always happen, even though the user is trying to impersonate themselves. If YARN is not set up to allow the current user to impersonate, YARN will reject the request.

{code:java}
if (hiveConf.getBoolVar(HiveConf.ConfVars.HIVE_SERVER2_ENABLE_DOAS)) {
  try {
    String currentUser = Utils.getUGI().getShortUserName();
    // do not do impersonation in CLI mode
    if (!currentUser.equals(System.getProperty("user.name"))) {
      LOG.info("Attempting impersonation of " + currentUser);
      addProxyUser(currentUser);
    }
  } catch (Exception e) {
    String msg = "Cannot obtain username: " + e;
    throw new IllegalStateException(msg, e);
  }
}
{code}

[https://github.com/apache/hive/blob/da66386662fbbcbde9501b4a7b27d076bcc790d4/spark-client/src/main/java/org/apache/hive/spark/client/AbstractSparkClient.java#L354-L366]

Assuming a Kerberos-enabled environment, the error message in the YARN Resource Manager will be:

{code:java}
my-really-really-long-user-n...@hadoop.domain.com is not allowed to impersonate my-really-really-long-user-name
{code}
[jira] [Created] (HIVE-19901) Consolidate JsonSerde Classes
BELUGA BEHR created HIVE-19901:
--
Summary: Consolidate JsonSerde Classes
Key: HIVE-19901
URL: https://issues.apache.org/jira/browse/HIVE-19901
Project: Hive
Issue Type: Improvement
Affects Versions: 2.3.2, 3.0.0
Reporter: BELUGA BEHR
Fix For: 4.0.0

There is a {{serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java}} and a {{hcatalog/core/src/main/java/org/apache/hive/hcatalog/data/JsonSerDe.java}} in the Hive project. Please consolidate them into a single JsonSerDe, preferably in the 'serde2' package. Please note, however, that there is only a {{TestJsonSerDe.java}} for the _hcatalog_ package, not the _serde2_ package.
[jira] [Created] (HIVE-19933) ALTER TABLE DROP PARTITION - Partition Not Found
BELUGA BEHR created HIVE-19933:
--
Summary: ALTER TABLE DROP PARTITION - Partition Not Found
Key: HIVE-19933
URL: https://issues.apache.org/jira/browse/HIVE-19933
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Affects Versions: 1.2.2
Reporter: BELUGA BEHR

{code:sql}
ALTER TABLE web_logsz DROP PARTITION (`date`='xyz')
-- SemanticException [Error 10001]: Table not found web_logsz

ALTER TABLE web_logs DROP PARTITION (`date`='xyz')
-- Success.
{code}

There is no 'xyz' partition for the 'date' column. To make this behavior consistent, the query should fail if the user tries to drop a partition that does not exist.
[jira] [Created] (HIVE-19960) Add Column Mapping to JsonSerDe
BELUGA BEHR created HIVE-19960:
--
Summary: Add Column Mapping to JsonSerDe
Key: HIVE-19960
URL: https://issues.apache.org/jira/browse/HIVE-19960
Project: Hive
Issue Type: New Feature
Components: Serializers/Deserializers
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

Enhance the JSON SerDe to accept a SerDe property that allows for arbitrary mapping of JSON field names to column names. This would be very similar to Hive HBase integration.

{code}
{"fname":"John","lname":"Doe"}

CREATE TABLE my_table (
  first_name string,
  last_name string,
  ...
)
WITH SERDEPROPERTIES (
  "json.columns.mapping" = "fname:first_name,lname:last_name"
);
{code}
[jira] [Created] (HIVE-19977) Improve Output of SHOW PARTITIONS
BELUGA BEHR created HIVE-19977:
--
Summary: Improve Output of SHOW PARTITIONS
Key: HIVE-19977
URL: https://issues.apache.org/jira/browse/HIVE-19977
Project: Hive
Issue Type: Improvement
Components: Hive
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

{code:sql}
> create table partition_test (a string) partitioned by (b string, c string);
> insert into table partition_test partition (b='z', c='z') VALUES ('top');
> show partitions partition_test;
b=z/c=z
{code}

I think it would be more informative in a table format:

||b||c||
|z|z|

This clearly provides the information and prevents users from doing something like...

{code:sql}
> alter table partition_test drop partition ("b=z/c=z");
{code}
[jira] [Created] (HIVE-20015) Populate ArrayList with Constructor
BELUGA BEHR created HIVE-20015:
--
Summary: Populate ArrayList with Constructor
Key: HIVE-20015
URL: https://issues.apache.org/jira/browse/HIVE-20015
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

{code:title=MapWork.java}
public ArrayList<Operator<? extends OperatorDesc>> getWorks() {
  return new ArrayList<Operator<? extends OperatorDesc>>(aliasToWork.values());
}

public ArrayList<Path> getPaths() {
  ArrayList<Path> ret = new ArrayList<>();
  ret.addAll(pathToAliases.keySet());
  return ret;
}
{code}

The {{getWorks}} method correctly uses the constructor to populate the {{ArrayList}}. Please update the {{getPaths}} method to do the same, instead of creating an empty list and then growing its backing array to accommodate the {{addAll}} request.
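The proposed change, sketched on a stand-in class (the class name and the String element type are illustrative, not Hive's): the copy constructor sizes the backing array once, instead of creating an empty list and growing it for {{addAll}}.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.Set;

// Stand-in for MapWork: 'paths' plays the role of pathToAliases.keySet().
public class GetPathsDemo {
  private final Set<String> paths = new LinkedHashSet<>();

  public GetPathsDemo(Set<String> paths) {
    this.paths.addAll(paths);
  }

  // Before: ArrayList<String> ret = new ArrayList<>(); ret.addAll(paths); return ret;
  // After: the copy constructor allocates the right capacity up front.
  public ArrayList<String> getPaths() {
    return new ArrayList<>(paths);
  }
}
```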
[jira] [Created] (HIVE-20023) Give Indication That Query is Blocked on a Lock in HS2 UI
BELUGA BEHR created HIVE-20023:
--
Summary: Give Indication That Query is Blocked on a Lock in HS2 UI
Key: HIVE-20023
URL: https://issues.apache.org/jira/browse/HIVE-20023
Project: Hive
Issue Type: Improvement
Components: HiveServer2, Web UI
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

Please provide a clear indication that a Hive query is blocked by a table/partition lock in the HS2 WebUI.
[jira] [Created] (HIVE-20136) Code Review of ArchiveUtils Class
BELUGA BEHR created HIVE-20136:
--
Summary: Code Review of ArchiveUtils Class
Key: HIVE-20136
URL: https://issues.apache.org/jira/browse/HIVE-20136
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

General code review of {{ArchiveUtils}}.
[jira] [Created] (HIVE-20156) Printing Stacktrace to STDERR
BELUGA BEHR created HIVE-20156:
--
Summary: Printing Stacktrace to STDERR
Key: HIVE-20156
URL: https://issues.apache.org/jira/browse/HIVE-20156
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

Class {{org.apache.hadoop.hive.ql.exec.JoinOperator}} has the following code:

{code}
} catch (Exception e) {
  e.printStackTrace();
  throw new HiveException(e);
}
{code}

Do not print the stack trace to STDERR with a call to {{printStackTrace()}}. Please remove that line and let the code catching the {{HiveException}} worry about printing any messages through a logger.
[jira] [Created] (HIVE-20157) Do Not Print StackTraces to STDERR
BELUGA BEHR created HIVE-20157:
--
Summary: Do Not Print StackTraces to STDERR
Key: HIVE-20157
URL: https://issues.apache.org/jira/browse/HIVE-20157
Project: Hive
Issue Type: Improvement
Components: Parser
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

{{org/apache/hadoop/hive/ql/parse/ParseDriver.java}}

{code}
catch (RecognitionException e) {
  e.printStackTrace();
  throw new ParseException(parser.errors);
}
{code}

Do not use {{e.printStackTrace()}} to print to STDERR. Either remove it or replace it with a debug-level log statement; I would vote to simply remove it. There are several occurrences of this pattern in this class.
[jira] [Created] (HIVE-20158) Do Not Print StackTraces to STDERR in Base64TextOutputFormat
BELUGA BEHR created HIVE-20158:
--
Summary: Do Not Print StackTraces to STDERR in Base64TextOutputFormat
Key: HIVE-20158
URL: https://issues.apache.org/jira/browse/HIVE-20158
Project: Hive
Issue Type: Improvement
Components: Contrib
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/contrib/src/java/org/apache/hadoop/hive/contrib/fileformat/base64/Base64TextOutputFormat.java

{code}
try {
  String signatureString = job.get("base64.text.output.format.signature");
  if (signatureString != null) {
    signature = signatureString.getBytes("UTF-8");
  } else {
    signature = new byte[0];
  }
} catch (UnsupportedEncodingException e) {
  e.printStackTrace();
}
{code}

The {{UnsupportedEncodingException}} comes from the {{getBytes}} method call. Instead, use the {{Charset}} overload of the method; it does not declare this checked exception, so the try block can simply be removed. Every JVM supports UTF-8.

https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#getBytes(java.nio.charset.Charset)
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html#UTF_8
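The fix described above, sketched outside of Hive (the {{job.get}} lookup is replaced by a plain String parameter for illustration): {{String.getBytes(Charset)}} declares no checked exception, so the try/catch disappears entirely.

```java
import java.nio.charset.StandardCharsets;

public class SignatureDemo {

  // Charset-based getBytes cannot throw UnsupportedEncodingException, and
  // StandardCharsets.UTF_8 is guaranteed to exist on every JVM, so no
  // try/catch block is needed.
  public static byte[] signature(String signatureString) {
    if (signatureString != null) {
      return signatureString.getBytes(StandardCharsets.UTF_8);
    }
    return new byte[0];
  }
}
```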
[jira] [Created] (HIVE-20159) Do Not Print StackTraces to STDERR in ConditionalResolverSkewJoin
BELUGA BEHR created HIVE-20159:
--
Summary: Do Not Print StackTraces to STDERR in ConditionalResolverSkewJoin
Key: HIVE-20159
URL: https://issues.apache.org/jira/browse/HIVE-20159
Project: Hive
Issue Type: Improvement
Components: Query Planning
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverSkewJoin.java#L121

{code}
} catch (IOException e) {
  e.printStackTrace();
}
{code}

Introduce an SLF4J logger to this class and print a WARN-level log message if the {{IOException}} from {{Utilities.listStatusIfExists}} is thrown. I suggest WARN because the entire operation does not fail when this error happens; it continues on its way with the data it was able to collect. I am not sure if this is the intended behavior, but for now, an error message in the logs would be better.
[jira] [Created] (HIVE-20160) Do Not Print StackTraces to STDERR in OperatorFactory
BELUGA BEHR created HIVE-20160:
--
Summary: Do Not Print StackTraces to STDERR in OperatorFactory
Key: HIVE-20160
URL: https://issues.apache.org/jira/browse/HIVE-20160
Project: Hive
Issue Type: Improvement
Components: Query Processor
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

https://github.com/apache/hive/blob/ac6b2a3fb195916e22b2e5f465add2ffbcdc7430/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java#L158

{code}
} catch (Exception e) {
  e.printStackTrace();
  throw new HiveException(...
{code}

Do not print the stack trace. The error is being wrapped in a HiveException; allow the code catching this exception to print the error to a logger instead of dumping it here to STDERR. There are several instances of this in the class.
[jira] [Created] (HIVE-20161) Do Not Print StackTraces to STDERR in ParseDriver
BELUGA BEHR created HIVE-20161:
--
Summary: Do Not Print StackTraces to STDERR in ParseDriver
Key: HIVE-20161
URL: https://issues.apache.org/jira/browse/HIVE-20161
Project: Hive
Issue Type: Improvement
Components: Query Planning
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java

{code}
// Do not print the stack trace to STDERR - remove this, just throw the HiveException
} catch (Exception e) {
  e.printStackTrace();
  throw new HiveException(e);
}

...

// Do not log and throw. Log *or* throw. In this case, just throw; remove the logging.
// Also remove the explicit 'return' call; there is no need for it.
try {
  skewJoinKeyContext.endGroup();
} catch (IOException e) {
  LOG.error(e.getMessage(), e);
  throw new HiveException(e);
}
return;
{code}
[jira] [Created] (HIVE-20162) Do Not Print StackTraces to STDERR in AbstractJoinTaskDispatcher
BELUGA BEHR created HIVE-20162:
--
Summary: Do Not Print StackTraces to STDERR in AbstractJoinTaskDispatcher
Key: HIVE-20162
URL: https://issues.apache.org/jira/browse/HIVE-20162
Project: Hive
Issue Type: Improvement
Components: Query Planning
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/AbstractJoinTaskDispatcher.java

{code}
} catch (Exception e) {
  e.printStackTrace();
  throw new SemanticException("Generate Map Join Task Error: " + e.getMessage());
}
{code}

Remove the call to {{printStackTrace}} and just throw the error. If the stack trace really is needed (doubtful), then pass the exception to the {{SemanticException}} constructor as the cause.
[jira] [Created] (HIVE-20163) Simplify StringSubstrColStart Initialization
BELUGA BEHR created HIVE-20163:
--
Summary: Simplify StringSubstrColStart Initialization
Key: HIVE-20163
URL: https://issues.apache.org/jira/browse/HIVE-20163
Project: Hive
Issue Type: Improvement
Components: Query Processor
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR
Attachments: HIVE-20163.1.patch

* Remove code
* Remove exception handling
* Remove {{printStackTrace}} call
[jira] [Created] (HIVE-20166) LazyBinaryStruct Warn Level Logging
BELUGA BEHR created HIVE-20166:
--
Summary: LazyBinaryStruct Warn Level Logging
Key: HIVE-20166
URL: https://issues.apache.org/jira/browse/HIVE-20166
Project: Hive
Issue Type: Improvement
Components: Serializers/Deserializers
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java#L177-L180

{code}
// Extra bytes at the end?
if (!extraFieldWarned && lastFieldByteEnd < structByteEnd) {
  extraFieldWarned = true;
  LOG.warn("Extra bytes detected at the end of the row! " +
      "Last field end " + lastFieldByteEnd + " and serialize buffer end " + structByteEnd + ". " +
      "Ignoring similar problems.");
}

// Missing fields?
if (!missingFieldWarned && lastFieldByteEnd > structByteEnd) {
  missingFieldWarned = true;
  LOG.info("Missing fields! Expected " + fields.length + " fields but " +
      "only got " + fieldId + "! " +
      "Last field end " + lastFieldByteEnd + " and serialize buffer end " + structByteEnd + ". " +
      "Ignoring similar problems.");
}
{code}

The first log statement is at 'warn' level; the second is at 'info' level. Please change the second log to 'warn' as well. This seems like a problem the user would like to know about.
[jira] [Created] (HIVE-20168) ReduceSinkOperator Logging Hidden
BELUGA BEHR created HIVE-20168:
--
Summary: ReduceSinkOperator Logging Hidden
Key: HIVE-20168
URL: https://issues.apache.org/jira/browse/HIVE-20168
Project: Hive
Issue Type: Bug
Components: Operators
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

[https://github.com/apache/hive/blob/ac6b2a3fb195916e22b2e5f465add2ffbcdc7430/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java]

{code:java}
if (LOG.isTraceEnabled()) {
  if (numRows == cntr) {
    cntr = logEveryNRows == 0 ? cntr * 10 : numRows + logEveryNRows;
    if (cntr < 0 || numRows < 0) {
      cntr = 0;
      numRows = 1;
    }
    LOG.info(toString() + ": records written - " + numRows);
  }
}

...

if (LOG.isTraceEnabled()) {
  LOG.info(toString() + ": records written - " + numRows);
}
{code}

There are logging guards here checking for TRACE level, but the logging itself is at INFO level. This is important logging for detecting data skew. Please change the guards to check for INFO... or, better yet, remove the guards altogether, since it is very rare that a service runs with only WARN-level logging.
[jira] [Created] (HIVE-20169) Print Final Rows Processed in MapOperator
BELUGA BEHR created HIVE-20169:
--
Summary: Print Final Rows Processed in MapOperator
Key: HIVE-20169
URL: https://issues.apache.org/jira/browse/HIVE-20169
Project: Hive
Issue Type: Improvement
Components: Operators
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

https://github.com/apache/hive/blob/ac6b2a3fb195916e22b2e5f465add2ffbcdc7430/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java#L573-L582

This class emits a log message every time a certain number of records has been processed, but it does not print a final count. Override the {{MapOperator}} class's {{closeOp}} method to print a final log message providing the total number of rows read by this mapper.
[jira] [Created] (HIVE-20170) Improve JoinOperator "rows for join key" Logging
BELUGA BEHR created HIVE-20170:
--
Summary: Improve JoinOperator "rows for join key" Logging
Key: HIVE-20170
URL: https://issues.apache.org/jira/browse/HIVE-20170
Project: Hive
Issue Type: Improvement
Components: Operators
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

{code}
2018-06-25 09:37:33,193 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 5728000 rows for join key [333, 22]
2018-06-25 09:37:33,901 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 5828000 rows for join key [333, 22]
2018-06-25 09:37:34,623 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 5928000 rows for join key [333, 22]
2018-06-25 09:37:35,342 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 6028000 rows for join key [333, 22]
{code}

https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java#L120

This logging should use the same facilities as the other Operators for emitting this type of log message [HIVE-10078]. Maybe this feature should be refactored into an AbstractOperator class? Also, it should print a final count for each join key.
[jira] [Created] (HIVE-20171) Make hive.stats.autogather Per Table
BELUGA BEHR created HIVE-20171:
--
Summary: Make hive.stats.autogather Per Table
Key: HIVE-20171
URL: https://issues.apache.org/jira/browse/HIVE-20171
Project: Hive
Issue Type: Improvement
Components: HiveServer2, Standalone Metastore
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

{{hive.stats.autogather}}
{{hive.stats.column.autogather}}

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

These are currently global settings. Make the global settings the defaults, but allow them to be overridden by a table's properties.

We have recently seen S3-backed tables that are not regularly queried, yet CREATE TABLE is very slow (30+ minutes) because it collects stats for all of the files in the table. We would like to turn this feature off for certain S3 tables.
[jira] [Created] (HIVE-20190) Report Client IP Address When Opening New Session
BELUGA BEHR created HIVE-20190:
--
Summary: Report Client IP Address When Opening New Session
Key: HIVE-20190
URL: https://issues.apache.org/jira/browse/HIVE-20190
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Affects Versions: 2.3.2, 3.0.0, 4.0.0
Reporter: BELUGA BEHR

https://github.com/apache/hive/blob/e7d1781ec4662e088dcd6ffbe3f866738792ad9b/service/src/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java#L320

There are times when a misbehaving client can knock an HS2 instance offline because it opens many simultaneous connections and takes up all of the resources. It would be nice if we could log the source IP address of each connection along with the "Client protocol version" information.
[jira] [Created] (HIVE-20222) Enable Skew Join Optimization For Outer Joins
BELUGA BEHR created HIVE-20222:
--
Summary: Enable Skew Join Optimization For Outer Joins
Key: HIVE-20222
URL: https://issues.apache.org/jira/browse/HIVE-20222
Project: Hive
Issue Type: New Feature
Components: Logical Optimizer
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

{code}
// We are trying to adding map joins to handle skew keys, and map join right
// now does not work with outer joins
if (!GenMRSkewJoinProcessor.skewJoinEnabled(parseCtx.getConf(), joinOp)) return;
{code}
[jira] [Created] (HIVE-20223) SmallTableCache.java SLF4J Parameterized Logging
BELUGA BEHR created HIVE-20223:
--
Summary: SmallTableCache.java SLF4J Parameterized Logging
Key: HIVE-20223
URL: https://issues.apache.org/jira/browse/HIVE-20223
Project: Hive
Issue Type: Improvement
Components: Spark
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

{code:java|title=org/apache/hadoop/hive/ql/exec/spark/SmallTableCache.java}
if (LOG.isDebugEnabled()) {
  LOG.debug("Cleaned up small table cache for query " + queryId);
}

if (tableContainerMap.putIfAbsent(path, tableContainer) == null && LOG.isDebugEnabled()) {
  LOG.debug("Cached small table file " + path + " for query " + queryId);
}

if (tableContainer != null && LOG.isDebugEnabled()) {
  LOG.debug("Loaded small table file " + path + " from cache for query " + queryId);
}
{code}

Remove {{isDebugEnabled}} and replace with parameterized logging.

https://www.slf4j.org/faq.html#logging_performance
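With SLF4J's parameterized form the explicit guards are redundant: the final message string is only built after the logger has checked that DEBUG is enabled, so an unguarded call such as LOG.debug("Cached small table file {} for query {}", path, queryId) costs almost nothing when DEBUG is off. The toy formatter below is not SLF4J code, just an illustration of the "{}" placeholder substitution a logger performs lazily after the level check:

```java
// Toy re-implementation of SLF4J-style "{}" placeholder substitution,
// for illustration only. A real logger runs something like this only
// after the level check, which is why explicit guards are unnecessary.
public class ParamLogDemo {

  public static String format(String fmt, Object... args) {
    StringBuilder sb = new StringBuilder();
    int argIndex = 0;
    int from = 0;
    int at;
    // Replace each "{}" with the next argument, left to right.
    while ((at = fmt.indexOf("{}", from)) >= 0 && argIndex < args.length) {
      sb.append(fmt, from, at).append(args[argIndex++]);
      from = at + 2;
    }
    // Append whatever trails the last placeholder.
    return sb.append(fmt.substring(from)).toString();
  }
}
```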
[jira] [Created] (HIVE-20224) ReplChangeManager.java Remove Logging Guards
BELUGA BEHR created HIVE-20224:
--
Summary: ReplChangeManager.java Remove Logging Guards
Key: HIVE-20224
URL: https://issues.apache.org/jira/browse/HIVE-20224
Project: Hive
Issue Type: Improvement
Components: Metastore, Standalone Metastore
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

{code:java|title=metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ReplChangeManager.java}
if (LOG.isDebugEnabled()) {
  LOG.debug("A file with the same content of {} already exists, ignore", path.toString());
}
// > LOG.debug("A file with the same content of {} already exists, ignore", path);

if (LOG.isDebugEnabled()) {
  LOG.debug("Encoded URI: " + encodedUri);
}
// > LOG.debug("Encoded URI: {}", encodedUri);

if (LOG.isDebugEnabled()) {
  LOG.debug("Move " + file.toString() + " to trash");
}
// > LOG.debug("Move {} to trash", file);

... others
{code}
[jira] [Created] (HIVE-20233) Review Operator.java
BELUGA BEHR created HIVE-20233:
--
Summary: Review Operator.java
Key: HIVE-20233
URL: https://issues.apache.org/jira/browse/HIVE-20233
Project: Hive
Issue Type: Improvement
Components: Query Planning
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

Various improvements to {{Operator.java}}.
[jira] [Created] (HIVE-20236) Do Not Print StackTraces to STDERR in DDLTask
BELUGA BEHR created HIVE-20236:
--
Summary: Do Not Print StackTraces to STDERR in DDLTask
Key: HIVE-20236
URL: https://issues.apache.org/jira/browse/HIVE-20236
Project: Hive
Issue Type: Improvement
Affects Versions: 3.0.0, 4.0.0
Reporter: BELUGA BEHR

{code:java|title=DDLTask.java}
try {
  ret = ToolRunner.run(fss, args.toArray(new String[0]));
} catch (Exception e) {
  e.printStackTrace();
  throw new HiveException(e);
}
{code}

Do not print the stack trace to STDERR; let the error be handled up the call stack via the {{HiveException}}.
[jira] [Created] (HIVE-20237) Do Not Print StackTraces to STDERR in HiveMetaStore
BELUGA BEHR created HIVE-20237: -- Summary: Do Not Print StackTraces to STDERR in HiveMetaStore Key: HIVE-20237 URL: https://issues.apache.org/jira/browse/HIVE-20237 Project: Hive Issue Type: Improvement Components: Standalone Metastore Affects Versions: 3.0.0, 4.0.0 Reporter: BELUGA BEHR {code:java|title=HiveMetaStore.java} } catch (Throwable x) { x.printStackTrace(); HMSHandler.LOG.error(StringUtils.stringifyException(x)); throw x; } {code} This is the "log and throw" anti-pattern; don't do it. Just throw the exception and let it be handled, and logged, in one place. At the very least, we don't need the error message going to the STDERR logs via {{printStackTrace}}; please remove it. Also remove the {{stringifyException}} code and just use the normal logging facilities: {code} HMSHandler.LOG.error("Error", e); {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20238) Remove stringifyException Method
BELUGA BEHR created HIVE-20238: -- Summary: Remove stringifyException Method Key: HIVE-20238 URL: https://issues.apache.org/jira/browse/HIVE-20238 Project: Hive Issue Type: Improvement Affects Versions: 3.0.0, 4.0.0 Reporter: BELUGA BEHR Fix For: 4.0.0 Remove the method {{stringifyException}} https://github.com/apache/hive/blob/c2940a07cf0891e922672782b73ec22551a7eedd/common/src/java/org/apache/hive/common/util/HiveStringUtils.java#L146 The code already exists in Hadoop proper: https://github.com/apache/hadoop/blob/2b2399d623539ab68e71a38fa9fbfc9a405bddb8/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringUtils.java#L86 And beyond that, I was told on the Hadoop dev mailing list that this function should not be used anymore. Developers should just be using the SLF4J facilities and not this home-grown thing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20239) Do Not Print StackTraces to STDERR in MapJoinProcessor
BELUGA BEHR created HIVE-20239: -- Summary: Do Not Print StackTraces to STDERR in MapJoinProcessor Key: HIVE-20239 URL: https://issues.apache.org/jira/browse/HIVE-20239 Project: Hive Issue Type: Improvement Affects Versions: 3.0.0 Reporter: BELUGA BEHR Fix For: 4.0.0 {code:java|title=MapJoinProcessor.java} } catch (Exception e) { e.printStackTrace(); throw new SemanticException("Failed to generate new mapJoin operator " + "by exception : " + e.getMessage()); } {code} Please change to... something like... {code} } catch (Exception e) { throw new SemanticException("Failed to generate new mapJoin operator", e); } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20255) Review LevelOrderWalker.java
BELUGA BEHR created HIVE-20255: -- Summary: Review LevelOrderWalker.java Key: HIVE-20255 URL: https://issues.apache.org/jira/browse/HIVE-20255 Project: Hive Issue Type: Improvement Components: Query Planning Affects Versions: 3.0.0, 4.0.0 Reporter: BELUGA BEHR Fix For: 3.1.0 Attachments: HIVE-20255.1.patch https://github.com/apache/hive/blob/6d890faf22fd1ede3658a5eed097476eab3c67e9/ql/src/java/org/apache/hadoop/hive/ql/lib/LevelOrderWalker.java * Make code more concise * Fix some checkstyle issues {code} if (toWalk.get(index).getChildren() != null) { for(Node child : toWalk.get(index).getChildren()) { {code} The underlying implementation of {{getChildren()}} has to do some real work, so do not throw away that work after checking for null. Simply call it once and store the result. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
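A minimal sketch of the suggested fix, using a hypothetical {{Node}} stand-in for the walker's real node type: {{getChildren()}} is invoked once and its result reused for both the null check and the iteration. The counter exists only to make the single invocation observable.

```java
import java.util.List;

public class WalkSketch {
    // Hypothetical stand-in for org.apache.hadoop.hive.ql.lib.Node.
    static class Node {
        int getChildrenCalls;              // counts the "real work"
        private final List<Node> children;

        Node(List<Node> children) {
            this.children = children;
        }

        List<Node> getChildren() {
            getChildrenCalls++;            // expensive in the real walker
            return children;
        }
    }

    // Call getChildren() once; reuse the result for check and loop.
    static int countChildren(Node nd) {
        List<Node> children = nd.getChildren();
        if (children == null) {
            return 0;
        }
        int n = 0;
        for (Node child : children) {
            n++;
        }
        return n;
    }
}
```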
[jira] [Created] (HIVE-20257) Improvements to Hive.java
BELUGA BEHR created HIVE-20257: -- Summary: Improvements to Hive.java Key: HIVE-20257 URL: https://issues.apache.org/jira/browse/HIVE-20257 Project: Hive Issue Type: Improvement Affects Versions: 3.0.0, 4.0.0 Reporter: BELUGA BEHR Fix For: 4.0.0 Various fixes to {{Hive.java}} * Use Log4J parameters in logging statements * Fix checkstyle issues * Make code more concise * Remove "log and throw" code * Replace calls to deprecated code * Remove superfluous calls to {{toString}} "Log and throw" is considered an anti-pattern. Only the highest-level catch should provide detailed logging; otherwise we print the same stack trace to the logs several times and with different context (for example, when an exception is wrapped, we get two different logging events). https://community.oracle.com/docs/DOC-983543#logAndThrow -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20258) Should Synchronize getInstance in ReplChangeManager
BELUGA BEHR created HIVE-20258: -- Summary: Should Synchronize getInstance in ReplChangeManager Key: HIVE-20258 URL: https://issues.apache.org/jira/browse/HIVE-20258 Project: Hive Issue Type: Bug Components: Metastore Affects Versions: 3.0.0, 4.0.0 Reporter: BELUGA BEHR {code:java} public static ReplChangeManager getInstance(Configuration conf) throws MetaException { if (instance == null) { instance = new ReplChangeManager(conf); } return instance; } {code} This method needs to be synchronized, or two different callers may both see a null value and each create their own manager. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
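A sketch of the fix, with the {{Configuration}} argument and {{MetaException}} omitted to keep the example self-contained: marking the accessor {{synchronized}} closes the race in which two callers both observe a null instance and each construct their own manager.

```java
public class ReplManagerSketch {
    private static ReplManagerSketch instance;

    private ReplManagerSketch() {
    }

    // synchronized: only one thread at a time can observe a null
    // instance and construct the singleton.
    public static synchronized ReplManagerSketch getInstance() {
        if (instance == null) {
            instance = new ReplManagerSketch();
        }
        return instance;
    }
}
```

Every caller now receives the same instance, regardless of interleaving.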
[jira] [Created] (HIVE-20317) Spark Dynamic Partition Pruning - Use Stats to Determine Partition Count
BELUGA BEHR created HIVE-20317: -- Summary: Spark Dynamic Partition Pruning - Use Stats to Determine Partition Count Key: HIVE-20317 URL: https://issues.apache.org/jira/browse/HIVE-20317 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 3.1.0, 4.0.0 Reporter: BELUGA BEHR {code:xml|title=hive-site.xml} <property> <name>hive.metastore.limit.partition.request</name> <value>2</value> </property> {code} {code:sql} CREATE TABLE partitioned_user( firstname VARCHAR(64), lastname VARCHAR(64) ) PARTITIONED BY (country VARCHAR(64)) STORED AS PARQUET; CREATE TABLE country( name VARCHAR(64) ) STORED AS PARQUET; insert into partitioned_user partition (country='USA') values ("John", "Doe"); insert into partitioned_user partition (country='UK') values ("Sir", "Arthur"); insert into partitioned_user partition (country='FR') values ("Jacque", "Martin"); insert into country values ('USA'); set hive.execution.engine=spark; set hive.spark.dynamic.partition.pruning=true; explain select * from partitioned_user u where u.country in (select c.name from country c); -- Error while compiling statement: FAILED: SemanticException MetaException(message:Number of partitions scanned (=3) on table 'partitioned_user' exceeds limit (=2). This is controlled on the metastore server by hive.metastore.limit.partition.request.) {code} The EXPLAIN plan generation fails because there are three partitions involved in this query. However, since Spark DPP is enabled, Hive should be able to use table stats to know that the {{country}} table only has one record, determine that only one partition needs to be scanned, and allow this query to execute. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20484) Disable Block Cache By Default With HBase SerDe
BELUGA BEHR created HIVE-20484: -- Summary: Disable Block Cache By Default With HBase SerDe Key: HIVE-20484 URL: https://issues.apache.org/jira/browse/HIVE-20484 Project: Hive Issue Type: Improvement Components: HBase Handler Affects Versions: 1.2.3, 2.4.0, 4.0.0, 3.2.0 Reporter: BELUGA BEHR {quote} Scan instances can be set to use the block cache in the RegionServer via the setCacheBlocks method. For input Scans to MapReduce jobs, this should be false. https://hbase.apache.org/book.html#perf.hbase.client.blockcache {quote} However, from the Hive code, we can see that this is not the case. {code} public static final String HBASE_SCAN_CACHEBLOCKS = "hbase.scan.cacheblock"; ... String scanCacheBlocks = tableProperties.getProperty(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS); if (scanCacheBlocks != null) { jobProperties.put(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS, scanCacheBlocks); } ... String scanCacheBlocks = jobConf.get(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS); if (scanCacheBlocks != null) { scan.setCacheBlocks(Boolean.parseBoolean(scanCacheBlocks)); } {code} In the Hive code, we can see that if {{hbase.scan.cacheblock}} is not specified in the {{SERDEPROPERTIES}} then {{setCacheBlocks}} is not called and the default value of the HBase {{Scan}} class is used. {code:java|title=Scan.java} /** * Set whether blocks should be cached for this Scan. * * This is true by default. When true, default settings of the table and * family are used (this will never override caching blocks if the block * cache is disabled for that family or entirely). * * @param cacheBlocks if false, default settings are overridden and blocks * will not be cached */ public Scan setCacheBlocks(boolean cacheBlocks) { this.cacheBlocks = cacheBlocks; return this; } {code} Hive is doing full scans of the table with MapReduce/Spark and therefore, according to the HBase docs, the default behavior here should be that blocks are not cached. 
Hive should set this value to "false" by default unless the table {{SERDEPROPERTIES}} override this. {code:sql} -- Commands for HBase -- create 'test', 't' CREATE EXTERNAL TABLE test(value map<string,string>, row_key string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = "t:,:key", "hbase.scan.cacheblock" = "false" ); {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
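A sketch of the proposed default, using a hypothetical {{resolveCacheBlocks}} helper rather than the actual HBase handler code: when the table properties do not set {{hbase.scan.cacheblock}}, fall back to "false" instead of leaving the HBase {{Scan}} default (true) in place.

```java
import java.util.Properties;

public class CacheBlocksDefault {
    static final String HBASE_SCAN_CACHEBLOCKS = "hbase.scan.cacheblock";

    // Hypothetical helper: the second argument to getProperty supplies
    // the "false" default, so an unset property no longer falls through
    // to Scan's built-in default of true.
    static boolean resolveCacheBlocks(Properties tableProperties) {
        String v = tableProperties.getProperty(HBASE_SCAN_CACHEBLOCKS, "false");
        return Boolean.parseBoolean(v);
    }
}
```

A table that explicitly sets the property in its {{SERDEPROPERTIES}} still overrides the default.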
[jira] [Created] (HIVE-20619) Include MultiDelimitSerDe in HiveServer2 By Default
BELUGA BEHR created HIVE-20619: -- Summary: Include MultiDelimitSerDe in HiveServer2 By Default Key: HIVE-20619 URL: https://issues.apache.org/jira/browse/HIVE-20619 Project: Hive Issue Type: Improvement Components: HiveServer2, Serializers/Deserializers Affects Versions: 3.0.0, 4.0.0 Reporter: BELUGA BEHR In [HIVE-20020], the hive-contrib JAR file was removed from the HiveServer2 classpath. With this change, the {{MultiDelimitSerDe}} is no longer included. In a way, this is fine, because {{MultiDelimitSerDe}} was a pain in that environment anyway: it was available to HiveServer2, and therefore would work with a limited set of queries (select * from table limit 1), but any other query on that table which launched a MapReduce job would fail because the hive-contrib JAR file was not sent out with the rest of the Hive JARs for MapReduce jobs. Please bring {{MultiDelimitSerDe}} back into the fold so that it's available to users out of the box, without having to install the hive-contrib JAR into the HiveServer2 auxiliary directory. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20665) Hive Parallel Tasks - Hive Configuration ConcurrentModificationException
BELUGA BEHR created HIVE-20665: -- Summary: Hive Parallel Tasks - Hive Configuration ConcurrentModificationException Key: HIVE-20665 URL: https://issues.apache.org/jira/browse/HIVE-20665 Project: Hive Issue Type: Bug Components: HiveServer2 Affects Versions: 3.1.0, 2.3.2, 4.0.0 Reporter: BELUGA BEHR When parallel tasks are enabled in Hive, all of the resulting queries share the same Hive configuration. This is problematic, as each query will modify the same {{HiveConf}} object with things like query ID and query text. The queries overwrite each other's values and cause {{ConcurrentModificationException}} issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20797) Print Number of Locks Acquired
BELUGA BEHR created HIVE-20797: -- Summary: Print Number of Locks Acquired Key: HIVE-20797 URL: https://issues.apache.org/jira/browse/HIVE-20797 Project: Hive Issue Type: Improvement Components: HiveServer2, Locking Affects Versions: 4.0.0 Reporter: BELUGA BEHR The number of locks acquired by a query can greatly influence the performance and stability of the system, especially for ZK locks. Please add INFO level logging with the number of locks each query obtains. Log here: https://github.com/apache/hive/blob/3963c729fabf90009cb67d277d40fe5913936358/ql/src/java/org/apache/hadoop/hive/ql/Driver.java#L1670-L1672 {quote} A list of acquired locks will be stored in the org.apache.hadoop.hive.ql.Context object and can be retrieved via org.apache.hadoop.hive.ql.Context#getHiveLocks. {quote} https://github.com/apache/hive/blob/758ff449099065a84c46d63f9418201c8a6731b1/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveTxnManager.java#L115-L127 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
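A sketch of the message that might be logged at INFO after lock acquisition, using a hypothetical {{describe}} helper around the size of the list returned by {{Context#getHiveLocks}} (the helper name and exact wording are assumptions, not the patch itself).

```java
import java.util.List;

public class LockCount {
    // Hypothetical helper: builds the INFO message from the acquired-lock
    // list; a null list (no locks acquired) is reported as zero.
    static String describe(List<?> hiveLocks) {
        int n = hiveLocks == null ? 0 : hiveLocks.size();
        return "Acquired " + n + " locks";
    }
}
```

In {{Driver}}, this would amount to something like {{LOG.info(describe(ctx.getHiveLocks()))}} after the lock manager returns.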
[jira] [Created] (HIVE-20831) Add Session ID to Operation Logging
BELUGA BEHR created HIVE-20831: -- Summary: Add Session ID to Operation Logging Key: HIVE-20831 URL: https://issues.apache.org/jira/browse/HIVE-20831 Project: Hive Issue Type: Improvement Components: HiveServer2 Affects Versions: 3.1.0, 4.0.0 Reporter: BELUGA BEHR {code:java|title=OperationManager.java} LOG.info("Adding operation: " + operation.getHandle()); {code} Please add additional logging to explicitly state which Hive session this operation is being added to. https://github.com/apache/hive/blob/3963c729fabf90009cb67d277d40fe5913936358/service/src/java/org/apache/hive/service/cli/operation/OperationManager.java#L201 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20832) Locate Operation Logs Under Session Directory
BELUGA BEHR created HIVE-20832: -- Summary: Locate Operation Logs Under Session Directory Key: HIVE-20832 URL: https://issues.apache.org/jira/browse/HIVE-20832 Project: Hive Issue Type: Improvement Components: HiveServer2 Affects Versions: 3.1.0, 4.0.0 Reporter: BELUGA BEHR If I understand the Hive entity relationship model correctly, each Hive Session will have 0 or more Operations associated with it. Here is the current session setup sequence: {code} 2018-10-24 21:06:03,771 INFO org.apache.hadoop.hive.ql.session.SessionState: [HiveServer2-Handler-Pool: Thread-510932]: Created local directory: /tmp/hive/7650c8ff-2ba5-4bbb-964f-fcecc45e5193 2018-10-24 21:06:03,779 INFO org.apache.hadoop.hive.ql.session.SessionState: [HiveServer2-Handler-Pool: Thread-510932]: Created HDFS directory: /tmp/hive/hive/7650c8ff-2ba5-4bbb-964f-fcecc45e5193/_tmp_space.db 2018-10-24 21:06:03,782 INFO org.apache.hive.service.cli.session.HiveSessionImpl: [HiveServer2-Handler-Pool: Thread-510932]: Operation log session directory is created: /var/log/hive/operation_logs/7650c8ff-2ba5-4bbb-964f-fcecc45e5193 {code} The Hive Session gets its own directory on the local FS and operation logs get their own space as well. Can we please merge so that all of the Operation directories are stored within their associated Hive session directory? Something like... 
{code} 2018-10-24 21:06:03,771 INFO org.apache.hadoop.hive.ql.session.SessionState: [HiveServer2-Handler-Pool: Thread-510932]: Created local directory: /tmp/hive/7650c8ff-2ba5-4bbb-964f-fcecc45e5193 2018-10-24 21:06:03,779 INFO org.apache.hadoop.hive.ql.session.SessionState: [HiveServer2-Handler-Pool: Thread-510932]: Created HDFS directory: /tmp/hive/hive/7650c8ff-2ba5-4bbb-964f-fcecc45e5193/_tmp_space.db 2018-10-24 21:06:03,782 INFO org.apache.hive.service.cli.session.HiveSessionImpl: [HiveServer2-Handler-Pool: Thread-510932]: Operation log session directory is created: /tmp/hive/7650c8ff-2ba5-4bbb-964f-fcecc45e5193/operation_logs/7650c8ff-2ba5-4bbb-964f-fcecc45e5193 {code} This would allow removal of the {{hive.server2.logging.operation.log.location}} configuration: one less thing an operator needs to worry and know about. As it stands, one set of logs is in {{/tmp}} and the other is in {{/var}}, which is a bit confusing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20844) Cache Instances of CacheManager in DummyTxnManager
BELUGA BEHR created HIVE-20844: -- Summary: Cache Instances of CacheManager in DummyTxnManager Key: HIVE-20844 URL: https://issues.apache.org/jira/browse/HIVE-20844 Project: Hive Issue Type: Improvement Components: HiveServer2, Locking Affects Versions: 3.1.0, 2.3.2, 4.0.0 Reporter: BELUGA BEHR I noticed that the {{DummyTxnManager}} class instantiates quite a few instances of {{ZooKeeperHiveLockManager}}. The ZooKeeper LM creates a connection to ZK for each instance created. It also does some initialization steps that are almost always just noise and pressure on ZooKeeper because it has already been initialized and the steps are therefore NOOPs. {{ZooKeeperHiveLockManager}} should be a singleton class with one long-lived connection to the ZooKeeper service. Perhaps the {{HiveLockManager}} interface could have a {{isSingleton()}} method which indicates that the LM should only be instantiated once and cached for subsequent sessions. {code:java} 2018-05-14 22:45:30,574 INFO org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: [HiveServer2-Background-Pool: Thread-1252389]: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager 2018-05-14 22:51:27,865 INFO org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: [HiveServer2-Background-Pool: Thread-1252671]: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager 2018-05-14 22:51:37,552 INFO org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: [HiveServer2-Background-Pool: Thread-1252686]: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager 2018-05-14 22:51:49,046 INFO org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: [HiveServer2-Background-Pool: Thread-1252736]: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager 2018-05-14 22:51:50,664 INFO org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: [HiveServer2-Background-Pool: Thread-1252742]: Creating 
lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager 2018-05-14 23:00:54,314 INFO org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: [HiveServer2-Background-Pool: Thread-1253479]: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager 2018-05-14 23:17:26,867 INFO org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: [HiveServer2-Background-Pool: Thread-1254180]: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager 2018-05-14 23:24:25,426 INFO org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager: [HiveServer2-Background-Pool: Thread-1255493]: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager {code} {code:java|title=DummyTxnManager.java} @Override public HiveLockManager getLockManager() throws LockException { if (lockMgr == null) { boolean supportConcurrency = conf.getBoolVar(HiveConf.ConfVars.HIVE_SUPPORT_CONCURRENCY); if (supportConcurrency) { String lockMgrName = conf.getVar(HiveConf.ConfVars.HIVE_LOCK_MANAGER); if ((lockMgrName == null) || (lockMgrName.isEmpty())) { throw new LockException(ErrorMsg.LOCKMGR_NOT_SPECIFIED.getMsg()); } try { // CACHE LM HERE LOG.info("Creating lock manager of type " + lockMgrName); lockMgr = (HiveLockManager)ReflectionUtils.newInstance( conf.getClassByName(lockMgrName), conf); lockManagerCtx = new HiveLockManagerCtx(conf); lockMgr.setContext(lockManagerCtx); } catch (Exception e) { ... 
{code} [https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockManager.java] {code:java|title=ZooKeeperHiveLockManager Initialization} try { curatorFramework = CuratorFrameworkSingleton.getInstance(conf); parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE); try{ curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" + parent, new byte[0]); } catch (Exception e) { // ignore if the parent already exists if (!(e instanceof KeeperException) || ((KeeperException)e).code() != KeeperException.Code.NODEEXISTS) { LOG.warn("Unexpected ZK exception when creating parent node /" + parent, e); } } {code} https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/zookeeper/ZooKeeperHiveLockManager.java#L96-L106 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20847) Review of NullScan Code
BELUGA BEHR created HIVE-20847: -- Summary: Review of NullScan Code Key: HIVE-20847 URL: https://issues.apache.org/jira/browse/HIVE-20847 Project: Hive Issue Type: Improvement Components: Physical Optimizer Affects Versions: 3.1.0, 4.0.0 Reporter: BELUGA BEHR What got me looking at this class was the verboseness of some of the logging. I would like to request that we change this logging to DEBUG level, since this level of detail means nothing to a cluster admin. Also, this {{contains}} call would be better applied to a {{HashSet}} instead of an {{ArrayList}}. {code:java|title=NullScanTaskDispatcher.java} private void processAlias(MapWork work, Path path, ArrayList<String> aliasesAffected, ArrayList<String> aliases) { // the aliases that are allowed to map to a null scan. ArrayList<String> allowed = new ArrayList<String>(); for (String alias : aliasesAffected) { if (aliases.contains(alias)) { allowed.add(alias); } } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
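A sketch of the suggested change, with a hypothetical method name: copying the aliases into a {{HashSet}} makes each {{contains()}} check O(1) instead of a linear scan of the list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AliasFilter {
    // Build a HashSet once, then probe it per affected alias; each
    // contains() is a constant-time hash lookup rather than an O(n)
    // walk of an ArrayList.
    static List<String> allowedAliases(List<String> affected, List<String> aliases) {
        Set<String> aliasSet = new HashSet<>(aliases);
        List<String> allowed = new ArrayList<>();
        for (String alias : affected) {
            if (aliasSet.contains(alias)) {
                allowed.add(alias);
            }
        }
        return allowed;
    }
}
```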
[jira] [Created] (HIVE-20849) Review of ConstantPropagateProcFactory
BELUGA BEHR created HIVE-20849: -- Summary: Review of ConstantPropagateProcFactory Key: HIVE-20849 URL: https://issues.apache.org/jira/browse/HIVE-20849 Project: Hive Issue Type: Improvement Components: Logical Optimizer Affects Versions: 3.1.0, 4.0.0 Reporter: BELUGA BEHR Attachments: HIVE-20849.1.patch I was looking at this class because it blasts a lot of useless (to an admin) information to the logs. Especially if the table has a lot of columns, I see big blocks of logging that are meaningless to me. I request that the logging be toned down to debug, along with some other improvements to the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20894) Clean Up JDBC HiveQueryResultSet
BELUGA BEHR created HIVE-20894: -- Summary: Clean Up JDBC HiveQueryResultSet Key: HIVE-20894 URL: https://issues.apache.org/jira/browse/HIVE-20894 Project: Hive Issue Type: Improvement Components: JDBC Affects Versions: 4.0.0 Reporter: BELUGA BEHR -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20895) Cleanup JdbcColumn Class
BELUGA BEHR created HIVE-20895: -- Summary: Cleanup JdbcColumn Class Key: HIVE-20895 URL: https://issues.apache.org/jira/browse/HIVE-20895 Project: Hive Issue Type: Improvement Components: JDBC Affects Versions: 3.1.1, 4.0.0 Reporter: BELUGA BEHR -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20947) Add User Agent String to Hive Client API
BELUGA BEHR created HIVE-20947: -- Summary: Add User Agent String to Hive Client API Key: HIVE-20947 URL: https://issues.apache.org/jira/browse/HIVE-20947 Project: Hive Issue Type: New Feature Components: Clients, Diagnosability, JDBC, ODBC Affects Versions: 4.0.0 Reporter: BELUGA BEHR Allow users to specify a user agent string as part of their JDBC/ODBC connection string and print the information in the HS2 logs. This will allow us the opportunity to identify misbehaving clients. Variable: {{userAgent}} https://en.wikipedia.org/wiki/User_agent#Format_for_human-operated_web_browsers -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20956) Avro Result File Format
BELUGA BEHR created HIVE-20956: -- Summary: Avro Result File Format Key: HIVE-20956 URL: https://issues.apache.org/jira/browse/HIVE-20956 Project: Hive Issue Type: New Feature Components: HiveServer2 Affects Versions: 3.1.1, 4.0.0 Reporter: BELUGA BEHR *hive.query.result.fileformat* * Default Value: ** Hive 0.x, 1.x, and 2.0: {{TextFile}} ** Hive 2.1 onward: {{SequenceFile}} * Added In: Hive 0.7.0 with HIVE-1598 File format to use for a query's intermediate results. Options are TextFile, SequenceFile, and RCfile. Default value is changed to SequenceFile since Hive 2.1.0 (HIVE-1608). Add AVRO to this list -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21071) Improve getInputSummary
BELUGA BEHR created HIVE-21071: -- Summary: Improve getInputSummary Key: HIVE-21071 URL: https://issues.apache.org/jira/browse/HIVE-21071 Project: Hive Issue Type: Improvement Components: HiveServer2 Affects Versions: 3.1.1, 3.0.0, 4.0.0 Reporter: BELUGA BEHR There is a global lock in the {{getInputSummary}} code, so it is important that it be fast. The current implementation has quite a bit of overhead that can be re-engineered. For example, the current implementation keeps a map of File Path to ContentSummary object. This map is populated by several threads concurrently. The method then loops through the map, in a single thread, at the end to add up all of the ContentSummary objects, ignoring the paths. The code can be re-engineered to not use a map, or a collection at all, to store the results, and instead just keep a running tally. By keeping a tally, there is no O(n) operation at the end to perform the addition. There are other things that can be improved. For example, the method returns an object which is never used anywhere, so change the method to a void return type. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
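A sketch of the running-tally idea, assuming a hypothetical {{SummarySketch}} class (the field names mirror two of the values a Hadoop {{ContentSummary}} carries): worker threads add their sizes into shared {{LongAdder}} counters as they go, so no per-path map is kept and no O(n) summation pass is needed at the end.

```java
import java.util.concurrent.atomic.LongAdder;

public class SummarySketch {
    // LongAdder is designed for high-contention accumulation from many
    // threads; reading the total happens once, after the workers finish.
    final LongAdder length = new LongAdder();
    final LongAdder fileCount = new LongAdder();

    // Called by each worker thread as it summarizes a path; nothing is
    // retained per path, only the running totals.
    void record(long pathLength, long pathFiles) {
        length.add(pathLength);
        fileCount.add(pathFiles);
    }
}
```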
[jira] [Created] (HIVE-21073) Remove Extra String Object
BELUGA BEHR created HIVE-21073: -- Summary: Remove Extra String Object Key: HIVE-21073 URL: https://issues.apache.org/jira/browse/HIVE-21073 Project: Hive Issue Type: Improvement Affects Versions: 3.1.1, 4.0.0 Reporter: BELUGA BEHR {code} public static String generatePath(Path baseURI, String filename) { String path = new String(baseURI + Path.SEPARATOR + filename); return path; } public static String generateFileName(Byte tag, String bigBucketFileName) { String fileName = new String("MapJoin-" + tag + "-" + bigBucketFileName + suffix); return fileName; } {code} It's a bit odd to be performing string concatenation and then wrapping the results in a new string. This is creating superfluous String objects. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
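A sketch of the fix, with a hypothetical class name and with {{Path.SEPARATOR}} and the {{suffix}} field replaced by local stand-ins to keep the example self-contained: the concatenation already yields a new {{String}}, so the wrapping constructor can simply be dropped.

```java
public class PathNames {
    static final String SEPARATOR = "/"; // stands in for Path.SEPARATOR

    // Concatenation alone produces the String; no new String(...) wrapper.
    static String generatePath(String baseURI, String filename) {
        return baseURI + SEPARATOR + filename;
    }

    // suffix is passed in here; in the original it is a field.
    static String generateFileName(Byte tag, String bigBucketFileName, String suffix) {
        return "MapJoin-" + tag + "-" + bigBucketFileName + suffix;
    }
}
```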
[jira] [Created] (HIVE-21147) Remove Contrib RegexSerDe
BELUGA BEHR created HIVE-21147: -- Summary: Remove Contrib RegexSerDe Key: HIVE-21147 URL: https://issues.apache.org/jira/browse/HIVE-21147 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 4.0.0 Reporter: BELUGA BEHR Fix For: 4.0.0 https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java https://github.com/apache/hive/blob/ae008b79b5d52ed6a38875b73025a505725828eb/serde/src/java/org/apache/hadoop/hive/serde2/RegexSerDe.java Merge any differences in functionality and remove the version in the 'contrib' library. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21148) Use StandardCharsets Where Possible
BELUGA BEHR created HIVE-21148: -- Summary: Use StandardCharsets Where Possible Key: HIVE-21148 URL: https://issues.apache.org/jira/browse/HIVE-21148 Project: Hive Issue Type: Improvement Affects Versions: 4.0.0 Reporter: BELUGA BEHR Fix For: 4.0.0 Starting in Java 1.7, JDKs must support a set of standard charsets. When using this facility, a {{Charset}} constant is passed instead of the character set's name as a string, so there is no need to catch an {{UnsupportedEncodingException}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
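A minimal sketch of the pattern: passing {{StandardCharsets.UTF_8}} instead of the name "UTF-8" means the overloads taking a {{Charset}} are used, and those declare no checked {{UnsupportedEncodingException}}.

```java
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
    // Charset-constant overload: no try/catch, no lookup by name.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }
}
```

Compare with {{s.getBytes("UTF-8")}}, which forces callers to catch an exception that can never occur for a guaranteed charset.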
[jira] [Created] (HIVE-21175) Use StandardCharsets Where Possible (Part 2)
BELUGA BEHR created HIVE-21175: -- Summary: Use StandardCharsets Where Possible (Part 2) Key: HIVE-21175 URL: https://issues.apache.org/jira/browse/HIVE-21175 Project: Hive Issue Type: Improvement Affects Versions: 3.2.0 Reporter: BELUGA BEHR Fix For: 4.0.0 Additional work not already addressed by [HIVE-21148]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21179) Move SampleHBaseKeyFactory* Into Main Code Line
BELUGA BEHR created HIVE-21179: -- Summary: Move SampleHBaseKeyFactory* Into Main Code Line Key: HIVE-21179 URL: https://issues.apache.org/jira/browse/HIVE-21179 Project: Hive Issue Type: Improvement Components: HBase Handler Affects Versions: 3.1.0, 4.0.0 Reporter: BELUGA BEHR https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration {quote} "hbase.composite.key.factory" should be the fully qualified class name of a class implementing HBaseKeyFactory. See SampleHBaseKeyFactory2 for a fixed length example in the same package. This class must be on your classpath in order for the above example to work. TODO: place these in an accessible place; they're currently only in test code. {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21192) TestReplicationScenariosIncrementalLoadAcidTables Fails Regularly
BELUGA BEHR created HIVE-21192: -- Summary: TestReplicationScenariosIncrementalLoadAcidTables Fails Regularly Key: HIVE-21192 URL: https://issues.apache.org/jira/browse/HIVE-21192 Project: Hive Issue Type: Improvement Affects Versions: 4.0.0 Reporter: BELUGA BEHR Several of my patches are failing in YETUS due to the following unit test failure: {code} TestReplicationScenariosIncrementalLoadAcidTables - did not produce a TEST-*.xml file (likely timed out) (batchId=251) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21193) Support LZO Compression with CombineHiveInputFormat
BELUGA BEHR created HIVE-21193: -- Summary: Support LZO Compression with CombineHiveInputFormat Key: HIVE-21193 URL: https://issues.apache.org/jira/browse/HIVE-21193 Project: Hive Issue Type: Improvement Components: Compression Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR In regards to LZO compression with Hive... https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO It does not work out of the box if there are {{.lzo.index}} files present. As I understand it, this is because the default Hive input format, {{CombineHiveInputFormat}}, does not handle this correctly. It does not account for the mix of data files and index files: it lumps them all together when making the combined splits, and Mappers fail when they try to process the {{.lzo.index}} files as data. When using the original {{HiveInputFormat}}, it correctly identifies the {{.lzo.index}} files because it considers each file individually. Allow {{CombineHiveInputFormat}} to short-circuit LZO files and to not combine them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
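A sketch of the proposed short-circuit, using hypothetical names: pull the LZO data files and their {{.lzo.index}} companions out of the candidate set before combining, so the combine logic never treats an index file as data. How the excluded LZO paths are then handed to the non-combining path is left out here.

```java
import java.util.ArrayList;
import java.util.List;

public class LzoPathFilter {
    // Hypothetical filter: keep only paths that are safe to combine;
    // .lzo files and their .lzo.index companions are excluded so they
    // can be processed individually instead.
    static List<String> combinable(List<String> paths) {
        List<String> out = new ArrayList<>();
        for (String p : paths) {
            if (!p.endsWith(".lzo") && !p.endsWith(".lzo.index")) {
                out.add(p);
            }
        }
        return out;
    }
}
```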
[jira] [Created] (HIVE-21195) Review of DefaultGraphWalker Class
BELUGA BEHR created HIVE-21195: -- Summary: Review of DefaultGraphWalker Class Key: HIVE-21195 URL: https://issues.apache.org/jira/browse/HIVE-21195 Project: Hive Issue Type: Improvement Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR {code:java} protected final List<Node> toWalk = new ArrayList<Node>(); ... while (toWalk.size() > 0) { Node nd = toWalk.remove(0); {code} Every time this loop runs, the first item of the list is removed. For an {{ArrayList}}, each removal of the first item copies all of the remaining items down one position so that the first item is always at array index 0. This is expensive in a tight loop. Use a {{Queue}} implementation that does not have this behavior, such as {{ArrayDeque}}: {quote} This class is likely to be faster than Stack when used as a stack, and faster than LinkedList when used as a queue. {quote} https://docs.oracle.com/javase/7/docs/api/java/util/ArrayDeque.html Add a little extra cleanup while the class is being looked at. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
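The suggested replacement, sketched with strings as stand-ins for the walker's nodes: {{ArrayDeque.poll()}} dequeues the head in O(1), with no copy-down of the remaining elements the way {{ArrayList.remove(0)}} requires.

```java
import java.util.Queue;

public class WalkQueue {
    // Drain the queue in FIFO order; with an ArrayDeque each poll() is
    // constant time, unlike remove(0) on an ArrayList.
    static String drain(Queue<String> toWalk) {
        StringBuilder order = new StringBuilder();
        while (!toWalk.isEmpty()) {
            order.append(toWalk.poll());
        }
        return order.toString();
    }
}
```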
[jira] [Created] (HIVE-21210) CombineHiveInputFormat Thread Pool Sizing
BELUGA BEHR created HIVE-21210: -- Summary: CombineHiveInputFormat Thread Pool Sizing Key: HIVE-21210 URL: https://issues.apache.org/jira/browse/HIVE-21210 Project: Hive Issue Type: Improvement Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR Threadpools. Hive uses threadpools in several different places, and each implementation is a little different and requires different configurations. I think that Hive needs to rein in and standardize the way that threadpools are used, and threadpools should scale automatically without manual configuration. At any given time, there are many hundreds of threads running in the HS2 as the number of simultaneous connections increases, and they surely cause contention with one another. Here is an example: {code:java|title=CombineHiveInputFormat.java} // max number of threads we can use to check non-combinable paths private static final int MAX_CHECK_NONCOMBINABLE_THREAD_NUM = 50; private static final int DEFAULT_NUM_PATH_PER_THREAD = 100; {code} When building the splits for a MR job, there are up to 50 threads running per query, and there is not much scaling here: it's simply a 1 thread : 100 files ratio. This means that to process 5000 files, there are 50 threads; after that, 50 threads are still used. Many Hive jobs these days involve more than 5000 files, so it does not scale well at larger sizes. This is not configurable (even manually), it doesn't change when the hardware specs increase, and 50 threads seems like a lot when a service must support up to 80 connections: [https://www.cloudera.com/documentation/enterprise/5/latest/topics/admin_hive_tuning.html] Not to mention, I have never seen a scenario where HS2 is running on a host all by itself and has the entire system dedicated to it. Therefore it should be more friendly and spin up fewer threads.
I am attaching a patch here that provides a few features: * Common module that produces an {{ExecutorService}} which caps the number of threads it spins up at the number of processors the host has. Keep in mind that a class may submit as many work units ({{Callables}}) as it would like, but the number of threads in the pool is capped. * Common module for partitioning work, i.e., a generic framework for dividing work into partitions (batches). * Modify {{CombineHiveInputFormat}} to take advantage of both modules, performing the same duties in a more object-oriented way than is currently implemented. * Add a partitioning (batching) implementation that partitions a {{Collection}} based on the natural log of the {{Collection}} size, so that it scales more slowly than a simple 1:100 ratio. * Simplify unit test code for {{CombineHiveInputFormat}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
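The first and fourth bullet points above can be sketched roughly as follows. This is a hypothetical illustration of the patch description, not its actual code: {{newCappedPool}} and {{batchCount}} are made-up names, and the ln-based rule assumes the partition *count* grows with the natural log of the collection size.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkScalingSketch {
    // Pool capped at the host's processor count: callers may submit any
    // number of Callables, but only this many threads ever run at once.
    static ExecutorService newCappedPool() {
        return Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
    }

    // Number of work partitions grows with ln(n) rather than n/100,
    // so 5,000 items yield ~9 batches instead of 50.
    static int batchCount(int totalItems) {
        if (totalItems <= 1) {
            return totalItems;
        }
        return (int) Math.ceil(Math.log(totalItems));
    }
}
```

Under this reading, the thread count tracks the hardware instead of a hard-coded constant, and the batch count flattens out as the file count grows.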
[jira] [Created] (HIVE-21240) JSON SerDe Re-Write
BELUGA BEHR created HIVE-21240: -- Summary: JSON SerDe Re-Write Key: HIVE-21240 URL: https://issues.apache.org/jira/browse/HIVE-21240 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 3.1.1, 4.0.0 Reporter: BELUGA BEHR Fix For: 4.0.0 The JSON SerDe has a few issues; I will link them to this JIRA. * Use the Jackson tree parser instead of parsing manually * Added support for base-64 encoded data (the expected format when using JSON) * Added support to skip blank lines (returns all columns as null values) * The current JSON parser accepts, but does not apply, custom timestamp formats in most cases * Added some unit tests * Added a cache for column-name to column-index searches, currently O\(n\) for each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
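The last bullet (caching column-name to column-index searches) could look roughly like this; the class name and shape are a hypothetical sketch, not the actual SerDe code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnIndexCache {
    private final List<String> columnNames;
    private final Map<String, Integer> indexCache = new HashMap<>();

    public ColumnIndexCache(List<String> columnNames) {
        this.columnNames = columnNames;
    }

    // The first lookup for a name is an O(n) list scan; every repeat
    // lookup (i.e. every subsequent row) is a single hash probe.
    public int indexOf(String name) {
        return indexCache.computeIfAbsent(name, columnNames::indexOf);
    }
}
```

Since a table's column list is fixed for the life of the SerDe, the cache turns the per-row, per-column scan into amortized O(1).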
[jira] [Created] (HIVE-21241) Migrate TimeStamp Parser From Joda Time
BELUGA BEHR created HIVE-21241: -- Summary: Migrate TimeStamp Parser From Joda Time Key: HIVE-21241 URL: https://issues.apache.org/jira/browse/HIVE-21241 Project: Hive Issue Type: Improvement Components: HiveServer2 Affects Versions: 3.2.0 Reporter: BELUGA BEHR Fix For: 4.0.0 Hive uses Joda time for its TimeStampParser. {quote} Joda-Time is the de facto standard date and time library for Java prior to Java SE 8. Users are now asked to migrate to java.time (JSR-310). https://www.joda.org/joda-time/ {quote} Migrate TimeStampParser to {{java.time}} I also added a couple new pre-canned timestamp parsers for convenience: * ISO 8601 * RFC 1123 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
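For illustration, the two pre-canned formats map naturally onto constants that {{java.time}} already ships; a minimal sketch (the wrapper class is hypothetical, not the actual TimestampParser API):

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class TimestampParseSketch {
    // java.time ships both target formats as ready-made constants.
    static final DateTimeFormatter ISO_8601 =
            DateTimeFormatter.ISO_OFFSET_DATE_TIME;
    static final DateTimeFormatter RFC_1123 =
            DateTimeFormatter.RFC_1123_DATE_TIME;

    // Parse text with the given formatter; return seconds since epoch.
    static long epochSecond(String text, DateTimeFormatter fmt) {
        return ZonedDateTime.parse(text, fmt).toEpochSecond();
    }
}
```

Unlike Joda-Time, these formatters are immutable and thread-safe by design, so they can be shared as constants without defensive copying.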
[jira] [Created] (HIVE-21242) Calcite Planner Logging Indicates UTF-16 Encoding
BELUGA BEHR created HIVE-21242: -- Summary: Calcite Planner Logging Indicates UTF-16 Encoding Key: HIVE-21242 URL: https://issues.apache.org/jira/browse/HIVE-21242 Project: Hive Issue Type: Improvement Components: CBO Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR I noticed some debug logging from Calcite and it is using UTF-16. I would expect UTF-8. {code}
2019-02-10T19:08:06,393 DEBUG [7db4d3c5-0f88-49db-88fa-ad6428c23784 main] parse.CalcitePlanner: Plan after decorrelation:
HiveSortLimit(offset=[0], fetch=[2])
  HiveProject(_o__c0=[array(3, 2, 1)], _o__c1=[map(1, 2001-01-01, 2, null)], _o__c2=[named_struct(_UTF-16LE'c1', 123456, _UTF-16LE'c2', _UTF-16LE'hello', _UTF-16LE'c3', array(_UTF-16LE'aa', _UTF-16LE'bb', _UTF-16LE'cc'), _UTF-16LE'c4', map(_UTF-16LE'abc', 123, _UTF-16LE'xyz', 456), _UTF-16LE'c5', named_struct(_UTF-16LE'c5_1', _UTF-16LE'bye', _UTF-16LE'c5_2', 88))])
    HiveTableScan(table=[[default, src]], table:alias=[src])
{code} I'm not sure if this is a Calcite internal thing which can be configured, or if it is only an artifact of the way the logging works. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21243) Create Default DB in default.db
BELUGA BEHR created HIVE-21243: -- Summary: Create Default DB in default.db Key: HIVE-21243 URL: https://issues.apache.org/jira/browse/HIVE-21243 Project: Hive Issue Type: Improvement Reporter: BELUGA BEHR When a database is created in Hive, it is stored in {{/user/hive/warehouse/[MyDatabase.db]/}}. It is very confusing that the Hive default database is not located in {{/user/hive/warehouse/[default.db]/}}. Please address this and make it consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21245) Make hive.fetch.output.serde Default to LazySimpleSerde
BELUGA BEHR created HIVE-21245: -- Summary: Make hive.fetch.output.serde Default to LazySimpleSerde Key: HIVE-21245 URL: https://issues.apache.org/jira/browse/HIVE-21245 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR Fix For: 4.0.0 For all intents and purposes, it already is: {code:java|title=HiveSessionImpl.java} private static final String FETCH_WORK_SERDE_CLASS = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"; @Override public HiveConf getHiveConf() { sessionConf.setVar(HiveConf.ConfVars.HIVEFETCHOUTPUTSERDE, FETCH_WORK_SERDE_CLASS); return sessionConf; } {code} https://github.com/apache/hive/blob/master/service/src/java/org/apache/hive/service/cli/session/HiveSessionImpl.java#L489-L492 Ultimately, I'd like to get rid of {{org.apache.hadoop.hive.serde2.DelimitedJSONSerDe}} altogether. It's a weird thing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21246) Un-bury DelimitedJSONSerDe from PlanUtils.java
BELUGA BEHR created HIVE-21246: -- Summary: Un-bury DelimitedJSONSerDe from PlanUtils.java Key: HIVE-21246 URL: https://issues.apache.org/jira/browse/HIVE-21246 Project: Hive Issue Type: Improvement Reporter: BELUGA BEHR Attachments: HIVE-21246.1.patch Ultimately, I'd like to get rid of {{org.apache.hadoop.hive.serde2.DelimitedJSONSerDe}}, but for now I am trying to make it easier to remove later. It is currently buried in {{PlanUtils.java}}: a SerDe and a flag get passed into the utilities, and if the flag is set, the passed-in SerDe is silently overwritten. This is not documented anywhere and it is a strange design; callers should simply pass in the SerDe they want to use, instead of a SerDe they do not want plus a flag that swaps it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21252) LazyTimestamp - Use String Equals
BELUGA BEHR created HIVE-21252: -- Summary: LazyTimestamp - Use String Equals Key: HIVE-21252 URL: https://issues.apache.org/jira/browse/HIVE-21252 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR {code:java|title=LazyTimestamp.java}
if (s.compareTo("NULL") == 0) {
  isNull = true;
  logExceptionMessage(bytes, start, length, "TIMESTAMP");
}
{code} {{compareTo}} computes a number representing the ordering of the two Strings. It is faster to simply call {{equals}}, which compares the two Strings directly (short-circuiting on a length mismatch) and returns a boolean. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
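The fix itself is tiny; a sketch of the replacement check, using the literal-first form (which also tolerates a null input, something the original {{compareTo}} call would not):

```java
public class NullLiteralCheck {
    // "NULL".equals(s) compares the two Strings directly and returns a
    // boolean, and is safe even when s is null; s.compareTo("NULL")
    // computes an ordering value and would throw on a null s.
    static boolean isNullLiteral(String s) {
        return "NULL".equals(s);
    }
}
```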
[jira] [Created] (HIVE-21258) Streamline and Add Warning to PrimitiveObjectInspectorFactory.java
BELUGA BEHR created HIVE-21258: -- Summary: Streamline and Add Warning to PrimitiveObjectInspectorFactory.java Key: HIVE-21258 URL: https://issues.apache.org/jira/browse/HIVE-21258 Project: Hive Issue Type: Improvement Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR I just got bitten by something pretty good, so I would like to propose adding a developer-facing warning to the logs to avoid this situation again. Also, fix up the related cache a bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21264) Improvements Around CharTypeInfo
BELUGA BEHR created HIVE-21264: -- Summary: Improvements Around CharTypeInfo Key: HIVE-21264 URL: https://issues.apache.org/jira/browse/HIVE-21264 Project: Hive Issue Type: Improvement Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR The {{CharTypeInfo}} stores the type name of the data type (char/varchar) and the length (1-255). {{CharTypeInfo}} objects are often cached once they are created. The {{hashCode()}} and {{equals()}} of its sub-classes varchar and char are inconsistent in this regard. * Make {{hashCode}} and {{equals}} consistent (and fast) * Simplify the {{getQualifiedName}} implementation and reduce its scope to protected * Other related nits -- This message was sent by Atlassian JIRA (v7.6.3#76005)
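The consistency rule the first bullet asks for can be sketched with a hypothetical stand-in class (not the real {{CharTypeInfo}}): {{equals()}} and {{hashCode()}} must be driven by the same two fields, type name and length, so equal objects always share a hash bucket in the cache.

```java
import java.util.Objects;

public class CharTypeInfoSketch {
    private final String typeName; // "char" or "varchar"
    private final int length;      // 1-255

    public CharTypeInfoSketch(String typeName, int length) {
        this.typeName = typeName;
        this.length = length;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof CharTypeInfoSketch)) {
            return false;
        }
        CharTypeInfoSketch o = (CharTypeInfoSketch) other;
        // exactly the same fields that hashCode() uses
        return length == o.length && typeName.equals(o.typeName);
    }

    @Override
    public int hashCode() {
        // exactly the same fields that equals() uses
        return Objects.hash(typeName, length);
    }
}
```

If the two methods disagree on which fields they consult, a cache keyed by these objects can miss entries that are "equal" or, worse, hand back the wrong one.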
[jira] [Created] (HIVE-21275) Lower Logging Level in Operator Class
BELUGA BEHR created HIVE-21275: -- Summary: Lower Logging Level in Operator Class Key: HIVE-21275 URL: https://issues.apache.org/jira/browse/HIVE-21275 Project: Hive Issue Type: Improvement Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR There is an incredible amount of logging generated by the {{Operator}} during the Q-Tests. I counted more than 1 *million* lines of pretty useless logging. Please lower to TRACE level. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21277) Make HBaseSerde First-Class SerDe
BELUGA BEHR created HIVE-21277: -- Summary: Make HBaseSerde First-Class SerDe Key: HIVE-21277 URL: https://issues.apache.org/jira/browse/HIVE-21277 Project: Hive Issue Type: New Feature Components: HBase Handler Reporter: BELUGA BEHR Make HBase integration with Hive first class. {code:sql} CREATE TABLE...STORED AS HBASE; {code} https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21289) Expect EQ and LIKE to Generate the Identical Explain Plans
BELUGA BEHR created HIVE-21289: -- Summary: Expect EQ and LIKE to Generate the Identical Explain Plans Key: HIVE-21289 URL: https://issues.apache.org/jira/browse/HIVE-21289 Project: Hive Issue Type: Improvement Components: Logical Optimizer Affects Versions: 2.3.4 Reporter: BELUGA BEHR I generated some test data with the UUID function. {code:sql} explain select * from test_like where a like 'abce6254-d437-426b-8873-2cbc153ddfbc'; explain select * from test_like where a = 'abce6254-d437-426b-8873-2cbc153ddfbc'; {code} {code|title=LIKE}
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_like
            filterExpr: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
            Statistics: Num rows: 262144 Data size: 9437184 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
              Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: a (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code} {code|title=EQ}
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_like
            filterExpr: (a = 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
            Statistics: Num rows: 262144 Data size: 9437184 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (a = 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
              Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: 'abce6254-d437-426b-8873-2cbc153ddfbc' (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code} They may be the same under the covers, but I would expect the EXPLAIN plan to be exactly the same. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21303) Update TextRecordReader
BELUGA BEHR created HIVE-21303: -- Summary: Update TextRecordReader Key: HIVE-21303 URL: https://issues.apache.org/jira/browse/HIVE-21303 Project: Hive Issue Type: Improvement Components: Query Processor Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR Assignee: BELUGA BEHR Attachments: HIVE-21303.1.patch Remove use of the deprecated {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}. For every call to {{next}}, the code dives into the configuration map to see if this feature is enabled. Just look it up once and cache the value. {code:java}
public int next(Writable row) throws IOException {
  ...
  if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
    return HiveUtils.unescapeText((Text) row);
  }
  return bytesConsumed;
}
{code} Other cleanup. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
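The caching idea can be sketched like this; the class is a hypothetical stand-in (a plain {{Map}} instead of {{HiveConf}}, and the key string is my reading of what {{HIVESCRIPTESCAPE}} resolves to):

```java
import java.util.Map;

public class RecordReaderSketch {
    // Looked up once at construction instead of on every next() call.
    private final boolean scriptEscape;

    public RecordReaderSketch(Map<String, String> conf) {
        this.scriptEscape = Boolean.parseBoolean(
                conf.getOrDefault("hive.transform.escape.input", "false"));
    }

    // Per-record hot path: a field read, not a configuration lookup.
    public int next(byte[] row) {
        if (scriptEscape) {
            // unescape logic would run here
        }
        return row.length;
    }
}
```

Since the configuration cannot change mid-read, hoisting the lookup out of the per-record loop is behavior-preserving.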
[jira] [Created] (HIVE-21317) Unit Test kafka_storage_handler Is Failing Regularly
BELUGA BEHR created HIVE-21317: -- Summary: Unit Test kafka_storage_handler Is Failing Regularly Key: HIVE-21317 URL: https://issues.apache.org/jira/browse/HIVE-21317 Project: Hive Issue Type: Task Affects Versions: 4.0.0 Reporter: BELUGA BEHR {code} org.apache.hadoop.hive.cli.TestMiniHiveKafkaCliDriver.testCliDriver[kafka_storage_handler] (batchId=275) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21321) Remove Class HiveIOExceptionHandlerChain
BELUGA BEHR created HIVE-21321: -- Summary: Remove Class HiveIOExceptionHandlerChain Key: HIVE-21321 URL: https://issues.apache.org/jira/browse/HIVE-21321 Project: Hive Issue Type: Task Affects Versions: 4.0.0 Reporter: BELUGA BEHR I recently stumbled upon this code while tracking down an issue: {{HiveIOExceptionHandlerChain.java}} Is anyone using this feature? It has a configuration associated with it, {{hive.io.exception.handlers}}. The code does not seem to have any unit tests. Can this feature simply be removed? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21328) Call To Hadoop Text getBytes() Without Call to getLength()
BELUGA BEHR created HIVE-21328: -- Summary: Call To Hadoop Text getBytes() Without Call to getLength() Key: HIVE-21328 URL: https://issues.apache.org/jira/browse/HIVE-21328 Project: Hive Issue Type: Bug Components: Query Planning Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR I'm not sure if there is actually a bug, but this looks highly suspect: {code:java}
public Object set(final Object o, final Text text) {
  return new BytesWritable(text == null ? null : text.getBytes());
}
{code} https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/primitive/ParquetStringInspector.java#L104-L106 A Text object has two components: the internal byte array and the number of valid bytes in it. The two are independent; e.g., a quick "reset" on the Text object simply sets the internal length counter to zero. This code is potentially looking at stale data that it should not be seeing, because it does not consider the length of the Text. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
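To illustrate with a self-contained stand-in (no Hadoop classes here), the safe pattern is to honor the valid length rather than the backing array's full size; in the real code that would mean pairing {{text.getBytes()}} with {{text.getLength()}}:

```java
import java.util.Arrays;

public class TextBytesSketch {
    // `backing` may be longer than `length`: after a reset() or a
    // shorter re-use, bytes past `length` are stale leftovers from an
    // earlier, longer value and must not be exposed.
    static byte[] validBytes(byte[] backing, int length) {
        return Arrays.copyOf(backing, length);
    }
}
```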
[jira] [Created] (HIVE-21347) Store Partition Count in TBLS
BELUGA BEHR created HIVE-21347: -- Summary: Store Partition Count in TBLS Key: HIVE-21347 URL: https://issues.apache.org/jira/browse/HIVE-21347 Project: Hive Issue Type: Improvement Components: Metastore Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR Please store a count of the number of partitions each table has in the {{TBLS}} table. This will allow very quick lookups for tables with many partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21352) Drop INDEX from 3.0 Schema
BELUGA BEHR created HIVE-21352: -- Summary: Drop INDEX from 3.0 Schema Key: HIVE-21352 URL: https://issues.apache.org/jira/browse/HIVE-21352 Project: Hive Issue Type: Improvement Components: Metastore Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR We dropped support for Hive indexes starting in 3.0; however, there are still tables in the Metastore schema to support them. Please remove them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21354) Lock The Entire Table If Majority Of Partitions Are Locked
BELUGA BEHR created HIVE-21354: -- Summary: Lock The Entire Table If Majority Of Partitions Are Locked Key: HIVE-21354 URL: https://issues.apache.org/jira/browse/HIVE-21354 Project: Hive Issue Type: Improvement Components: HiveServer2 Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR One of the bottlenecks of any Hive query is the ZooKeeper locking mechanism. When a Hive query interacts with a table which has a lot of partitions, this may put a lot of stress on the ZK system. Please add a heuristic that works like this: # Count the number of partitions that a query is required to lock # Obtain the total number of partitions in the table # If the number of partitions accessed by the query is greater than or equal to half the total number of partitions, simply create one ZNode lock at the table level. This would improve performance of many queries, but in particular, a {{select count(1) from table}} ... or ... {{select * from table limit 5}} where the table has many partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
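The three steps above reduce to a single comparison; a minimal sketch under the stated heuristic (names are illustrative, not Hive's lock manager API):

```java
public class TableLockHeuristic {
    // True when the query touches at least half the table's partitions,
    // in which case one table-level ZNode replaces many partition locks.
    static boolean lockWholeTable(int partitionsToLock, int totalPartitions) {
        if (totalPartitions == 0) {
            return false; // unpartitioned tables are handled elsewhere
        }
        return partitionsToLock * 2 >= totalPartitions;
    }
}
```

The trade-off is coarser locking (a table lock blocks writers to partitions the query never reads) in exchange for far fewer ZooKeeper operations on wide scans.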
[jira] [Created] (HIVE-21356) Upgrade Jackson to 2.9.8
BELUGA BEHR created HIVE-21356: -- Summary: Upgrade Jackson to 2.9.8 Key: HIVE-21356 URL: https://issues.apache.org/jira/browse/HIVE-21356 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR Fix For: 4.0.0 Currently at: {code} 2.9.5 {code} Upgrade to 2.9.8 - contains some improvements for processing Base64 data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3
BELUGA BEHR created HIVE-21370: -- Summary: JsonSerDe cannot handle json file with empty lines - Branch 3 Key: HIVE-21370 URL: https://issues.apache.org/jira/browse/HIVE-21370 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 3.1.0, 3.0.0, 3.2.0 Reporter: BELUGA BEHR Fix For: 3.2.0 Attachments: HIVE-21370.1.patch -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-21371) Make NonSyncByteArrayOutputStream Overflow Conscious
BELUGA BEHR created HIVE-21371: -- Summary: Make NonSyncByteArrayOutputStream Overflow Conscious Key: HIVE-21371 URL: https://issues.apache.org/jira/browse/HIVE-21371 Project: Hive Issue Type: Improvement Affects Versions: 4.0.0, 3.2.0 Reporter: BELUGA BEHR Attachments: HIVE-21371.1.patch {code:java|title=NonSyncByteArrayOutputStream}
private int enLargeBuffer(int increment) {
  int temp = count + increment;
  int newLen = temp;
  if (temp > buf.length) {
    if ((buf.length << 1) > temp) {
      newLen = buf.length << 1;
    }
    byte newbuf[] = new byte[newLen];
    System.arraycopy(buf, 0, newbuf, 0, count);
    buf = newbuf;
  }
  return newLen;
}
{code} The arithmetic here is not overflow conscious: {{count + increment}} and {{buf.length << 1}} both overflow {{int}} once sizes approach the array ceiling (Java arrays are capped at {{Integer.MAX_VALUE}} elements, roughly 2 GiB of bytes, not 4 GiB), so the method can compute a negative or too-small length instead of growing the buffer. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
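An overflow-conscious growth computation, in the spirit of the JDK's own collection growth code, might look like this sketch (names hypothetical): compute the doubled size, let it go negative if it overflows, and clamp against the real array ceiling instead of trusting the shift.

```java
public class GrowthSketch {
    // HotSpot refuses allocations slightly below Integer.MAX_VALUE,
    // so leave the customary headroom.
    static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;

    static int newCapacity(int oldCapacity, int minRequired) {
        int doubled = oldCapacity + oldCapacity; // may overflow negative
        int newLen = Math.max(doubled, minRequired);
        if (newLen < 0 || newLen > MAX_ARRAY_LENGTH) {
            // doubling overflowed or exceeded the cap: fall back
            if (minRequired < 0 || minRequired > MAX_ARRAY_LENGTH) {
                throw new OutOfMemoryError("Required array size too large");
            }
            newLen = MAX_ARRAY_LENGTH;
        }
        return newLen;
    }
}
```

Below 1 GiB this behaves exactly like plain doubling; past that point it degrades to "exactly what was asked for" and finally to the hard cap, rather than wrapping negative.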
[jira] [Created] (HIVE-21372) Use Apache Commons IO To Read Stream To String
BELUGA BEHR created HIVE-21372: -- Summary: Use Apache Commons IO To Read Stream To String Key: HIVE-21372 URL: https://issues.apache.org/jira/browse/HIVE-21372 Project: Hive Issue Type: Improvement Reporter: BELUGA BEHR Fix For: 4.0.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
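For reference, the Commons IO call this proposes is {{IOUtils.toString(InputStream, Charset)}}; below is a stdlib-only equivalent (Java 9+) for comparison with hand-rolled read loops:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamToString {
    // Same effect as Commons IO's IOUtils.toString(in, UTF_8), using
    // only the JDK: drain the stream, then decode once.
    static String toString(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Either form replaces the error-prone pattern of reading chunks into a {{StringBuilder}} with a hand-managed buffer and charset.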
[jira] [Created] (HIVE-17732) org.apache.hive.hcatalog.data.JsonSerDe.java
BELUGA BEHR created HIVE-17732: -- Summary: org.apache.hive.hcatalog.data.JsonSerDe.java Key: HIVE-17732 URL: https://issues.apache.org/jira/browse/HIVE-17732 Project: Hive Issue Type: Improvement Components: Serializers/Deserializers Affects Versions: 3.0.0 Reporter: BELUGA BEHR Priority: Trivial Some simple improvements for org.apache.hive.hcatalog.data.JsonSerDe: remove superfluous logging and cut down on object instantiation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17740) HiveConf 0 Use SLF4J Parameterization
BELUGA BEHR created HIVE-17740: -- Summary: HiveConf 0 Use SLF4J Parameterization Key: HIVE-17740 URL: https://issues.apache.org/jira/browse/HIVE-17740 Project: Hive Issue Type: Improvement Components: Configuration, Hive Reporter: BELUGA BEHR Priority: Trivial {{org.apache.hadoop.hive.conf.HiveConf}} # Parameterize the SLF4J logging and refactor log variable name to align with rest of code base # Couple of small nit-picks -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17742) AccumuloIndexedOutputFormat Use SLF4J
BELUGA BEHR created HIVE-17742: -- Summary: AccumuloIndexedOutputFormat Use SLF4J Key: HIVE-17742 URL: https://issues.apache.org/jira/browse/HIVE-17742 Project: Hive Issue Type: Improvement Components: Accumulo Storage Handler Affects Versions: 3.0.0 Reporter: BELUGA BEHR Priority: Trivial {{org.apache.hadoop.hive.accumulo.mr.AccumuloIndexedOutputFormat}} # Change to use SLF4J instead of core Log4J classes # Use SLF4J parameterization -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17793) Parameterize Logging Messages
BELUGA BEHR created HIVE-17793: -- Summary: Parameterize Logging Messages Key: HIVE-17793 URL: https://issues.apache.org/jira/browse/HIVE-17793 Project: Hive Issue Type: Improvement Components: Hive Affects Versions: 3.0.0 Reporter: BELUGA BEHR Assignee: BELUGA BEHR Priority: Trivial * Use SLF4J parameterized logging * Remove use of the archaic utility method {{stringifyException}} and simply allow the logging framework to handle formatting of the output; this also saves building an error message only to throw it away when the logging level is set higher than the message's level * Add some {{LOG.isDebugEnabled}} guards around complex debug messages -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17799) Add Ellipsis For Truncated Query In Hive Lock
BELUGA BEHR created HIVE-17799: -- Summary: Add Ellipsis For Truncated Query In Hive Lock Key: HIVE-17799 URL: https://issues.apache.org/jira/browse/HIVE-17799 Project: Hive Issue Type: Improvement Components: Hive Affects Versions: 3.0.0 Reporter: BELUGA BEHR Assignee: BELUGA BEHR Priority: Trivial [HIVE-16334] introduced truncation for storing queries in ZK lock nodes. This Jira is to add ellipsis into the query to let the operator know that truncation has occurred and therefore they will not find the specific query in their logs, only a prefix match will work. {code:sql} -- Truncation of query may be confusing to operator -- Without truncation SELECT * FROM TABLE WHERE COL=1 -- With truncation (will not find this query in workload) SELECT * FROM TABLE {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17901) org.apache.hadoop.hive.ql.exec.Utilities - Use Logging Parameterization and More
BELUGA BEHR created HIVE-17901: -- Summary: org.apache.hadoop.hive.ql.exec.Utilities - Use Logging Parameterization and More Key: HIVE-17901 URL: https://issues.apache.org/jira/browse/HIVE-17901 Project: Hive Issue Type: Improvement Components: Hive Affects Versions: 3.0.0 Reporter: BELUGA BEHR Priority: Minor {{org.apache.hadoop.hive.ql.exec.Utilities}} # Remove unused imports # Remove unused variables # Modify logging to use logging parameterization # Other small tweaks -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17911) org.apache.hadoop.hive.metastore.ObjectStore - Tune Up
BELUGA BEHR created HIVE-17911: -- Summary: org.apache.hadoop.hive.metastore.ObjectStore - Tune Up Key: HIVE-17911 URL: https://issues.apache.org/jira/browse/HIVE-17911 Project: Hive Issue Type: Improvement Components: Hive Affects Versions: 3.0.0 Reporter: BELUGA BEHR Priority: Minor # Remove unused variables # Add logging parameterization # Use CollectionUtils.isEmpty/isNotEmpty to simplify and unify collection empty check (and always use null check) # Minor tweaks -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17912) org.apache.hadoop.hive.metastore.security.DBTokenStore - Parameterize Logging
BELUGA BEHR created HIVE-17912: -- Summary: org.apache.hadoop.hive.metastore.security.DBTokenStore - Parameterize Logging Key: HIVE-17912 URL: https://issues.apache.org/jira/browse/HIVE-17912 Project: Hive Issue Type: Improvement Components: HiveServer2 Affects Versions: 3.0.0 Reporter: BELUGA BEHR Priority: Trivial -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17941) Don't Re-Create Running Job Client During Status Checks
BELUGA BEHR created HIVE-17941: -- Summary: Don't Re-Create Running Job Client During Status Checks Key: HIVE-17941 URL: https://issues.apache.org/jira/browse/HIVE-17941 Project: Hive Issue Type: Improvement Components: HiveServer2 Affects Versions: 3.0.0, 2.3.1 Reporter: BELUGA BEHR {code:java|title=org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper}
while (!rj.isComplete()) {
  ...
  RunningJob newRj = jc.getJob(rj.getID());
  if (newRj == null) {
    // under exceptional load, hadoop may not be able to look up status
    // of finished jobs (because it has purged them from memory). From
    // hive's perspective - it's equivalent to the job having failed.
    // So raise a meaningful exception
    throw new IOException("Could not find status of job:" + rj.getID());
  } else {
    th.setRunningJob(newRj);
    rj = newRj;
  }
}
...
}
{code} https://github.com/apache/hive/blob/a9f25c0e7ad3f81a9f00f601947a161516e33f1b/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/HadoopJobExecHelper.java#L295-L306 Every time we loop here for a status update, we rebuild the RunningJob object to test whether the job information is still loaded in YARN. Rebuilding this RunningJob object is not trivial, because it requires re-loading and parsing the job configuration XML file every time.
{code:java|title=Outdated Stacktrace But Same Idea Holds}
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:120)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1924)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1877)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1785)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:712)
at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1951)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:398)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:388)
at org.apache.hadoop.mapred.JobClient$NetworkedJob.<init>(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient.getJob(JobClient.java:655)
at org.apache.hadoop.mapred.JobClient.getJob(JobClient.java:668)
at org.apache.hadoop.hive.ql.exec.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:282)
at org.apache.hadoop.hive.ql.exec.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:532)
{code} Maybe we can use {{isRetired()}} instead for this particular check. We also probably need to be better about checking the return values of the {{RunningJob}} methods if it is the case that they can fail or go away at any time when YARN purges the information. It seems that perhaps this was an attempt to detect a purged job before exercising the {{RunningJob}} object... even though it can go bad at any point. https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/mapred/RunningJob.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)