[jira] [Created] (HIVE-26941) Make SetProcessor configurable to ignore some set variables
Miklos Szurap created HIVE-26941:

Summary: Make SetProcessor configurable to ignore some set variables
Key: HIVE-26941
URL: https://issues.apache.org/jira/browse/HIVE-26941
Project: Hive
Issue Type: New Feature
Components: Configuration, Hive
Reporter: Miklos Szurap
Assignee: Miklos Szurap

In certain environments, after upgrades, we need to prevent users from changing some Hive configurations at runtime (for example "mapreduce.job.queuename" or "hive.execution.engine"). The "hive.security.authorization.sqlstd.confwhitelist" could be used for this, however:
* it is sometimes complex to modify (for example to exclude a config that is otherwise allowed by a wildcard)
* when a user script tries to set a parameter that is not in the "hive.security.authorization.sqlstd.confwhitelist", the whole script fails with "Error: Error while processing statement: Cannot modify at runtime. It is not in list of params that are allowed to be modified at runtime". This would require all user scripts and jobs to be modified (to remove that "set" command), which can be a huge effort.

With a new configuration item in hive-site.xml, cluster operators can configure HiveServer2 to ignore "set" command requests for the listed variables, essentially making those settings "final" at the HiveServer2 level. Trying to change these "final" settings would not fail the scripts; the request would just be ignored.

In this jira:
- add a new config "hive.conf.ignored.variable.list", which accepts a comma-separated list of variable names
- the config is empty by default and can be set in hive-site.xml only
- add "hive.conf.ignored.variable.list" to the restricted list ("hive.conf.restricted.list") internally, so it cannot be modified at runtime
- add tests for the changes

--
This message was sent by Atlassian Jira (v8.20.10#820010)
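The proposed behavior can be illustrated with a minimal Java sketch. It is not Hive's SetProcessor; the class `IgnoringConf` and its methods are hypothetical names used only to show the idea: a comma-separated ignore list is parsed once, and a "set" request for a listed variable is dropped silently instead of raising an error.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for a session configuration; not Hive's SetProcessor.
final class IgnoringConf {
    private final Set<String> ignored;
    private final Map<String, String> values = new LinkedHashMap<>();

    // ignoredList mimics the value of the proposed "hive.conf.ignored.variable.list".
    IgnoringConf(String ignoredList) {
        this.ignored = Arrays.stream(ignoredList.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    // Returns true if the value was applied, false if the request was silently ignored.
    boolean set(String key, String value) {
        String name = key.trim();
        if (ignored.contains(name)) {
            return false; // no exception: the calling script keeps running
        }
        values.put(name, value);
        return true;
    }

    // Returns the stored value, or null if the variable was never set.
    String get(String key) {
        return values.get(key.trim());
    }
}
```

With an empty list (the proposed default) every "set" request is applied, so existing behavior would be unchanged unless an operator opts in.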
[jira] [Created] (HIVE-26629) Misleading error message with hive.metastore.limit.partition.request
Miklos Szurap created HIVE-26629:

Summary: Misleading error message with hive.metastore.limit.partition.request
Key: HIVE-26629
URL: https://issues.apache.org/jira/browse/HIVE-26629
Project: Hive
Issue Type: Bug
Components: HiveServer2, Metastore
Reporter: Miklos Szurap

Dropping partitions from a table fails with a misleading error message saying "Partition not found":
{code}
0: jdbc:hive2://nightly-71x-zx-1.nightly-71x-> alter table t1p drop partition (p1>0);
Error: Error while compiling statement: FAILED: SemanticException [Error 10006]: Partition not found (p1 > 0) (state=42000,code=10006)
{code}
However, the partitions do exist. The real error message is visible only in the HiveServer2 logs:
{code}
Caused by: MetaException(message:Number of partitions scanned (=2) on table 't1p' exceeds limit (=1). This is controlled on the metastore server by hive.metastore.limit.partition.request.)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by_expr_result$get_partitions_by_expr_resultStandardScheme.read(ThriftHiveMetastore.java)
...
{code}
Hive should surface the real error message to the user, namely that the "hive.metastore.limit.partition.request" limit has been reached.

This happens only when "hive.metastore.limit.partition.request" is set. I haven't verified other types of queries; other queries may fail in a similar way.
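One generic way to surface a hidden root cause is to walk the exception's cause chain before building the user-facing message. The following Java sketch assumes the original exception is still attached as a cause; the helper `RootCause.deepestMessage` is a hypothetical name and is not part of Hive.

```java
// Hypothetical helper; not part of Hive.
final class RootCause {
    private RootCause() {}

    // Walks the cause chain and returns the message of the innermost
    // exception that has one, or null if no exception carries a message.
    static String deepestMessage(Throwable t) {
        String message = null;
        int depth = 0;
        // The depth bound guards against pathological cyclic cause chains.
        for (Throwable cur = t; cur != null && depth < 100; cur = cur.getCause(), depth++) {
            if (cur.getMessage() != null) {
                message = cur.getMessage();
            }
        }
        return message;
    }
}
```

In the scenario above, such a helper would return the "Number of partitions scanned ..." text rather than "Partition not found", provided the MetaException is preserved as the cause.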
[jira] [Created] (HIVE-25879) MetaStoreDirectSql test query should not query the DBS table
Miklos Szurap created HIVE-25879:

Summary: MetaStoreDirectSql test query should not query the DBS table
Key: HIVE-25879
URL: https://issues.apache.org/jira/browse/HIVE-25879
Project: Hive
Issue Type: Bug
Reporter: Miklos Szurap

The runTestQuery() method in org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java uses the test query
{code:java}
select "DB_ID" from "DBS"{code}
to determine whether direct SQL can be used. In larger deployments with many (10k+) Hive databases it would be more efficient to query a small table instead; for example, the "VERSION" table should always contain only a single row.
[jira] [Created] (HIVE-25466) Trim spaces from db and table names
Miklos Szurap created HIVE-25466:

Summary: Trim spaces from db and table names
Key: HIVE-25466
URL: https://issues.apache.org/jira/browse/HIVE-25466
Project: Hive
Issue Type: Bug
Components: Parser, SQL
Reporter: Miklos Szurap

If we create databases and tables with leading or trailing whitespace (using backticks), the behavior is inconsistent and leads to multiple problems.

Creating a database with trailing spaces makes the spaces part of the database name; from then on the name must be used with backticks.
{code}
0: jdbc:hive2://hs2> create database `mydb1 `;
INFO : OK
0: jdbc:hive2://hs2> desc database `mydb1 `;
Location: "/warehouse/tablespace/external/hive/mydb1 .db"
{code}
With leading spaces the database can be created, but it can no longer be referenced:
{code}
0: jdbc:hive2://hs2> create database ` mydb2`;
INFO : OK
0: jdbc:hive2://hs2> desc database ` mydb2`;
Error: Error while compiling statement: FAILED: SemanticException [Error 10072]: Database does not exist: mydb2 (state=42000,code=10072)
0: jdbc:hive2://hs2> !outputformat xmlattr
0: jdbc:hive2://hs2> show databases;
{code}
For tables the spaces are trimmed: the tables are created without leading or trailing spaces in their names. However, as the example below shows, the space is kept in the table location.
{code}
0: jdbc:hive2://hs2> create external table `mytbl1 ` (col1 string);
INFO : OK
0: jdbc:hive2://hs2> desc formatted `mytbl1 `;
Location: "hdfs://namenode:8020/warehouse/tablespace/external/hive/mytbl1 "
0: jdbc:hive2://hs2> create external table ` mytbl2` (col1 string);
INFO : OK
0: jdbc:hive2://hs2> desc formatted ` mytbl2`;
Location: "hdfs://namenode:8020/warehouse/tablespace/external/hive/ mytbl2"
0: jdbc:hive2://hs2> show tables;
{code}
Interestingly, during table creation or other operations like "use", the database name is trimmed.
{code}
0: jdbc:hive2://hs2> create database mydb3;
INFO : OK
0: jdbc:hive2://hs2> create table ` mydb3`.`mytbl3` (col1 string);
INFO : OK
0: jdbc:hive2://hs2> use ` mydb3 `;
INFO : OK
0: jdbc:hive2://hs2> show tables;
{code}
One can verify with hdfs commands that the locations keep even the trailing spaces. Keeping the space in the HDFS location is inconsistent with the table names, is confusing in multiple ways (for example, a trailing space is not visible), and seems like a very bad pattern.

This is even more problematic because in the underlying HMS database the {{NOTIFICATION_LOG}} entries are created _with_ the spaces, exactly as passed in the SQL statement, even when the table name itself is trimmed. This provides incorrect information to other components that rely on the {{NOTIFICATION_LOG}}.

Hive should fully trim the database and table names in SQL statements, without propagating the untrimmed names any further.
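The requested normalization can be sketched in a few lines of Java. This is only an illustration under the assumption that trimming happens once, right after parsing, so that every later consumer (name, location, notification entry) sees the same value; `IdentifierNames.normalize` is a hypothetical helper, not an existing Hive method.

```java
// Hypothetical helper; not an existing Hive method.
final class IdentifierNames {
    private IdentifierNames() {}

    // Strips one pair of surrounding backticks, then trims whitespace
    // both outside and inside the quotes.
    static String normalize(String raw) {
        String s = raw.trim();
        if (s.length() >= 2 && s.startsWith("`") && s.endsWith("`")) {
            s = s.substring(1, s.length() - 1);
        }
        return s.trim();
    }
}
```

Under this sketch, the quoted names with leading or trailing spaces from the examples above would all resolve to the plain names `mydb2`, `mytbl1` and `mydb3`.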
[jira] [Created] (HIVE-25074) Remove Metastore flushCache usage
Miklos Szurap created HIVE-25074:

Summary: Remove Metastore flushCache usage
Key: HIVE-25074
URL: https://issues.apache.org/jira/browse/HIVE-25074
Project: Hive
Issue Type: Improvement
Components: Metastore, Standalone Metastore
Affects Versions: 4.0.0
Reporter: Miklos Szurap

The "flushCache" in HiveMetaStore with the ObjectStore implementation is currently a NOOP:
{code:java}
public void flushCache() {
  // NOP as there's no caching
}
{code}
The HBaseStore (HBaseReadWrite) had some logic in it, but that has been removed in HIVE-17234.

As far as I can see, the calls go like this:
HiveMetaStoreClient.flushCache() -> CachedStore.flushCache() -> ObjectStore.flushCache()

A significant number of calls (about 10% of all calls) are made from the client to the server to do nothing. We could avoid the call to the server completely, including getting a DB connection, which can take 1+ seconds under high load and slows down Hive queries unnecessarily.

Can we:
# Deprecate RawStore.flushCache() (if there are other implementations)
# Deprecate HiveMetaStoreClient.flushCache()
# Do the NOOP on the client side in HiveMetaStoreClient.flushCache() (until it is removed in a later version)
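The third proposal can be sketched as follows. The types here (`MetaStoreServer`, `CountingServer`, `SketchClient`) are hypothetical stand-ins and not Hive's real classes; the sketch only shows a deprecated client method that returns without making a remote call, next to the old forwarding behavior for comparison.

```java
// Hypothetical stand-in for the remote metastore interface; not Hive's API.
interface MetaStoreServer {
    void flushCache();
}

// Test double that counts how often the "server" is reached.
final class CountingServer implements MetaStoreServer {
    int calls = 0;

    @Override
    public void flushCache() {
        calls++;
    }
}

// Hypothetical client; not HiveMetaStoreClient.
final class SketchClient {
    private final MetaStoreServer server;

    SketchClient(MetaStoreServer server) {
        this.server = server;
    }

    /** @deprecated the server side does nothing, so no remote call is made. */
    @Deprecated
    void flushCache() {
        // Intentionally empty: saves the round trip and the DB connection.
    }

    // Old behavior, kept only for comparison: forwards the call to the server.
    void remoteFlushCache() {
        server.flushCache();
    }
}
```

Keeping the method in place (deprecated and empty) would let existing callers compile unchanged until the method is removed.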
[jira] [Created] (HIVE-24843) Remove unnecessary throw-catch in Deadline
Miklos Szurap created HIVE-24843:

Summary: Remove unnecessary throw-catch in Deadline
Key: HIVE-24843
URL: https://issues.apache.org/jira/browse/HIVE-24843
Project: Hive
Issue Type: Bug
Components: Standalone Metastore
Reporter: Miklos Szurap

The [Deadline|https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Deadline.java] class has an unnecessary throw-catch. Previously, HIVE-16450 refactored most of the exceptions, but missed the one in the check() method.
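The general pattern being described can be sketched like this. The code is illustrative only: the class `SketchDeadline`, the exception `TimeoutFailure` and the messages are assumptions and are not copied from Hive's Deadline class. The first method throws an exception and catches it in the same method just to rethrow another type; the second throws the final type directly.

```java
// Hypothetical timer check; names and messages are illustrative only.
final class SketchDeadline {
    private SketchDeadline() {}

    static final class TimeoutFailure extends Exception {
        TimeoutFailure(String message) {
            super(message);
        }
    }

    // Before: an exception is thrown and immediately caught in the same
    // method, only to be converted into another exception type.
    static void checkWithThrowCatch(long elapsedMs, long limitMs) throws TimeoutFailure {
        try {
            if (elapsedMs > limitMs) {
                throw new IllegalStateException(elapsedMs + "ms exceeds " + limitMs + "ms");
            }
        } catch (IllegalStateException e) {
            throw new TimeoutFailure(e.getMessage());
        }
    }

    // After: the final exception type is thrown directly.
    static void check(long elapsedMs, long limitMs) throws TimeoutFailure {
        if (elapsedMs > limitMs) {
            throw new TimeoutFailure(elapsedMs + "ms exceeds " + limitMs + "ms");
        }
    }
}
```

Both variants report the same message to the caller; the second simply avoids creating and unwinding an intermediate exception.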
[jira] [Created] (HIVE-23237) Display HiveServer2 hostname in the operation logs
Miklos Szurap created HIVE-23237:

Summary: Display HiveServer2 hostname in the operation logs
Key: HIVE-23237
URL: https://issues.apache.org/jira/browse/HIVE-23237
Project: Hive
Issue Type: Improvement
Reporter: Miklos Szurap

Hive deployments often have an external load balancer in front of multiple HiveServer2 instances. In such cases the client does not know which HiveServer2 it is connected to. If there is an issue, the logs of all HiveServer2 instances have to be searched for clues instead of going directly to the right host.

It would be great if the HS2 hostname were logged to the client logs (for example to beeline's output).

We can "work around" this by printing the information with "set hive.server2.thrift.bind.host;", but that requires an explicit modification to every application.

Can we print this information in the operation logs, so that it is streamed back to the client? Some users or customers likely do not want to expose it, so the behavior should be configurable.

This could make investigating issues and errors much easier.
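A minimal Java sketch of the idea, under stated assumptions: the boolean `enabled` stands in for a hypothetical configuration switch (no such property name is defined in this issue), `HostBanner` is a hypothetical class, and the message text is made up. The hostname lookup uses the standard `InetAddress.getLocalHost()` call.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;

// Hypothetical helper that builds one operation-log line; not part of Hive.
final class HostBanner {
    private HostBanner() {}

    // Returns the line to stream back to the client, or empty when the
    // (hypothetical) switch is off and the hostname must not be exposed.
    static Optional<String> banner(boolean enabled) {
        if (!enabled) {
            return Optional.empty();
        }
        String host;
        try {
            host = InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            host = "unknown"; // hostname could not be resolved on this machine
        }
        return Optional.of("Connected to HiveServer2 host: " + host);
    }
}
```

Returning an empty `Optional` when disabled keeps the decision in one place, so callers only write the line if it is present.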
[jira] [Created] (HIVE-21455) Too verbose logging in AvroGenericRecordReader
Miklos Szurap created HIVE-21455:

Summary: Too verbose logging in AvroGenericRecordReader
Key: HIVE-21455
URL: https://issues.apache.org/jira/browse/HIVE-21455
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Reporter: Miklos Szurap

{{AvroGenericRecordReader}} logs the Avro schema for each datafile. This is too verbose; we likely don't need to log it at INFO level.

For example, take this table:
{noformat}
create table avro_tbl (c1 string, c2 int, c3 float) stored as avro;
{noformat}
Querying it with a select star, with 3 datafiles, makes HiveServer2 log the following:
{noformat}
2019-03-15 09:18:35,999 INFO org.apache.hadoop.mapred.FileInputFormat: [HiveServer2-Handler-Pool: Thread-64]: Total input paths to process : 3
2019-03-15 09:18:35,999 INFO org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader: [HiveServer2-Handler-Pool: Thread-64]: Found the avro schema in the job: {"type":"record","name":"avro_tbl","namespace":"test","fields":[{"name":"c1","type":["null","string"],"default":null},{"name":"c2","type":["null","int"],"default":null},{"name":"c3","type":["null","float"],"default":null}]}
2019-03-15 09:18:36,004 INFO org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader: [HiveServer2-Handler-Pool: Thread-64]: Found the avro schema in the job: {"type":"record","name":"avro_tbl","namespace":"test","fields":[{"name":"c1","type":["null","string"],"default":null},{"name":"c2","type":["null","int"],"default":null},{"name":"c3","type":["null","float"],"default":null}]}
2019-03-15 09:18:36,010 INFO org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader: [HiveServer2-Handler-Pool: Thread-64]: Found the avro schema in the job: {"type":"record","name":"avro_tbl","namespace":"test","fields":[{"name":"c1","type":["null","string"],"default":null},{"name":"c2","type":["null","int"],"default":null},{"name":"c3","type":["null","float"],"default":null}]}
{noformat}
This has a huge performance and storage penalty on a table with a big schema and thousands of datafiles.
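Lowering the level of such a message can be sketched as below. The sketch uses `java.util.logging` only so that it stays self-contained; that is a stand-in and not the logging API Hive uses, and `SchemaLogging` and the logger name are hypothetical. The point is that the schema is written at a debug-like level (FINE) behind a level check, so nothing is emitted per datafile at the default INFO threshold.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical reader-side logging helper; not AvroGenericRecordReader.
final class SchemaLogging {
    private static final Logger LOG = Logger.getLogger("sketch.AvroReader");

    private SchemaLogging() {}

    // Logs the schema at FINE instead of INFO.
    // Returns true if a log record was emitted, false if it was skipped.
    static boolean logSchema(String schemaJson) {
        if (!LOG.isLoggable(Level.FINE)) {
            return false; // the message string is never even built
        }
        LOG.fine("Found the avro schema in the job: " + schemaJson);
        return true;
    }

    // Lets an operator (or a test) change the threshold for this logger.
    static void setLevel(Level level) {
        LOG.setLevel(level);
    }
}
```

The level check also avoids concatenating a potentially large schema string when debug logging is off.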