[jira] [Created] (HIVE-26941) Make SetProcessor configurable to ignore some set variables

2023-01-15 Thread Miklos Szurap (Jira)
Miklos Szurap created HIVE-26941:


 Summary: Make SetProcessor configurable to ignore some set 
variables
 Key: HIVE-26941
 URL: https://issues.apache.org/jira/browse/HIVE-26941
 Project: Hive
  Issue Type: New Feature
  Components: Configuration, Hive
Reporter: Miklos Szurap
Assignee: Miklos Szurap


In certain environments, after upgrades, we need to prevent users from changing 
some Hive configurations at runtime (for example "mapreduce.job.queuename" 
or "hive.execution.engine"). 
The "hive.security.authorization.sqlstd.confwhitelist" could be used for this, 
however:
* it is sometimes complex to modify (for example to exclude a config which 
is otherwise allowed by a wildcard) 
* when a user script tries to set a parameter that is not in the 
"hive.security.authorization.sqlstd.confwhitelist", the whole script 
fails with "Error: Error while processing statement: Cannot modify  at 
runtime. It is not in list of params that are allowed to be modified at 
runtime". Avoiding this would require modifying all user scripts and jobs (to 
remove that "set" command), which can be a huge effort.

With a new configuration item in hive-site.xml, cluster operators can configure 
HiveServer2 to ignore "set" commands for selected variables, essentially making 
those settings "final" at the HiveServer2 level. An attempt to change one of 
these "final" settings would not fail the script; the request would simply be 
ignored.

In this jira:
- add a new config "hive.conf.ignored.variable.list"
- it accepts a comma-separated list of variable names (strings)
- the config is empty by default and can be set in hive-site.xml only
- add "hive.conf.ignored.variable.list" to the restricted list 
("hive.conf.restricted.list") internally, so it cannot be modified at 
runtime
- add tests for the changes
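
The proposed behavior could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual SetProcessor change: the class IgnoredSetVariables and its methods are hypothetical, and a plain Map stands in for the session configuration. Only the config name "hive.conf.ignored.variable.list" comes from the description above.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Hive: decides whether a "set key=value"
// request should be skipped instead of applied or rejected.
final class IgnoredSetVariables {
  private final Set<String> ignored;

  // rawList stands for the value of hive.conf.ignored.variable.list
  // (comma-separated variable names; null or empty means nothing is ignored).
  IgnoredSetVariables(String rawList) {
    Set<String> names = new HashSet<>();
    if (rawList != null) {
      for (String name : rawList.split(",")) {
        String trimmed = name.trim();
        if (!trimmed.isEmpty()) {
          names.add(trimmed);
        }
      }
    }
    this.ignored = Collections.unmodifiableSet(names);
  }

  boolean isIgnored(String variableName) {
    return variableName != null && ignored.contains(variableName.trim());
  }

  // Applies the request to the session configuration unless the variable is
  // ignored. Returns true if the value was applied, false if it was skipped.
  boolean applySet(Map<String, String> sessionConf, String key, String value) {
    if (isIgnored(key)) {
      return false; // no error: the script continues with the old value
    }
    sessionConf.put(key, value);
    return true;
  }
}
```

With the list set to "mapreduce.job.queuename,hive.execution.engine", a script running "set hive.execution.engine=mr;" would continue without error and the engine would stay unchanged.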



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-26629) Misleading error message with hive.metastore.limit.partition.request

2022-10-13 Thread Miklos Szurap (Jira)
Miklos Szurap created HIVE-26629:


 Summary: Misleading error message with 
hive.metastore.limit.partition.request 
 Key: HIVE-26629
 URL: https://issues.apache.org/jira/browse/HIVE-26629
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2, Metastore
Reporter: Miklos Szurap


Dropping partitions from a table fails with a misleading error message saying 
that "partition not found":
{code}
0: jdbc:hive2://nightly-71x-zx-1.nightly-71x-> alter table t1p drop partition 
(p1>0);
Error: Error while compiling statement: FAILED: SemanticException [Error 
10006]: Partition not found (p1 > 0) (state=42000,code=10006)
{code}
However, the partitions do exist; the real error message is only visible in the 
HiveServer2 logs:
{code}
Caused by: MetaException(message:Number of partitions scanned (=2) on table 
't1p' exceeds limit (=1). This is controlled on the metastore server by 
hive.metastore.limit.partition.request.)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by_expr_result$get_partitions_by_expr_resultStandardScheme.read(ThriftHiveMetastore.java)
...
{code}

Hive should surface the real error message to the user, namely that the 
"hive.metastore.limit.partition.request" limit has been reached.

This happens only when "hive.metastore.limit.partition.request" is set.
I haven't verified other types of queries; other queries may fail 
similarly.
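
One possible way to surface the underlying message is to walk the exception cause chain and report the deepest message found. The sketch below is only an illustration: RootCauseMessage is a hypothetical helper, and plain java.lang exceptions stand in for SemanticException and MetaException. It does not reflect how Hive currently builds this error.

```java
// Hypothetical helper, not part of Hive.
final class RootCauseMessage {
  private RootCauseMessage() {}

  // Returns the message of the deepest cause that carries a non-empty message,
  // falling back to the top-level message when there is no such cause.
  static String of(Throwable error) {
    String message = error.getMessage();
    Throwable current = error.getCause();
    int depth = 0;
    while (current != null && depth < 64) { // depth limit guards against cyclic chains
      String causeMessage = current.getMessage();
      if (causeMessage != null && !causeMessage.isEmpty()) {
        message = causeMessage;
      }
      current = current.getCause();
      depth++;
    }
    return message;
  }
}
```

Applied to the case above, the user would see the "exceeds limit" message from the metastore instead of "Partition not found".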





[jira] [Created] (HIVE-25879) MetaStoreDirectSql test query should not query the DBS table

2022-01-19 Thread Miklos Szurap (Jira)
Miklos Szurap created HIVE-25879:


 Summary: MetaStoreDirectSql test query should not query the DBS 
table
 Key: HIVE-25879
 URL: https://issues.apache.org/jira/browse/HIVE-25879
 Project: Hive
  Issue Type: Bug
Reporter: Miklos Szurap


The runTestQuery() method in 
org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java uses the test query
{code:java}
select "DB_ID" from "DBS"{code}
to determine whether direct SQL can be used.

In larger deployments with many (10k+) Hive databases it would be more 
efficient to query a small table instead, for example the "VERSION" table, 
which should always have only a single row.





[jira] [Created] (HIVE-25466) Trim spaces from db and table names

2021-08-18 Thread Miklos Szurap (Jira)
Miklos Szurap created HIVE-25466:


 Summary: Trim spaces from db and table names
 Key: HIVE-25466
 URL: https://issues.apache.org/jira/browse/HIVE-25466
 Project: Hive
  Issue Type: Bug
  Components: Parser, SQL
Reporter: Miklos Szurap


If we create databases and tables with leading/trailing whitespace (using 
backticks), the behavior is inconsistent and leads to multiple problems.

Creating a database with a trailing space makes the space part of the database 
name; from then on the name must be used with backticks.
{code}
0: jdbc:hive2://hs2> create database `mydb1 `;
INFO  : OK
0: jdbc:hive2://hs2> desc database `mydb1 `;
Location: "/warehouse/tablespace/external/hive/mydb1 .db"
{code}
With leading spaces the database can be created, but it can't be referenced 
anymore:
{code}
0: jdbc:hive2://hs2> create database ` mydb2`;
INFO  : OK
0: jdbc:hive2://hs2> desc database ` mydb2`;
Error: Error while compiling statement: FAILED: SemanticException [Error 
10072]: Database does not exist:  mydb2 (state=42000,code=10072)

0: jdbc:hive2://hs2> !outputformat xmlattr
0: jdbc:hive2://hs2> show databases;

  
  
  
  
  

{code}
For tables the spaces are trimmed: the tables are created without leading or 
trailing spaces in their names. However, as the example below shows, the space 
is kept in the table location.
{code}
0: jdbc:hive2://hs2> create external table `mytbl1 ` (col1 string);
INFO  : OK
0: jdbc:hive2://hs2> desc formatted `mytbl1 `;
Location: "hdfs://namenode:8020/warehouse/tablespace/external/hive/mytbl1 "

0: jdbc:hive2://hs2> create external table ` mytbl2` (col1 string);
INFO  : OK
0: jdbc:hive2://hs2> desc formatted ` mytbl2`;
Location: "hdfs://namenode:8020/warehouse/tablespace/external/hive/ mytbl2"

0: jdbc:hive2://hs2> show tables;

  
  

{code}
Interestingly, during table creation or other operations like "use", the 
database name is trimmed.
{code}
0: jdbc:hive2://hs2> create database mydb3;
INFO  : OK
0: jdbc:hive2://hs2> create table ` mydb3`.`mytbl3` (col1 string);
INFO  : OK
0: jdbc:hive2://hs2> use `  mydb3  `;
INFO  : OK
0: jdbc:hive2://hs2> show tables;

  

{code}
One can verify with hdfs commands that the locations keep even the trailing 
spaces. Keeping the space in the HDFS location is inconsistent with the table 
names and also confusing in multiple ways (for example, a trailing space is not 
visible), and it is a very bad pattern.

This is even more problematic because in the underlying HMS database the 
{{NOTIFICATION_LOG}} entries are created _with_ the spaces, exactly as passed 
in the SQL statement, even if the table name is trimmed. This provides 
incorrect information to the other components relying on the 
{{NOTIFICATION_LOG}}.

Hive should completely trim the database and table names in SQL statements, 
without propagating the spaces any further.
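
The requested normalization could look roughly like the sketch below. It is a minimal illustration, not Hive's parser code: IdentifierNormalizer is a hypothetical helper that strips one pair of surrounding backticks and then trims whitespace, so that the same cleaned name would be used for the metadata, the location and the notification entries.

```java
// Hypothetical helper, not part of Hive.
final class IdentifierNormalizer {
  private IdentifierNormalizer() {}

  // Removes one pair of surrounding backticks (if present) and then any
  // leading or trailing whitespace from a database or table name.
  static String normalize(String identifier) {
    String name = identifier.trim();
    if (name.length() >= 2 && name.startsWith("`") && name.endsWith("`")) {
      name = name.substring(1, name.length() - 1);
    }
    return name.trim();
  }

  // Builds a database directory name from the normalized name, so no space
  // can end up in the path. The ".db" suffix follows the example output above.
  static String databaseDirectory(String warehouseRoot, String identifier) {
    return warehouseRoot + "/" + normalize(identifier) + ".db";
  }
}
```

With this, "`mydb1 `", "` mydb1`" and "mydb1" would all resolve to the same name and the same location.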





[jira] [Created] (HIVE-25074) Remove Metastore flushCache usage

2021-04-29 Thread Miklos Szurap (Jira)
Miklos Szurap created HIVE-25074:


 Summary: Remove Metastore flushCache usage
 Key: HIVE-25074
 URL: https://issues.apache.org/jira/browse/HIVE-25074
 Project: Hive
  Issue Type: Improvement
  Components: Metastore, Standalone Metastore
Affects Versions: 4.0.0
Reporter: Miklos Szurap


The "flushCache" in HiveMetaStore with the ObjectStore implementation is 
currently a NOOP:
{code:java}
  public void flushCache() {
    // NOP as there's no caching
  } {code}
The HBaseStore (HBaseReadWrite) had some logic in it; however, it was 
removed in HIVE-17234.

As far as I can see, the calls go like this:

HiveMetaStoreClient.flushCache() -> CachedStore.flushCache() -> 
ObjectStore.flushCache()

A significant number of calls (about 10% of all calls) are made from the 
client to the server to do nothing. We could spare the call to the server 
completely, including getting a DB connection, which can take 1+ seconds under 
high load and slows down Hive queries unnecessarily.

Can we:
 # Deprecate RawStore.flushCache() (in case there are other implementations)
 # Deprecate HiveMetaStoreClient.flushCache()
 # Make HiveMetaStoreClient.flushCache() a NOOP on the client side (until it 
is removed in a later version)
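
The client-side part of this proposal could be sketched as below. This is an illustration only: MetastoreRpc and SketchMetaStoreClient are hypothetical stand-ins for the Thrift interface and HiveMetaStoreClient, and getVersion() is included just to contrast a call that still reaches the server.

```java
// Hypothetical stand-in for the remote metastore interface.
interface MetastoreRpc {
  void flushCache();
  String getVersion();
}

// Hypothetical stand-in for the client; not the real HiveMetaStoreClient.
final class SketchMetaStoreClient {
  private final MetastoreRpc server;

  SketchMetaStoreClient(MetastoreRpc server) {
    this.server = server;
  }

  /**
   * @deprecated the server side does nothing, so the client no longer
   *     contacts the server; kept only for source compatibility.
   */
  @Deprecated
  void flushCache() {
    // Intentionally empty: no round trip and no DB connection on the server.
  }

  // Other calls still go to the server as before.
  String getVersion() {
    return server.getVersion();
  }
}
```

Existing callers keep compiling (with a deprecation warning), while the round trip to the server disappears.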





[jira] [Created] (HIVE-24843) Remove unnecessary throw-catch in Deadline

2021-03-04 Thread Miklos Szurap (Jira)
Miklos Szurap created HIVE-24843:


 Summary: Remove unnecessary throw-catch in Deadline
 Key: HIVE-24843
 URL: https://issues.apache.org/jira/browse/HIVE-24843
 Project: Hive
  Issue Type: Bug
  Components: Standalone Metastore
Reporter: Miklos Szurap


The 
[Deadline|https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Deadline.java]
 class has an unnecessary throw-catch. Previously HIVE-16450 refactored 
most of the exceptions, but missed the one in the check() method.
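
The general pattern can be illustrated as below. This is a simplified sketch with hypothetical names, not the actual Deadline source: an exception is thrown only to be caught a few lines later and wrapped, while the second variant creates the wrapped exception directly.

```java
// Hypothetical illustration of the throw-catch pattern; not the Deadline class.
final class DeadlineSketch {
  static final long NO_DEADLINE = Long.MIN_VALUE;

  static class DeadlineException extends Exception {
    DeadlineException(String message) {
      super(message);
    }
  }

  // Stand-in for the exception type that is reported to the caller.
  static class WrappedException extends Exception {
    WrappedException(DeadlineException cause) {
      super(cause.getMessage(), cause);
    }
  }

  // Before: the exception is thrown and immediately caught just to wrap it.
  static void checkWithThrowCatch(long startTime, long timeoutNanos, long now)
      throws WrappedException {
    try {
      if (startTime == NO_DEADLINE) {
        throw new DeadlineException("timer not started");
      }
      long elapsed = now - startTime;
      if (elapsed > timeoutNanos) {
        throw new DeadlineException("timeout after " + elapsed + "ns");
      }
    } catch (DeadlineException e) {
      throw new WrappedException(e);
    }
  }

  // After: the wrapped exception is created directly, without try/catch.
  static void checkDirect(long startTime, long timeoutNanos, long now)
      throws WrappedException {
    if (startTime == NO_DEADLINE) {
      throw new WrappedException(new DeadlineException("timer not started"));
    }
    long elapsed = now - startTime;
    if (elapsed > timeoutNanos) {
      throw new WrappedException(new DeadlineException("timeout after " + elapsed + "ns"));
    }
  }
}
```

Both variants report the same exception to the caller; the second one is simply shorter and avoids the local throw-catch.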





[jira] [Created] (HIVE-23237) Display HiveServer2 hostname in the operation logs

2020-04-17 Thread Miklos Szurap (Jira)
Miklos Szurap created HIVE-23237:


 Summary: Display HiveServer2 hostname in the operation logs
 Key: HIVE-23237
 URL: https://issues.apache.org/jira/browse/HIVE-23237
 Project: Hive
  Issue Type: Improvement
Reporter: Miklos Szurap


Hive deployments often have an external load balancer in front of multiple 
HiveServer2 instances. 
In such cases the client does not know which HiveServer2 it is connected to. If 
there are issues, all HiveServer2 logs have to be searched for clues 
instead of going directly to the right host. It would be great if the HS2 
hostname was logged to the client logs (for example to beeline's output). 
We can work around this by printing the information with a "set 
hive.server2.thrift.bind.host;" command, however that requires an explicit 
modification to every application. 
Can we print this information in the operation logs, so that it is streamed 
back to the client? 
Some users or customers likely do not want to expose the hostname, so the 
behavior should be configurable.
This could make issue and error investigation much easier.
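
A rough sketch of the idea follows. It is only an illustration under stated assumptions: ServerHostnameLine is a hypothetical helper, the boolean flag stands for a new (not yet existing) configuration option, and the wording of the log line is made up.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;

// Hypothetical helper, not part of Hive.
final class ServerHostnameLine {
  private ServerHostnameLine() {}

  // Returns the line to write to the operation log, or an empty Optional when
  // the operator has disabled exposing the hostname.
  static Optional<String> forOperationLog(boolean showHostname) {
    if (!showHostname) {
      return Optional.empty();
    }
    String host;
    try {
      host = InetAddress.getLocalHost().getHostName();
    } catch (UnknownHostException e) {
      host = "unknown"; // still log something rather than fail the operation
    }
    return Optional.of("HiveServer2 host: " + host);
  }
}
```

Because the line would be part of the operation log, clients such as beeline would receive it without any change to the applications.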





[jira] [Created] (HIVE-21455) Too verbose logging in AvroGenericRecordReader

2019-03-15 Thread Miklos Szurap (JIRA)
Miklos Szurap created HIVE-21455:


 Summary: Too verbose logging in AvroGenericRecordReader
 Key: HIVE-21455
 URL: https://issues.apache.org/jira/browse/HIVE-21455
 Project: Hive
  Issue Type: Improvement
  Components: HiveServer2
Reporter: Miklos Szurap


{{AvroGenericRecordReader}} logs the Avro schema for each datafile. This is too 
verbose; we likely do not need to log it at INFO level.
For example, with a table:
{noformat}
create table avro_tbl (c1 string, c2 int, c3 float) stored as avro;
{noformat}
and a select star query on it, with 3 datafiles HiveServer2 logs the 
following:
{noformat}
2019-03-15 09:18:35,999 INFO  org.apache.hadoop.mapred.FileInputFormat: [HiveServer2-Handler-Pool: Thread-64]: Total input paths to process : 3
2019-03-15 09:18:35,999 INFO  org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader: [HiveServer2-Handler-Pool: Thread-64]: Found the avro schema in the job: {"type":"record","name":"avro_tbl","namespace":"test","fields":[{"name":"c1","type":["null","string"],"default":null},{"name":"c2","type":["null","int"],"default":null},{"name":"c3","type":["null","float"],"default":null}]}
2019-03-15 09:18:36,004 INFO  org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader: [HiveServer2-Handler-Pool: Thread-64]: Found the avro schema in the job: {"type":"record","name":"avro_tbl","namespace":"test","fields":[{"name":"c1","type":["null","string"],"default":null},{"name":"c2","type":["null","int"],"default":null},{"name":"c3","type":["null","float"],"default":null}]}
2019-03-15 09:18:36,010 INFO  org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader: [HiveServer2-Handler-Pool: Thread-64]: Found the avro schema in the job: {"type":"record","name":"avro_tbl","namespace":"test","fields":[{"name":"c1","type":["null","string"],"default":null},{"name":"c2","type":["null","int"],"default":null},{"name":"c3","type":["null","float"],"default":null}]}
{noformat}
This has a huge performance and storage penalty on a table with a big schema 
and thousands of datafiles.
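
Lowering the log level could look roughly like the sketch below. It is an illustration only: it uses java.util.logging (where FINE roughly corresponds to DEBUG) to stay self-contained, which is not the logging framework used by Hive, and SchemaLogging is a hypothetical class.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration; not the AvroGenericRecordReader code.
final class SchemaLogging {
  private static final Logger LOG = Logger.getLogger(SchemaLogging.class.getName());

  private SchemaLogging() {}

  // Logs the schema only when debug-level (FINE) logging is enabled.
  // Returns true if the schema was logged.
  static boolean logSchema(String schemaJson) {
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine("Found the avro schema in the job: " + schemaJson);
      return true;
    }
    return false;
  }
}
```

With the level check in place, the message string is not even built at the INFO level, so large schemas cost nothing per datafile unless debugging is turned on.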


