[ https://issues.apache.org/jira/browse/AMBARI-22701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hoc Phan updated AMBARI-22701:
------------------------------
    Description:
alert_hive_metastore.py leaves behind orphan hive CLI processes that accumulate over time. Below is one example:

{code:none}
1001 593317 593316 0 Dec24 ? 00:00:00 -bash -c export PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/bin:/usr/bin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/sbin/:/usr/hdp/current/hive-metastore/bin' ; export HIVE_CONF_DIR="/usr/hdp/current/hive-metastore/conf/conf.server" ; hive --hiveconf hive.metastore.uris=thrift://demo.local:9083 --hiveconf hive.metastore.client.connect.retry.delay=1 --hiveconf hive.metastore.failure.retries=1 --hiveconf hive.metastore.connect.retries=1 --hiveconf hive.metastore.client.socket.timeout=14 --hiveconf hive.execution.engine=mr -e "show databases;"
{code}

Over many months, thousands of these can pile up on the host running Hive Metastore. To check, run the two commands below. The first counts the lingering "show databases" processes; the second counts threads per user:

{code:none}
ps -ef | grep "[s]how databases" | wc -l
ps h -Led -o user | sort | uniq -c | sort -n
{code}

Eventually the user hits its nproc limit, which crashes other services on the same host.

The fixes are:
1. Swap to the "hive" user instead of the "ambari-qa" user: https://issues.apache.org/jira/browse/AMBARI-22142
2. Change the hive CLI to beeline: https://issues.apache.org/jira/browse/AMBARI-17006

For some reason, the hive CLI processes do not get killed and keep lingering around.

Proposed fix, applied to alert_hive_metastore.py in /var/lib/ambari-server/resources/common-services/HIVE/0.12.0.2.0/package/alerts

Instructions:

1. Add the lines below after "HIVE_METASTORE_URIS_KEY = '{{hive-site/hive.metastore.uris}}'":

{code:none}
HIVE_SERVER_THRIFT_PORT_KEY = '{{hive-site/hive.server2.thrift.port}}'
HIVE_SERVER_THRIFT_HTTP_PORT_KEY = '{{hive-site/hive.server2.thrift.http.port}}'
HIVE_SERVER_TRANSPORT_MODE_KEY = '{{hive-site/hive.server2.transport.mode}}'
THRIFT_PORT_DEFAULT = 10000
HIVE_SERVER_TRANSPORT_MODE_DEFAULT = 'binary'
{code}

2. Change SMOKEUSER_DEFAULT = 'ambari-qa' to:

{code:none}
SMOKEUSER_DEFAULT = 'hive'
{code}

3. Replace

{code:none}
return (SECURITY_ENABLED_KEY,SMOKEUSER_KEYTAB_KEY,SMOKEUSER_PRINCIPAL_KEY,
  HIVE_METASTORE_URIS_KEY, SMOKEUSER_KEY, KERBEROS_EXECUTABLE_SEARCH_PATHS_KEY,
  STACK_ROOT)
{code}

with this:

{code:none}
return (SECURITY_ENABLED_KEY,SMOKEUSER_KEYTAB_KEY,SMOKEUSER_PRINCIPAL_KEY,
  HIVE_METASTORE_URIS_KEY, SMOKEUSER_KEY, KERBEROS_EXECUTABLE_SEARCH_PATHS_KEY,
  STACK_ROOT, HIVE_SERVER_THRIFT_PORT_KEY, HIVE_SERVER_THRIFT_HTTP_PORT_KEY,
  HIVE_SERVER_TRANSPORT_MODE_KEY)
{code}

4. Replace this

{code:none}
return (HIVE_METASTORE_URIS_KEY, HADOOPUSER_KEY)
{code}

with this:

{code:none}
return (HIVE_SERVER_THRIFT_PORT_KEY, HIVE_SERVER_THRIFT_HTTP_PORT_KEY,
  HIVE_SERVER_TRANSPORT_MODE_KEY, HIVE_METASTORE_URIS_KEY, HADOOPUSER_KEY)
{code}

5. Comment out these lines, because otherwise they keep injecting the ambari-qa user back:

{code:none}
#if SMOKEUSER_KEY in configurations:
#  smokeuser = configurations[SMOKEUSER_KEY]
{code}

6. Replace this code block:

{code:none}
cmd = format("export HIVE_CONF_DIR='{conf_dir}' ; "
             "hive --hiveconf hive.metastore.uris={metastore_uri}\
             --hiveconf hive.metastore.client.connect.retry.delay=1\
             --hiveconf hive.metastore.failure.retries=1\
             --hiveconf hive.metastore.connect.retries=1\
             --hiveconf hive.metastore.client.socket.timeout=14\
             --hiveconf hive.execution.engine=mr -e 'show databases;'")
{code}

with this block:

{code:none}
transport_mode = HIVE_SERVER_TRANSPORT_MODE_DEFAULT
if HIVE_SERVER_TRANSPORT_MODE_KEY in configurations:
  transport_mode = configurations[HIVE_SERVER_TRANSPORT_MODE_KEY]

port = THRIFT_PORT_DEFAULT
if transport_mode.lower() == 'binary' and HIVE_SERVER_THRIFT_PORT_KEY in configurations:
  port = int(configurations[HIVE_SERVER_THRIFT_PORT_KEY])
elif transport_mode.lower() == 'http' and HIVE_SERVER_THRIFT_HTTP_PORT_KEY in configurations:
  port = int(configurations[HIVE_SERVER_THRIFT_HTTP_PORT_KEY])

cmd = format("export HIVE_CONF_DIR='{conf_dir}' ; "
             "beeline -u jdbc:hive2://{host_name}:{port}/\
             --hiveconf hive.metastore.client.connect.retry.delay=1\
             --hiveconf hive.metastore.failure.retries=1\
             --hiveconf hive.metastore.connect.retries=1\
             --hiveconf hive.metastore.client.socket.timeout=14\
             --hiveconf hive.execution.engine=mr -e 'show databases;'")
{code}

> hive CLI process leak on metastore alert
> ----------------------------------------
>
>                 Key: AMBARI-22701
>                 URL: https://issues.apache.org/jira/browse/AMBARI-22701
>             Project: Ambari
>          Issue Type: Bug
>          Components: alerts
>    Affects Versions: 2.4.0
>         Environment: CentOS 6.9
> Ambari 2.4.0.1
> Hortonworks Hadoop 2.5.0.0-1245
> Hive installed
> Tez installed
>            Reporter: Hoc Phan

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
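The port selection in step 6 of the instructions can be read as a small pure function. The following is only a sketch for checking that logic outside Ambari: the constants are copied from step 1, while the helper names resolve_hs2_port and build_jdbc_url are hypothetical and not part of alert_hive_metastore.py.

```python
# Constants copied from step 1 of the instructions.
HIVE_SERVER_THRIFT_PORT_KEY = '{{hive-site/hive.server2.thrift.port}}'
HIVE_SERVER_THRIFT_HTTP_PORT_KEY = '{{hive-site/hive.server2.thrift.http.port}}'
HIVE_SERVER_TRANSPORT_MODE_KEY = '{{hive-site/hive.server2.transport.mode}}'
THRIFT_PORT_DEFAULT = 10000
HIVE_SERVER_TRANSPORT_MODE_DEFAULT = 'binary'


def resolve_hs2_port(configurations):
    """Hypothetical helper: choose the HiveServer2 port as step 6 does."""
    transport_mode = configurations.get(
        HIVE_SERVER_TRANSPORT_MODE_KEY, HIVE_SERVER_TRANSPORT_MODE_DEFAULT)
    mode = transport_mode.lower()

    # Fall back to the default port when the matching key is absent.
    port = THRIFT_PORT_DEFAULT
    if mode == 'binary' and HIVE_SERVER_THRIFT_PORT_KEY in configurations:
        port = int(configurations[HIVE_SERVER_THRIFT_PORT_KEY])
    elif mode == 'http' and HIVE_SERVER_THRIFT_HTTP_PORT_KEY in configurations:
        port = int(configurations[HIVE_SERVER_THRIFT_HTTP_PORT_KEY])
    return port


def build_jdbc_url(host_name, configurations):
    """Hypothetical helper: the URL that step 6 passes to `beeline -u`."""
    return "jdbc:hive2://{0}:{1}/".format(
        host_name, resolve_hs2_port(configurations))
```

With an empty configuration the sketch falls back to port 10000 in binary mode, matching the defaults in step 1. Like the proposed block, it does not add any transport-specific parameters to the JDBC URL.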
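For hosts where the two ps pipelines from the description need to be run from a script, the counting can be sketched in Python as below. This is a rough equivalent and not part of the proposed fix: the function names are hypothetical, and the input is assumed to be already-captured ps output lines with the user in the first column.

```python
from collections import Counter


def count_lingering(ps_lines, marker="show databases"):
    """Count ps lines whose command contains the marker.

    Rough equivalent of: ps -ef | grep "[s]how databases" | wc -l
    """
    return sum(1 for line in ps_lines if marker in line)


def count_by_user(ps_lines):
    """Tally lines per user, smallest count first.

    Rough equivalent of: ps h -Led -o user | sort | uniq -c | sort -n
    Assumes the user name (or numeric UID) is the first column.
    """
    users = (line.split(None, 1)[0] for line in ps_lines if line.strip())
    return sorted(Counter(users).items(), key=lambda item: item[1])
```

One way to feed these functions would be to capture the output of `ps -ef` with `subprocess.check_output` and split it into lines. A steadily growing count for the alert user between runs would point to the leak described in this issue.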