[ https://issues.apache.org/jira/browse/SPARK-32937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198866#comment-17198866 ]

Hyukjin Kwon commented on SPARK-32937:
--------------------------------------

Making a follow-up is the usual approach when the fix can be done easily and 
the failing test is not part of the regular PR builder; I can point you to 
plenty of precedents. If you think this is not the correct path forward in the 
dev branch, we will have to discuss it and revert PRs whenever a test fails, 
even when that test is not in the regular PR builder.

If this specific test failure does not look like such a case, [~holden], please 
go ahead and revert after commenting on the original PR.

> DecommissionSuite in k8s integration tests is failing.
> ------------------------------------------------------
>
>                 Key: SPARK-32937
>                 URL: https://issues.apache.org/jira/browse/SPARK-32937
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.1.0
>            Reporter: Prashant Sharma
>            Priority: Major
>
> Logs from the failing test, copied from Jenkins. As of now, it is failing 
> consistently. 
> {code}
> - Test basic decommissioning *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 182 times 
> over 3.00377927275 minutes. Last failure message: "++ id -u
>   + myuid=185
>   ++ id -g
>   + mygid=0
>   + set +e
>   ++ getent passwd 185
>   + uidentry=
>   + set -e
>   + '[' -z '' ']'
>   + '[' -w /etc/passwd ']'
>   + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
>   + SPARK_CLASSPATH=':/opt/spark/jars/*'
>   + env
>   + grep SPARK_JAVA_OPT_
>   + sort -t_ -k4 -n
>   + sed 's/[^=]*=\(.*\)/\1/g'
>   + readarray -t SPARK_EXECUTOR_JAVA_OPTS
>   + '[' -n '' ']'
>   + '[' 3 == 2 ']'
>   + '[' 3 == 3 ']'
>   ++ python3 -V
>   + pyv3='Python 3.7.3'
>   + export PYTHON_VERSION=3.7.3
>   + PYTHON_VERSION=3.7.3
>   + export PYSPARK_PYTHON=python3
>   + PYSPARK_PYTHON=python3
>   + export PYSPARK_DRIVER_PYTHON=python3
>   + PYSPARK_DRIVER_PYTHON=python3
>   + '[' -n '' ']'
>   + '[' -z ']'
>   + '[' -z x ']'
>   + SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
>   + case "$1" in
>   + shift 1
>   + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
>   + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=172.17.0.4 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
> local:///opt/spark/tests/decommissioning.py
>   20/09/17 11:06:56 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
>   Starting decom test
>   Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
>   20/09/17 11:06:56 INFO SparkContext: Running Spark version 3.1.0-SNAPSHOT
>   20/09/17 11:06:57 INFO ResourceUtils: 
> ==============================================================
>   20/09/17 11:06:57 INFO ResourceUtils: No custom resources configured for 
> spark.driver.
>   20/09/17 11:06:57 INFO ResourceUtils: 
> ==============================================================
>   20/09/17 11:06:57 INFO SparkContext: Submitted application: PyMemoryTest
>   20/09/17 11:06:57 INFO ResourceProfile: Default ResourceProfile created, 
> executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , 
> memory -> name: memory, amount: 1024, script: , vendor: ), task resources: 
> Map(cpus -> name: cpus, amount: 1.0)
>   20/09/17 11:06:57 INFO ResourceProfile: Limiting resource is cpus at 1 
> tasks per executor
>   20/09/17 11:06:57 INFO ResourceProfileManager: Added ResourceProfile id: 0
>   20/09/17 11:06:57 INFO SecurityManager: Changing view acls to: 185,jenkins
>   20/09/17 11:06:57 INFO SecurityManager: Changing modify acls to: 185,jenkins
>   20/09/17 11:06:57 INFO SecurityManager: Changing view acls groups to: 
>   20/09/17 11:06:57 INFO SecurityManager: Changing modify acls groups to: 
>   20/09/17 11:06:57 INFO SecurityManager: SecurityManager: authentication 
> enabled; ui acls disabled; users  with view permissions: Set(185, jenkins); 
> groups with view permissions: Set(); users  with modify permissions: Set(185, 
> jenkins); groups with modify permissions: Set()
>   20/09/17 11:06:57 INFO Utils: Successfully started service 'sparkDriver' on 
> port 7078.
>   20/09/17 11:06:57 INFO SparkEnv: Registering MapOutputTracker
>   20/09/17 11:06:57 INFO SparkEnv: Registering BlockManagerMaster
>   20/09/17 11:06:57 INFO BlockManagerMasterEndpoint: Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
>   20/09/17 11:06:57 INFO BlockManagerMasterEndpoint: 
> BlockManagerMasterEndpoint up
>   20/09/17 11:06:57 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
>   20/09/17 11:06:57 INFO DiskBlockManager: Created local directory at 
> /var/data/spark-7985c075-3b02-42ec-9111-cefba535adf0/blockmgr-3bd403d0-6689-46be-997e-5bc699ecefd3
>   20/09/17 11:06:57 INFO MemoryStore: MemoryStore started with capacity 593.9 
> MiB
>   20/09/17 11:06:57 INFO SparkEnv: Registering OutputCommitCoordinator
>   20/09/17 11:06:58 INFO Utils: Successfully started service 'SparkUI' on 
> port 4040.
>   20/09/17 11:06:58 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
> http://spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:4040
>   20/09/17 11:06:58 INFO SparkKubernetesClientFactory: Auto-configuring K8S 
> client using current context from users K8S config file
>   20/09/17 11:06:59 INFO ExecutorPodsAllocator: Going to request 3 executors 
> from Kubernetes.
>   20/09/17 11:06:59 INFO KubernetesClientUtils: Spark configuration files 
> loaded from Some(/opt/spark/conf) : 
>   20/09/17 11:07:00 INFO BasicExecutorFeatureStep: Adding decommission script 
> to lifecycle
>   20/09/17 11:07:00 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
>   20/09/17 11:07:00 INFO NettyBlockTransferService: Server created on 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079
>   20/09/17 11:07:00 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
>   20/09/17 11:07:00 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(driver, 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc,
>  7079, None)
>   20/09/17 11:07:00 INFO BlockManagerMasterEndpoint: Registering block 
> manager 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079
>  with 593.9 MiB RAM, BlockManagerId(driver, 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc,
>  7079, None)
>   20/09/17 11:07:00 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(driver, 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc,
>  7079, None)
>   20/09/17 11:07:00 INFO KubernetesClientUtils: Spark configuration files 
> loaded from Some(/opt/spark/conf) : 
>   20/09/17 11:07:00 INFO BlockManager: Initialized BlockManager: 
> BlockManagerId(driver, 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc,
>  7079, None)
>   20/09/17 11:07:00 INFO BasicExecutorFeatureStep: Adding decommission script 
> to lifecycle
>   20/09/17 11:07:00 INFO KubernetesClientUtils: Spark configuration files 
> loaded from Some(/opt/spark/conf) : 
>   20/09/17 11:07:00 INFO BasicExecutorFeatureStep: Adding decommission script 
> to lifecycle
>   20/09/17 11:07:00 INFO KubernetesClientUtils: Spark configuration files 
> loaded from Some(/opt/spark/conf) : 
>   20/09/17 11:07:05 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.6:50176) with 
> ID 2,  ResourceProfileId 0
>   20/09/17 11:07:05 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.5:35624) with 
> ID 1,  ResourceProfileId 0
>   20/09/17 11:07:05 INFO BlockManagerMasterEndpoint: Registering block 
> manager 172.17.0.6:33547 with 593.9 MiB RAM, BlockManagerId(2, 172.17.0.6, 
> 33547, None)
>   20/09/17 11:07:05 INFO BlockManagerMasterEndpoint: Registering block 
> manager 172.17.0.5:46327 with 593.9 MiB RAM, BlockManagerId(1, 172.17.0.5, 
> 46327, None)
>   20/09/17 11:07:29 INFO KubernetesClusterSchedulerBackend: SchedulerBackend 
> is ready for scheduling beginning after waiting 
> maxRegisteredResourcesWaitingTime: 30000000000(ns)
>   20/09/17 11:07:30 INFO SharedState: Setting hive.metastore.warehouse.dir 
> ('null') to the value of spark.sql.warehouse.dir 
> ('file:/opt/spark/work-dir/spark-warehouse').
>   20/09/17 11:07:30 INFO SharedState: Warehouse path is 
> 'file:/opt/spark/work-dir/spark-warehouse'.
>   20/09/17 11:07:32 INFO SparkContext: Starting job: collect at 
> /opt/spark/tests/decommissioning.py:44
>   20/09/17 11:07:32 INFO DAGScheduler: Registering RDD 2 (groupByKey at 
> /opt/spark/tests/decommissioning.py:43) as input to shuffle 0
>   20/09/17 11:07:32 INFO DAGScheduler: Got job 0 (collect at 
> /opt/spark/tests/decommissioning.py:44) with 5 output partitions
>   20/09/17 11:07:32 INFO DAGScheduler: Final stage: ResultStage 1 (collect at 
> /opt/spark/tests/decommissioning.py:44)
>   20/09/17 11:07:32 INFO DAGScheduler: Parents of final stage: 
> List(ShuffleMapStage 0)
>   20/09/17 11:07:32 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 
> 0)
>   20/09/17 11:07:32 INFO DAGScheduler: Submitting ShuffleMapStage 0 
> (PairwiseRDD[2] at groupByKey at /opt/spark/tests/decommissioning.py:43), 
> which has no missing parents
>   20/09/17 11:07:32 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 10.6 KiB, free 593.9 MiB)
>   20/09/17 11:07:32 INFO MemoryStore: Block broadcast_0_piece0 stored as 
> bytes in memory (estimated size 6.5 KiB, free 593.9 MiB)
>   20/09/17 11:07:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079
>  (size: 6.5 KiB, free: 593.9 MiB)
>   20/09/17 11:07:32 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1348
>   20/09/17 11:07:32 INFO DAGScheduler: Submitting 5 missing tasks from 
> ShuffleMapStage 0 (PairwiseRDD[2] at groupByKey at 
> /opt/spark/tests/decommissioning.py:43) (first 15 tasks are for partitions 
> Vector(0, 1, 2, 3, 4))
>   20/09/17 11:07:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 5 tasks 
> resource profile 0
>   20/09/17 11:07:32 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 
> 0) (172.17.0.6, executor 2, partition 0, PROCESS_LOCAL, 7341 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:07:32 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 
> 1) (172.17.0.5, executor 1, partition 1, PROCESS_LOCAL, 7341 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:07:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 172.17.0.5:46327 (size: 6.5 KiB, free: 593.9 MiB)
>   20/09/17 11:07:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 172.17.0.6:33547 (size: 6.5 KiB, free: 593.9 MiB)
>   20/09/17 11:07:34 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 
> 2) (172.17.0.5, executor 1, partition 2, PROCESS_LOCAL, 7341 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:07:34 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 
> 1) in 1825 ms on 172.17.0.5 (executor 1) (1/5)
>   20/09/17 11:07:34 INFO PythonAccumulatorV2: Connected to AccumulatorServer 
> at host: 127.0.0.1 port: 47109
>   20/09/17 11:07:34 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 
> 3) (172.17.0.6, executor 2, partition 3, PROCESS_LOCAL, 7341 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:07:34 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 
> 0) in 1960 ms on 172.17.0.6 (executor 2) (2/5)
>   20/09/17 11:07:34 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 
> 4) (172.17.0.5, executor 1, partition 4, PROCESS_LOCAL, 7341 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:07:34 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 
> 2) in 136 ms on 172.17.0.5 (executor 1) (3/5)
>   20/09/17 11:07:34 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 
> 3) in 119 ms on 172.17.0.6 (executor 2) (4/5)
>   20/09/17 11:07:34 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 
> 4) in 117 ms on 172.17.0.5 (executor 1) (5/5)
>   20/09/17 11:07:34 INFO DAGScheduler: ShuffleMapStage 0 (groupByKey at 
> /opt/spark/tests/decommissioning.py:43) finished in 2.352 s
>   20/09/17 11:07:34 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
> have all completed, from pool 
>   20/09/17 11:07:34 INFO DAGScheduler: looking for newly runnable stages
>   20/09/17 11:07:34 INFO DAGScheduler: running: Set()
>   20/09/17 11:07:34 INFO DAGScheduler: waiting: Set(ResultStage 1)
>   20/09/17 11:07:34 INFO DAGScheduler: failed: Set()
>   20/09/17 11:07:34 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[5] 
> at collect at /opt/spark/tests/decommissioning.py:44), which has no missing 
> parents
>   20/09/17 11:07:34 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 9.3 KiB, free 593.9 MiB)
>   20/09/17 11:07:34 INFO MemoryStore: Block broadcast_1_piece0 stored as 
> bytes in memory (estimated size 5.4 KiB, free 593.9 MiB)
>   20/09/17 11:07:34 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
> on 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079
>  (size: 5.4 KiB, free: 593.9 MiB)
>   20/09/17 11:07:34 INFO SparkContext: Created broadcast 1 from broadcast at 
> DAGScheduler.scala:1348
>   20/09/17 11:07:34 INFO DAGScheduler: Submitting 5 missing tasks from 
> ResultStage 1 (PythonRDD[5] at collect at 
> /opt/spark/tests/decommissioning.py:44) (first 15 tasks are for partitions 
> Vector(0, 1, 2, 3, 4))
>   20/09/17 11:07:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 5 tasks 
> resource profile 0
>   20/09/17 11:07:34 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 
> 5) (172.17.0.6, executor 2, partition 0, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:07:34 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 
> 6) (172.17.0.5, executor 1, partition 1, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:07:34 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
> on 172.17.0.6:33547 (size: 5.4 KiB, free: 593.9 MiB)
>   20/09/17 11:07:34 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
> on 172.17.0.5:46327 (size: 5.4 KiB, free: 593.9 MiB)
>   20/09/17 11:07:34 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
> output locations for shuffle 0 to 172.17.0.5:35624
>   20/09/17 11:07:34 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
> output locations for shuffle 0 to 172.17.0.6:50176
>   20/09/17 11:07:35 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 
> 7) (172.17.0.6, executor 2, partition 2, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:07:35 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 
> 5) in 729 ms on 172.17.0.6 (executor 2) (1/5)
>   20/09/17 11:07:35 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 
> 8) (172.17.0.5, executor 1, partition 3, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:07:35 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 
> 6) in 755 ms on 172.17.0.5 (executor 1) (2/5)
>   20/09/17 11:07:35 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 
> 9) (172.17.0.6, executor 2, partition 4, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:07:35 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 
> 7) in 113 ms on 172.17.0.6 (executor 2) (3/5)
>   20/09/17 11:07:35 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 
> 8) in 104 ms on 172.17.0.5 (executor 1) (4/5)
>   20/09/17 11:07:35 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 
> 9) in 82 ms on 172.17.0.6 (executor 2) (5/5)
>   20/09/17 11:07:35 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks 
> have all completed, from pool 
>   20/09/17 11:07:35 INFO DAGScheduler: ResultStage 1 (collect at 
> /opt/spark/tests/decommissioning.py:44) finished in 0.943 s
>   20/09/17 11:07:35 INFO DAGScheduler: Job 0 is finished. Cancelling 
> potential speculative or zombie tasks for this job
>   20/09/17 11:07:35 INFO TaskSchedulerImpl: Killing all running tasks in 
> stage 1: Stage finished
>   20/09/17 11:07:35 INFO DAGScheduler: Job 0 finished: collect at 
> /opt/spark/tests/decommissioning.py:44, took 3.420388 s
>   1st accumulator value is: 100
>   Waiting to give nodes time to finish migration, decom exec 1.
>   ...
>   20/09/17 11:07:36 WARN 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 
> 1 decommissioned message
>   20/09/17 11:07:36 INFO ShuffleStatus: Updating map output for 4 to 
> BlockManagerId(2, 172.17.0.6, 33547, None)
>   20/09/17 11:07:36 INFO ShuffleStatus: Updating map output for 1 to 
> BlockManagerId(2, 172.17.0.6, 33547, None)
>   20/09/17 11:07:36 INFO ShuffleStatus: Updating map output for 2 to 
> BlockManagerId(2, 172.17.0.6, 33547, None)
>   20/09/17 11:07:36 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079
>  in memory (size: 5.4 KiB, free: 593.9 MiB)
>   20/09/17 11:07:36 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 
> 172.17.0.5:46327 in memory (size: 5.4 KiB, free: 593.9 MiB)
>   20/09/17 11:07:36 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 
> 172.17.0.6:33547 in memory (size: 5.4 KiB, free: 593.9 MiB)
>   20/09/17 11:07:37 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.5: 
> Executor decommission.
>   20/09/17 11:07:37 INFO DAGScheduler: Executor lost: 1 (epoch 1)
>   20/09/17 11:07:37 INFO BlockManagerMasterEndpoint: Trying to remove 
> executor 1 from BlockManagerMaster.
>   20/09/17 11:07:37 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(1, 172.17.0.5, 46327, None)
>   20/09/17 11:07:37 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
>   20/09/17 11:07:37 INFO DAGScheduler: Shuffle files lost for executor: 1 
> (epoch 1)
>   20/09/17 11:07:41 INFO ExecutorPodsAllocator: Going to request 1 executors 
> from Kubernetes.
>   20/09/17 11:07:41 INFO BasicExecutorFeatureStep: Adding decommission script 
> to lifecycle
>   20/09/17 11:07:41 INFO KubernetesClientUtils: Spark configuration files 
> loaded from Some(/opt/spark/conf) : 
>   20/09/17 11:07:43 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.5:35848) with 
> ID 3,  ResourceProfileId 0
>   20/09/17 11:07:43 INFO BlockManagerMasterEndpoint: Registering block 
> manager 172.17.0.5:34299 with 593.9 MiB RAM, BlockManagerId(3, 172.17.0.5, 
> 34299, None)
>   20/09/17 11:08:05 INFO SparkContext: Starting job: count at 
> /opt/spark/tests/decommissioning.py:49
>   20/09/17 11:08:05 INFO DAGScheduler: Got job 1 (count at 
> /opt/spark/tests/decommissioning.py:49) with 5 output partitions
>   20/09/17 11:08:05 INFO DAGScheduler: Final stage: ResultStage 3 (count at 
> /opt/spark/tests/decommissioning.py:49)
>   20/09/17 11:08:05 INFO DAGScheduler: Parents of final stage: 
> List(ShuffleMapStage 2)
>   20/09/17 11:08:05 INFO DAGScheduler: Missing parents: List()
>   20/09/17 11:08:05 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[6] 
> at count at /opt/spark/tests/decommissioning.py:49), which has no missing 
> parents
>   20/09/17 11:08:05 INFO MemoryStore: Block broadcast_2 stored as values in 
> memory (estimated size 10.6 KiB, free 593.9 MiB)
>   20/09/17 11:08:05 INFO MemoryStore: Block broadcast_2_piece0 stored as 
> bytes in memory (estimated size 5.9 KiB, free 593.9 MiB)
>   20/09/17 11:08:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory 
> on 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079
>  (size: 5.9 KiB, free: 593.9 MiB)
>   20/09/17 11:08:05 INFO SparkContext: Created broadcast 2 from broadcast at 
> DAGScheduler.scala:1348
>   20/09/17 11:08:05 INFO DAGScheduler: Submitting 5 missing tasks from 
> ResultStage 3 (PythonRDD[6] at count at 
> /opt/spark/tests/decommissioning.py:49) (first 15 tasks are for partitions 
> Vector(0, 1, 2, 3, 4))
>   20/09/17 11:08:05 INFO TaskSchedulerImpl: Adding task set 3.0 with 5 tasks 
> resource profile 0
>   20/09/17 11:08:05 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 
> 10) (172.17.0.6, executor 2, partition 0, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:08:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory 
> on 172.17.0.6:33547 (size: 5.9 KiB, free: 593.9 MiB)
>   20/09/17 11:08:05 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
> output locations for shuffle 0 to 172.17.0.6:50176
>   20/09/17 11:08:05 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 
> 10) in 133 ms on 172.17.0.6 (executor 2) (1/5)
>   20/09/17 11:08:05 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 
> 11) (172.17.0.6, executor 2, partition 1, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:08:05 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 
> 12) (172.17.0.6, executor 2, partition 2, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:08:05 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 
> 11) in 81 ms on 172.17.0.6 (executor 2) (2/5)
>   20/09/17 11:08:05 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 
> 13) (172.17.0.6, executor 2, partition 3, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:08:05 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 
> 12) in 85 ms on 172.17.0.6 (executor 2) (3/5)
>   20/09/17 11:08:05 INFO TaskSetManager: Starting task 4.0 in stage 3.0 (TID 
> 14) (172.17.0.6, executor 2, partition 4, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:08:05 INFO TaskSetManager: Finished task 3.0 in stage 3.0 (TID 
> 13) in 73 ms on 172.17.0.6 (executor 2) (4/5)
>   20/09/17 11:08:06 INFO TaskSetManager: Finished task 4.0 in stage 3.0 (TID 
> 14) in 91 ms on 172.17.0.6 (executor 2) (5/5)
>   20/09/17 11:08:06 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks 
> have all completed, from pool 
>   20/09/17 11:08:06 INFO DAGScheduler: ResultStage 3 (count at 
> /opt/spark/tests/decommissioning.py:49) finished in 0.478 s
>   20/09/17 11:08:06 INFO DAGScheduler: Job 1 is finished. Cancelling 
> potential speculative or zombie tasks for this job
>   20/09/17 11:08:06 INFO TaskSchedulerImpl: Killing all running tasks in 
> stage 3: Stage finished
>   20/09/17 11:08:06 INFO DAGScheduler: Job 1 finished: count at 
> /opt/spark/tests/decommissioning.py:49, took 0.489355 s
>   20/09/17 11:08:06 INFO SparkContext: Starting job: collect at 
> /opt/spark/tests/decommissioning.py:50
>   20/09/17 11:08:06 INFO DAGScheduler: Got job 2 (collect at 
> /opt/spark/tests/decommissioning.py:50) with 5 output partitions
>   20/09/17 11:08:06 INFO DAGScheduler: Final stage: ResultStage 5 (collect at 
> /opt/spark/tests/decommissioning.py:50)
>   20/09/17 11:08:06 INFO DAGScheduler: Parents of final stage: 
> List(ShuffleMapStage 4)
>   20/09/17 11:08:06 INFO DAGScheduler: Missing parents: List()
>   20/09/17 11:08:06 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079
>  in memory (size: 5.9 KiB, free: 593.9 MiB)
>   20/09/17 11:08:06 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[5] 
> at collect at /opt/spark/tests/decommissioning.py:44), which has no missing 
> parents
>   20/09/17 11:08:06 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 
> 172.17.0.6:33547 in memory (size: 5.9 KiB, free: 593.9 MiB)
>   20/09/17 11:08:06 INFO MemoryStore: Block broadcast_3 stored as values in 
> memory (estimated size 9.3 KiB, free 593.9 MiB)
>   20/09/17 11:08:06 INFO MemoryStore: Block broadcast_3_piece0 stored as 
> bytes in memory (estimated size 5.4 KiB, free 593.9 MiB)
>   20/09/17 11:08:06 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
> on 
> spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079
>  (size: 5.4 KiB, free: 593.9 MiB)
>   20/09/17 11:08:06 INFO SparkContext: Created broadcast 3 from broadcast at 
> DAGScheduler.scala:1348
>   20/09/17 11:08:06 INFO DAGScheduler: Submitting 5 missing tasks from 
> ResultStage 5 (PythonRDD[5] at collect at 
> /opt/spark/tests/decommissioning.py:44) (first 15 tasks are for partitions 
> Vector(0, 1, 2, 3, 4))
>   20/09/17 11:08:06 INFO TaskSchedulerImpl: Adding task set 5.0 with 5 tasks 
> resource profile 0
>   20/09/17 11:08:06 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 
> 15) (172.17.0.6, executor 2, partition 0, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:08:06 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 
> on 172.17.0.6:33547 (size: 5.4 KiB, free: 593.9 MiB)
>   20/09/17 11:08:06 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 
> 16) (172.17.0.6, executor 2, partition 1, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:08:06 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 
> 15) in 105 ms on 172.17.0.6 (executor 2) (1/5)
>   20/09/17 11:08:06 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID 
> 16) in 84 ms on 172.17.0.6 (executor 2) (2/5)
>   20/09/17 11:08:06 INFO TaskSetManager: Starting task 2.0 in stage 5.0 (TID 
> 17) (172.17.0.6, executor 2, partition 2, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:08:06 INFO TaskSetManager: Starting task 3.0 in stage 5.0 (TID 
> 18) (172.17.0.6, executor 2, partition 3, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:08:06 INFO TaskSetManager: Finished task 2.0 in stage 5.0 (TID 
> 17) in 76 ms on 172.17.0.6 (executor 2) (3/5)
>   20/09/17 11:08:06 INFO TaskSetManager: Starting task 4.0 in stage 5.0 (TID 
> 19) (172.17.0.6, executor 2, partition 4, NODE_LOCAL, 7162 bytes) 
> taskResourceAssignments Map()
>   20/09/17 11:08:06 INFO TaskSetManager: Finished task 3.0 in stage 5.0 (TID 
> 18) in 72 ms on 172.17.0.6 (executor 2) (4/5)
>   20/09/17 11:08:06 INFO TaskSetManager: Finished task 4.0 in stage 5.0 (TID 
> 19) in 90 ms on 172.17.0.6 (executor 2) (5/5)
>   20/09/17 11:08:06 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks 
> have all completed, from pool 
>   20/09/17 11:08:06 INFO DAGScheduler: ResultStage 5 (collect at 
> /opt/spark/tests/decommissioning.py:50) finished in 0.448 s
>   20/09/17 11:08:06 INFO DAGScheduler: Job 2 is finished. Cancelling 
> potential speculative or zombie tasks for this job
>   20/09/17 11:08:06 INFO TaskSchedulerImpl: Killing all running tasks in 
> stage 5: Stage finished
>   20/09/17 11:08:06 INFO DAGScheduler: Job 2 finished: collect at 
> /opt/spark/tests/decommissioning.py:50, took 0.460430 s
>   Final accumulator value is: 100
>   Finished waiting, stopping Spark.
>   20/09/17 11:08:06 INFO SparkUI: Stopped Spark web UI at 
> http://spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:4040
>   20/09/17 11:08:06 INFO KubernetesClusterSchedulerBackend: Shutting down all 
> executors
>   20/09/17 11:08:06 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
>   20/09/17 11:08:06 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client 
> has been closed (this is expected if the application is shutting down.)
>   20/09/17 11:08:06 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
>   20/09/17 11:08:06 INFO MemoryStore: MemoryStore cleared
>   20/09/17 11:08:06 INFO BlockManager: BlockManager stopped
>   20/09/17 11:08:06 INFO BlockManagerMaster: BlockManagerMaster stopped
>   20/09/17 11:08:06 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
>   20/09/17 11:08:06 INFO SparkContext: Successfully stopped SparkContext
>   Done, exiting Python
>   20/09/17 11:08:07 INFO ShutdownHookManager: Shutdown hook called
>   20/09/17 11:08:07 INFO ShutdownHookManager: Deleting directory 
> /var/data/spark-7985c075-3b02-42ec-9111-cefba535adf0/spark-d5ac2f3e-fe8b-4122-8026-807d265f3a69/pyspark-62a6caeb-b2e5-4b8f-8eb3-e7b2c5fb155c
>   20/09/17 11:08:07 INFO ShutdownHookManager: Deleting directory 
> /var/data/spark-7985c075-3b02-42ec-9111-cefba535adf0/spark-d5ac2f3e-fe8b-4122-8026-807d265f3a69
>   20/09/17 11:08:07 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-b74e6224-3fa7-40d2-abc4-6622bd524e65
>   " did not contain "Received decommission executor message" The application 
> did not complete, did not find str Received decommission executor message. 
> (KubernetesSuite.scala:387)
> Run completed in 12 minutes, 29 seconds.
> Total number of tests run: 18
> Suites: completed 2, aborted 0
> Tests: succeeded 17, failed 1, canceled 0, ignored 0, pending 0
> *** 1 TEST FAILED ***
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] Reactor Summary for Spark Project Parent POM 3.1.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ........................... SUCCESS [  4.094 
> s]
> [INFO] Spark Project Tags ................................. SUCCESS [  8.630 
> s]
> [INFO] Spark Project Local DB ............................. SUCCESS [  4.062 
> s]
> [INFO] Spark Project Networking ........................... SUCCESS [  5.891 
> s]
> [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  3.059 
> s]
> [INFO] Spark Project Unsafe ............................... SUCCESS [ 10.869 
> s]
> [INFO] Spark Project Launcher ............................. SUCCESS [  3.432 
> s]
> [INFO] Spark Project Core ................................. SUCCESS [02:26 
> min]
> [INFO] Spark Project Kubernetes Integration Tests ......... FAILURE [15:13 
> min]
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] Total time:  18:21 min
> [INFO] Finished at: 2020-09-17T04:10:08-07:00
> [INFO] 
> ------------------------------------------------------------------------
> [ERROR] Failed to execute goal 
> org.scalatest:scalatest-maven-plugin:2.0.0:test (integration-test) on project 
> spark-kubernetes-integration-tests_2.12: There are test failures -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> [ERROR] 
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR]   mvn <args> -rf :spark-kubernetes-integration-tests_2.12
> + retcode3=1
> + kill -9 82255
> + minikube stop
> :   Stopping "minikube" in kvm2 ...
> -   "minikube" stopped.
> /tmp/hudson6767824981271828433.sh: line 66: 82255 Killed                  
> minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} 
> --9p-version=9p2000.L --gid=0 --uid=185
> + [[ 1 = 0 ]]
> + test_status=failure
> + /home/jenkins/bin/post_github_pr_comment.py
> Attempting to post to Github...
>  > Post successful.
> + rm -rf /tmp/tmp.epTpFHp0Dl
> + exit 1
> Build step 'Execute shell' marked build as failure
> {code}
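>
> For reference, the failure above comes from an eventually-style scan of the
> driver pod log (the assertion at KubernetesSuite.scala:387). A minimal sketch
> of that pattern, assuming ScalaTest's Eventually; the helper name and the log
> reader are illustrative, not the exact suite code:
> {code:scala}
> import org.scalatest.Assertions.assert
> import org.scalatest.concurrent.Eventually._
> import org.scalatest.time.{Minutes, Seconds, Span}
>
> // Poll the driver pod log until the expected line appears, failing after
> // roughly three minutes (matching "Attempted 182 times over 3.00 minutes"
> // above with a ~1 second polling interval).
> def assertDriverLogContains(expected: String)(readDriverLog: => String): Unit = {
>   eventually(timeout(Span(3, Minutes)), interval(Span(1, Seconds))) {
>     val log = readDriverLog
>     assert(log.contains(expected), s"did not contain: $expected")
>   }
> }
>
> // The decommission test waits for this exact string:
> // assertDriverLogContains("Received decommission executor message")(driverLog)
> {code}
> Note that the driver log above does emit "Received executor 1 decommissioned
> message" but never the exact string the suite waits for, so the substring
> check fails until the three-minute timeout expires; the asserted string and
> the emitted log message may simply have drifted apart.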


