[ 
https://issues.apache.org/jira/browse/AIRFLOW-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922312#comment-16922312
 ] 

Diego García commented on AIRFLOW-5385:
---------------------------------------

Yes [~ash] , it's not a problem with Airflow. However, the cpu use running the 
spark-submit --state binary due to the JVM, loading libraries, etc. is 
overwhelming to only check the driver's state.

The figure below shows the decrease of CPU usage after overriding the contrib 
airflow/contrib/operators/spark_submit_operator.py to use the curl approach 
(running 2 DAGs in a worker with 4 cpus):

!image-2019-09-04-11-04-11-937.png!

 In order to be the Spark streaming drivers status tracking stable and scalable 
some action should be considered at least. Maybe lightening the spark-submit 
--status by the Spark developers...

We needed 2 workers with 4 CPUs to only handle 4 DAGs (Spark Streaming apps) 
which eventually trigger false alarms (after ~2 weeks running).

Please find the attached file with the code change.

Regards,

 

> SparkSubmit status spend lot of time
> ------------------------------------
>
>                 Key: AIRFLOW-5385
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5385
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: contrib
>    Affects Versions: 1.10.2
>            Reporter: Sergio Soto
>            Priority: Blocker
>
> Hello,
> we have an issue with SparkSubmitOperator.  Airflow DAGs shows that some 
> streaming applications breaks out. I analyzed this behaviour. The 
> SparkSubmitHook is the responsable of check the driver status.
> We discovered some timeouts and tried to reproduce checking command. This is 
> an execution with `time`:
> {code:java}
> time /opt/java/jdk1.8.0_181/jre/bin/java -cp 
> /opt/shared/spark/client/conf/:/opt/shared/spark/client/jars/* -Xmx1g 
> org.apache.spark.deploy.SparkSubmit --master 
> spark://spark-master.corp.com:6066 --status driver-20190901180337-2749 
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 19/09/02 17:05:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20190901180337-2749 in 
> spark://lgmadbdtpspk01v.corp.logitravelgroup.com:6066.
> 19/09/02 17:05:59 INFO RestSubmissionClient: Server responded with 
> SubmissionStatusResponse:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "RUNNING",
>   "serverSparkVersion" : "2.2.1",
>   "submissionId" : "driver-20190901180337-2749",
>   "success" : true,
>   "workerHostPort" : "172.25.10.194:45441",
>   "workerId" : "worker-20190821201014-172.25.10.194-45441"
> }
> real 0m11.598s 
> user 0m2.092s 
> sys 0m0.222s{code}
> We analyzed the Scala code and Spark API. This spark-submit status command 
> ends with a http get request to an url. Using curl, this is the time spent by 
> spark master to return status:
> {code:java}
>  time curl 
> "http://spark-master.corp.com:6066/v1/submissions/status/driver-20190901180337-2749";
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "RUNNING",
>   "serverSparkVersion" : "2.2.1",
>   "submissionId" : "driver-20190901180337-2749",
>   "success" : true,
>   "workerHostPort" : "172.25.10.194:45441",
>   "workerId" : "worker-20190821201014-172.25.10.194-45441"
> }
> real  0m0.011s
> user  0m0.000s
> sys   0m0.006s
> {code}
> Task spends 11.59 seconds with spark submit versus 0.011seconds with curl
> How can be this behaviour explained?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to