[ 
https://issues.apache.org/jira/browse/AIRFLOW-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006131#comment-17006131
 ] 

ASF GitHub Bot commented on AIRFLOW-5385:
-----------------------------------------

potiuk commented on pull request #6976: [AIRFLOW-5385] spark hook does not work 
on spark 2.3/2.4
URL: https://github.com/apache/airflow/pull/6976


> SparkSubmit status spend lot of time
> ------------------------------------
>
>                 Key: AIRFLOW-5385
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5385
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: contrib
>    Affects Versions: 1.10.2
>            Reporter: Sergio Soto
>            Assignee: t oo
>            Priority: Blocker
>
> Hello,
> we have an issue with SparkSubmitOperator. Airflow DAGs show that some
> streaming applications break. I analyzed this behaviour: the
> SparkSubmitHook is responsible for checking the driver status.
> We discovered some timeouts and tried to reproduce the status-check
> command. This is an execution with `time`:
> {code:java}
> time /opt/java/jdk1.8.0_181/jre/bin/java -cp \
>   /opt/shared/spark/client/conf/:/opt/shared/spark/client/jars/* -Xmx1g \
>   org.apache.spark.deploy.SparkSubmit --master \
>   spark://spark-master.corp.com:6066 --status driver-20190901180337-2749
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 19/09/02 17:05:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20190901180337-2749 in 
> spark://lgmadbdtpspk01v.corp.logitravelgroup.com:6066.
> 19/09/02 17:05:59 INFO RestSubmissionClient: Server responded with 
> SubmissionStatusResponse:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "RUNNING",
>   "serverSparkVersion" : "2.2.1",
>   "submissionId" : "driver-20190901180337-2749",
>   "success" : true,
>   "workerHostPort" : "172.25.10.194:45441",
>   "workerId" : "worker-20190821201014-172.25.10.194-45441"
> }
> real 0m11.598s 
> user 0m2.092s 
> sys 0m0.222s{code}
> We analyzed the Scala code and the Spark API. This spark-submit status
> command ends with an HTTP GET request to a URL. Using curl, this is the time
> the Spark master takes to return the status:
> {code:java}
> time curl "http://spark-master.corp.com:6066/v1/submissions/status/driver-20190901180337-2749"
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "RUNNING",
>   "serverSparkVersion" : "2.2.1",
>   "submissionId" : "driver-20190901180337-2749",
>   "success" : true,
>   "workerHostPort" : "172.25.10.194:45441",
>   "workerId" : "worker-20190821201014-172.25.10.194-45441"
> }
> real  0m0.011s
> user  0m0.000s
> sys   0m0.006s
> {code}
> The task spends about 11.6 seconds with spark-submit versus 0.011 seconds
> with curl.
> How can this behaviour be explained?
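
A minimal sketch of the direct status check that the curl call above performs, assuming the master's REST endpoint is reachable over plain HTTP on the same host and port as the `spark://` master URL. The helper names (`build_status_url`, `parse_driver_state`, `fetch_driver_state`) and the 5-second timeout are hypothetical and are not taken from SparkSubmitHook; only the `/v1/submissions/status/<id>` path and the response fields come from the output quoted above.

```python
import json
from urllib.parse import urlparse
from urllib.request import urlopen


def build_status_url(master: str, submission_id: str) -> str:
    """Turn a spark://host:port master URL into the REST status URL.

    Assumption: the REST server listens on the same host and port as the
    master URL, as in the curl example from the issue.
    """
    parsed = urlparse(master)
    if parsed.scheme != "spark" or not parsed.hostname or not parsed.port:
        raise ValueError(f"expected spark://host:port, got {master!r}")
    return (
        f"http://{parsed.hostname}:{parsed.port}"
        f"/v1/submissions/status/{submission_id}"
    )


def parse_driver_state(body: str) -> str:
    """Extract driverState from a SubmissionStatusResponse JSON body."""
    data = json.loads(body)
    if not data.get("success"):
        raise RuntimeError(f"status request was not successful: {data}")
    return data["driverState"]


def fetch_driver_state(master: str, submission_id: str, timeout: float = 5.0) -> str:
    """Query the master directly over HTTP (hypothetical helper)."""
    url = build_status_url(master, submission_id)
    with urlopen(url, timeout=timeout) as response:
        return parse_driver_state(response.read().decode("utf-8"))
```

Calling `fetch_driver_state("spark://spark-master.corp.com:6066", "driver-20190901180337-2749")` would issue the same GET request as the curl command, without starting a JVM for each status poll.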



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
