[jira] [Updated] (SPARK-33864) How can we submit or initiate multiple application with single or few JVM

Ramesha Bhatta (Jira) Wed, 03 Mar 2021 02:58:09 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-33864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ramesha Bhatta updated SPARK-33864:
-----------------------------------
    Description: 
How can a single JVM, or a few JVM processes, submit multiple applications to 
the cluster?

It is observed that each spark-submit opens up to 400 JARs of >1 GB size, 
creates __spark_conf__XXXX.zip in /tmp, and copies it under the 
application-specific .staging directory. When run concurrently, the number of 
JVMs that a server can support is limited, and submit alone takes 

In our use case, this zip file is created literally millions of times without 
any actual change in configuration, which is not efficient. There should be an 
option to create it only when needed and an option to re-use (cache) it.

The direct impact is that any submission with concurrency >40 (the number of 
hyperthreaded cores) leads to failure and CPU overload on the GW. We tried 
Livy, but noticed that in the background it also does a spark-submit, so the 
same problem persists: we get "response code 404" and observe the same CPU 
overload on the server running Livy.

The concurrency comes from mini-batches over REST. We expect, and are trying, 
to support 2000+ concurrent requests as long as the cluster has the resources 
to support them. For this, spark-submit is the major bottleneck because of the 
situation explained above.

For JAR submission, we have more than one workaround: (1) pre-distribute the 
JARs to a specified folder and refer to them with the local keyword, or (2) 
stage the JARs in an HDFS location and specify the HDFS reference, so that 
there is no file copy per application.
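
The two JAR workarounds can be sketched as follows. This is only an 
illustration, assuming the JARs in question are the Spark runtime JARs 
configured through spark.yarn.jars (application JARs passed with --jars accept 
the same local: and hdfs: URI schemes). The object name, method names and 
directory arguments are hypothetical; the helper only assembles spark-submit 
arguments and does not launch anything.

```scala
// Hypothetical helper, not part of Spark: it only builds command-line
// arguments. The directories are placeholders supplied by the caller.
object JarWorkarounds {

  /** Workaround 1: JARs pre-distributed on every node, referenced via local:. */
  def localJars(dirOnEveryNode: String): Seq[String] =
    Seq("--conf", s"spark.yarn.jars=local:$dirOnEveryNode/*")

  /** Workaround 2: JARs staged once in HDFS and referenced from there. */
  def hdfsJars(hdfsDir: String): Seq[String] =
    Seq("--conf", s"spark.yarn.jars=$hdfsDir/*")
}
```

In both cases the submitting JVM no longer has to copy the JARs into the 
per-application .staging directory.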

Looking at the code in yarn/Client.scala, it appears possible to make this 
change in spark-submit, hence this enhancement request. 
Please prioritize.

I guess the change needed is in 
[https://github.com/apache/spark/blob/48f93af9f3d40de5bf087eb1a06c1b9954b2ad76/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala]
 line 745 ("val confArchive = File.createTempFile(LOCALIZED_CONF_DIR, ".zip", 
new File(Utils.getLocalDir(sparkConf)))")....

Adding some logic, such as checking when the file was last created or whether 
it already exists, to avoid re-creating it repeatedly is the right thing to do.
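
One possible shape for such logic is sketched below. It is not the Spark 
implementation: ConfArchiveCache, fingerprint and getOrCreate are hypothetical 
names, the configuration is simplified to a Map of key/value pairs, and the 
archive holds only a single properties entry, whereas the real archive carries 
more files. The idea is to name the archive after a hash of its contents, so 
an unchanged configuration maps to a file that already exists.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, StandardCopyOption}
import java.security.MessageDigest
import java.util.zip.{ZipEntry, ZipOutputStream}

// Hypothetical helper, not part of Spark: caches the conf archive by content hash.
object ConfArchiveCache {

  /** SHA-256 over the sorted key=value pairs, so equal configs give equal names. */
  def fingerprint(conf: Map[String, String]): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    conf.toSeq.sorted.foreach { case (k, v) =>
      digest.update(s"$k=$v\n".getBytes(StandardCharsets.UTF_8))
    }
    digest.digest().map("%02x".format(_)).mkString
  }

  /** Returns the cached archive for `conf`, creating it only if it is missing. */
  def getOrCreate(conf: Map[String, String], cacheDir: File): File = {
    cacheDir.mkdirs()
    val target = new File(cacheDir, s"__spark_conf__${fingerprint(conf)}.zip")
    if (!target.exists()) {
      // Write to a temp file first, then move it into place, so a concurrent
      // submitter does not read a half-written archive.
      val tmp = File.createTempFile("__spark_conf__", ".zip.tmp", cacheDir)
      val zip = new ZipOutputStream(Files.newOutputStream(tmp.toPath))
      try {
        zip.putNextEntry(new ZipEntry("__spark_conf__.properties"))
        conf.toSeq.sorted.foreach { case (k, v) =>
          zip.write(s"$k=$v\n".getBytes(StandardCharsets.UTF_8))
        }
        zip.closeEntry()
      } finally {
        zip.close()
      }
      Files.move(tmp.toPath, target.toPath, StandardCopyOption.ATOMIC_MOVE)
    }
    target
  }
}
```

Keying on a content hash rather than on the file's age means a real 
configuration change still produces a new archive, while repeated submits 
with the same configuration skip the zip step.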

The second change is to avoid distributing this archive for every application 
and instead reuse it from a shared HDFS location.
==

// Upload the conf archive to HDFS manually, and record its location in the configuration.
// This will allow the AM to know where the conf archive is in HDFS, so that it can be
// distributed to the containers.
//
// This code forces the archive to be copied, so that unit tests pass (since in that case both
// file systems are the same and the archive wouldn't normally be copied). In most (all?)
// deployments, the archive would be copied anyway, since it's a temp file in the local file
// system.
val remoteConfArchivePath = new Path(destDir, LOCALIZED_CONF_ARCHIVE)
val remoteFs = FileSystem.get(remoteConfArchivePath.toUri(), hadoopConf)
cachedResourcesConf.set(CACHED_CONF_ARCHIVE, remoteConfArchivePath.toString())

val localConfArchive = new Path(createConfArchive().toURI())
copyFileToRemote(destDir, localConfArchive, replication, symlinkCache, force = true,
  destName = Some(LOCALIZED_CONF_ARCHIVE))
===
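
The reuse from a shared location could look roughly like the sketch below. 
It does not use the Hadoop API, so that it runs without a cluster: 
RemoteStore, InMemoryStore and SharedConfArchive are hypothetical names, and 
the fingerprint argument is assumed to be a caller-supplied string that 
identifies the configuration. In a real change the exists/upload calls would 
map onto the remote file system used in the excerpt above.

```scala
import scala.collection.mutable

// Hypothetical abstraction over the remote file system (HDFS in this report).
trait RemoteStore {
  def exists(path: String): Boolean
  def upload(localPath: String, remotePath: String): Unit
}

// In-memory stand-in, used only to make the sketch runnable without a cluster.
class InMemoryStore extends RemoteStore {
  val files = mutable.Map.empty[String, String]
  var uploads = 0
  def exists(path: String): Boolean = files.contains(path)
  def upload(localPath: String, remotePath: String): Unit = {
    files(remotePath) = localPath
    uploads += 1
  }
}

object SharedConfArchive {

  /** Uploads the archive only if the shared copy is missing; returns the remote path. */
  def ensureUploaded(
      store: RemoteStore,
      localArchive: String,
      sharedDir: String,
      fingerprint: String): String = {
    val remotePath = s"$sharedDir/__spark_conf__$fingerprint.zip"
    if (!store.exists(remotePath)) store.upload(localArchive, remotePath)
    remotePath
  }
}
```

With this shape, only the first submit for a given configuration pays for the 
copy; later submits just record the shared path.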

Regards,
 -Ramesh

  was:
Avoid re-creating __spark_conf__5678XXXX.zip in /tmp for each application 
submit and copy under application specific .staging directory

In our use-case, literally millions of time creation of this zip file before 
any actual change in configuration is not efficient and there should have been 
an option to create this on need basis and option to re-use (cache).

Direct impact is any submission with concurrency >40 (#of hyperthreaded cores) 
leads to failure and CPU overload on GW. Tried Livy, however noticed, in the 
background this solution also does a spark-submit and same problem persists and 
getting "response code 404" and observe the same CPU overload on server running 
livy. The concurrency is due to mini-batches over REST and expecting and try to 
support 2000+ concurrent requests as long as we have the resource to support in 
the cluster. For this spark-submit is the major bottleneck because of the 
explained situation. For JARS submission, we have more than one work-around 
(1.pre-distribute the jars to a specified folder and refer local keyword or 2) 
stage the JARS in a HDFS location and specify HDFS reference thus no file-copy 
per application).

Looking at the code yarn/Client.scala, it appeared possible to make change in 
the spark-submit and thus raising a enhancement request. 
Please prioritize.

I guess, the change needed is in 
https://github.com/apache/spark/blob/48f93af9f3d40de5bf087eb1a06c1b9954b2ad76/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
 line 745 ( "val confArchive = File.createTempFile(LOCALIZED_CONF_DIR, ".zip", 
new File(Utils.getLocalDir(sparkConf))) )....

Adding some logic like the last time the file created/file-existence etc. and 
avoid re-creating again repetitively/excessively is right thing to do.

Second change is avoid distributing this for every application and reuse from 
shared HDFS location.
==

// Upload the conf archive to HDFS manually, and record its location in the configuration.
// This will allow the AM to know where the conf archive is in HDFS, so that it can be
// distributed to the containers.
//
// This code forces the archive to be copied, so that unit tests pass (since in that case both
// file systems are the same and the archive wouldn't normally be copied). In most (all?)
// deployments, the archive would be copied anyway, since it's a temp file in the local file
// system.
val remoteConfArchivePath = new Path(destDir, LOCALIZED_CONF_ARCHIVE)
val remoteFs = FileSystem.get(remoteConfArchivePath.toUri(), hadoopConf)
cachedResourcesConf.set(CACHED_CONF_ARCHIVE, remoteConfArchivePath.toString())

val localConfArchive = new Path(createConfArchive().toURI())
copyFileToRemote(destDir, localConfArchive, replication, symlinkCache, force = true,
  destName = Some(LOCALIZED_CONF_ARCHIVE))
===

Regards,
-Ramesh


> How can we submit or initiate multiple application with single or few JVM
> -------------------------------------------------------------------------
>
>                 Key: SPARK-33864
>                 URL: https://issues.apache.org/jira/browse/SPARK-33864
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 2.4.5
>            Reporter: Ramesha Bhatta
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

