WamBamBoozle commented on a change in pull request #31162:
URL: https://github.com/apache/spark/pull/31162#discussion_r558449658
##########
File path: R/pkg/inst/worker/daemon.R
##########
@@ -32,6 +32,9 @@ inputCon <- socketConnection(
SparkR:::doServerAuth(inputCon, Sys.getenv("SPARKR_WORKER_SECRET"))
+# Application-specific daemon initialization. Typical use is loading libraries.
+eval(parse(text = Sys.getenv("SPARKR_DAEMON_INIT")))
+
Review comment:
To give you a sense of the cost of this, consider
```
> library(microbenchmark)
> microbenchmark(NULL, times = 999999)
Unit: nanoseconds
expr min lq mean median uq max neval
NULL 2 4 4.607765 4 5 11552 999999
```
so on my 2018 MacBook Pro (2.2 GHz 6-Core Intel Core i7), R evaluates NULL in
about 4 nanoseconds (median).
```
> Sys.setenv(x="NULL")
> microbenchmark(eval(parse(text = Sys.getenv("x"))), times=99999)
Unit: microseconds
                                expr    min    lq     mean median      uq      max neval
 eval(parse(text = Sys.getenv("x"))) 33.854 35.82 40.15034 37.479 39.4475 7219.072 99999
```
So unpacking the environment variable and evaluating it costs about 40
microseconds on average.
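If even that 40 microseconds mattered on a fork-heavy workload, a guard could skip the parse/eval entirely whenever the variable is unset, which is the common case. A rough sketch (not part of this patch, untested):
```
# Sketch: only parse and eval when SPARKR_DAEMON_INIT is actually set.
# nzchar() is TRUE for a non-empty string; Sys.getenv() returns "" when
# the variable is unset, so unset daemons pay only the getenv cost.
init <- Sys.getenv("SPARKR_DAEMON_INIT")
if (nzchar(init)) {
  eval(parse(text = init))
}
```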
For comparison, consider
- the 6 milliseconds we spend loading worker.R at every fork (we could load
it once and then invoke it as a function, saving 6 milliseconds on every
fork).
- the time applications save by moving their initialization here. For
example, our application takes 5 seconds to load its libraries, and we have
40 thousand partitions: 5 s * 40,000 = 200,000 s, about 56 hours of CPU time
saved on every call to gapply.
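Spelling out the arithmetic behind that estimate (using the 5 s load time and 40,000 partitions from our workload):
```
# Per-fork library-load time and partition count from our application.
seconds_per_load <- 5
partitions <- 40000

total_cpu_seconds <- seconds_per_load * partitions  # 200,000 s
total_cpu_seconds / 3600                            # ~55.6 hours, i.e. roughly 56
```
Against that, the 40-microsecond eval overhead per fork is noise: 40 us * 40,000 forks is about 1.6 seconds total.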
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]