gaogaotiantian commented on code in PR #54258:
URL: https://github.com/apache/spark/pull/54258#discussion_r2801534033


##########
python/pyspark/worker_util.py:
##########
@@ -155,39 +155,42 @@ def setup_spark_files(infile: IO) -> None:
 def setup_broadcasts(infile: IO) -> None:
     """
     Set up broadcasted variables.
+    {
+        "conn_info": int | str | None,
+        "auth_secret": str | None,
+        "broadcast_variables": [
+            {
+                "bid": int,
+                "path": str | None,
+            }
+        ]
+    }
     """
     if not is_remote_only():
         from pyspark.core.broadcast import Broadcast, _broadcastRegistry
 
-    # fetch names and values of broadcast variables
-    needs_broadcast_decryption_server = read_bool(infile)
-    num_broadcast_variables = read_int(infile)
-    if needs_broadcast_decryption_server:
+    data = json.loads(utf8_deserializer.loads(infile))

Review Comment:
   Performance impact is a big red herring. This change introduced two kinds of 
"overhead":
   * CPU time to encode/decode JSON
   * extra bytes through the JVM/worker connection (which is on the same machine)
   
   Decoding a small JSON string takes about 1us, and it's probably in the same 
range on the Scala side. A local connection runs at no less than 10Gbps, so an 
extra 100 bytes takes about 0.1us.
   
   That's the overhead we introduce for every UDF run.
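   For reference, both figures are easy to sanity-check with a quick standalone 
micro-benchmark (this is just a sketch, not part of the PR; the payload below 
only mirrors the shape of the docstring schema, and the `bid`/`path` values are 
made up for illustration):

   ```python
   import json
   import timeit

   # Hypothetical payload mirroring the shape of the setup_broadcasts message
   # (field names taken from the docstring schema in this PR; values invented).
   payload = json.dumps({
       "conn_info": None,
       "auth_secret": None,
       "broadcast_variables": [{"bid": 0, "path": "/tmp/broadcast_0"}],
   })

   # CPU cost: decode the message many times and report the per-call average.
   n = 100_000
   elapsed = timeit.timeit(lambda: json.loads(payload), number=n)
   print(f"json.loads: {elapsed / n * 1e6:.2f} us per call")

   # Wire cost: time to push the extra bytes over a 10 Gbps local link.
   extra_bytes = len(payload)
   wire_us = extra_bytes * 8 / 10e9 * 1e6
   print(f"{extra_bytes} bytes over a 10 Gbps link: {wire_us:.3f} us")
   ```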
   
   Currently, without reuse-worker, each worker takes a few hundred ms to 
spawn. I made an optimization a few weeks ago that eliminated 100-200ms per 
spawn for reused workers and no one even noticed.
   
   1us is 0.001% of 100ms. That's literally nothing. If we care about 1us, we 
have serious issues with our current UDF path. I can find a lot of 1us wins in 
our current code if that's what it takes to make our protocol more stable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

