[GitHub] [spark] HyukjinKwon commented on a change in pull request #30486: [SPARK-33530][CORE] Support --archives and spark.archives option natively

GitBox Tue, 24 Nov 2020 06:34:11 -0800


HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r529590417




##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -535,13 +538,23 @@ private[spark] object Utils extends Logging {
       doFetchFile(url, targetDir, fileName, conf, securityMgr, hadoopConf)
     }
 
-    // Decompress the file if it's a .tar or .tar.gz
-    if (fileName.endsWith(".tar.gz") || fileName.endsWith(".tgz")) {
-      logInfo("Untarring " + fileName)
-      executeAndGetOutput(Seq("tar", "-xzf", fileName), targetDir)
-    } else if (fileName.endsWith(".tar")) {
-      logInfo("Untarring " + fileName)
-      executeAndGetOutput(Seq("tar", "-xf", fileName), targetDir)
+    if (shouldUntar) {
+      // Decompress the file if it's a .tar or .tar.gz
+      if (fileName.endsWith(".tar.gz") || fileName.endsWith(".tgz")) {
+        logWarning(
+          "Untarring behavior is deprecated at spark.files and " +
+            "SparkContext.addFile. Use spark.archives or SparkContext.addArchive " +
+            "instead.")
+        logInfo("Untarring " + fileName)
+        executeAndGetOutput(Seq("tar", "-xzf", fileName), targetDir)

Review comment:
       `spark.files` and `SparkContext.addFile` have an undocumented, hidden
behaviour: on the executor side only, they untar files whose names end in
`.tar.gz` or `.tgz`. I think it makes sense to deprecate this behaviour and
encourage users to use explicit archive handling (`spark.archives` or
`SparkContext.addArchive`) instead.
   
   Also, I believe it is good practice to avoid relying on external programs
such as `tar` anyway.
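
For illustration only, the implicit behaviour being deprecated amounts to a suffix check along the lines of the sketch below. It mirrors the branches visible in the diff above, but `UntarSketch` and `untarCommand` are hypothetical names introduced here, not part of Spark, and the sketch only builds the command rather than running it.

```scala
// Hypothetical helper, not Spark code: reproduces the file-name check from the
// diff above and returns the external `tar` command that would be run, if any.
object UntarSketch {
  def untarCommand(fileName: String): Option[Seq[String]] = {
    if (fileName.endsWith(".tar.gz") || fileName.endsWith(".tgz")) {
      // Gzip-compressed tarball: extract with decompression.
      Some(Seq("tar", "-xzf", fileName))
    } else if (fileName.endsWith(".tar")) {
      // Plain tarball: extract without decompression.
      Some(Seq("tar", "-xf", fileName))
    } else {
      // Any other file is left untouched.
      None
    }
  }
}
```

Because the decision depends purely on the file name, a user who only wanted to ship a `.tgz` file unchanged via `spark.files` would still get it extracted on executors, which is why an explicit archive option is easier to reason about.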




