HyukjinKwon commented on a change in pull request #30486:
URL: https://github.com/apache/spark/pull/30486#discussion_r529590417
##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -535,13 +538,23 @@ private[spark] object Utils extends Logging {
doFetchFile(url, targetDir, fileName, conf, securityMgr, hadoopConf)
}
- // Decompress the file if it's a .tar or .tar.gz
- if (fileName.endsWith(".tar.gz") || fileName.endsWith(".tgz")) {
- logInfo("Untarring " + fileName)
- executeAndGetOutput(Seq("tar", "-xzf", fileName), targetDir)
- } else if (fileName.endsWith(".tar")) {
- logInfo("Untarring " + fileName)
- executeAndGetOutput(Seq("tar", "-xf", fileName), targetDir)
+ if (shouldUntar) {
+ // Decompress the file if it's a .tar or .tar.gz
+ if (fileName.endsWith(".tar.gz") || fileName.endsWith(".tgz")) {
+ logWarning(
+ "Untarring behavior is deprecated at spark.files and " +
+ "SparkContext.addFile. Use spark.archives or SparkContext.addArchive " +
+ "instead.")
+ logInfo("Untarring " + fileName)
+ executeAndGetOutput(Seq("tar", "-xzf", fileName), targetDir)
Review comment:
`spark.files` and `SparkContext.addFile` have an undocumented, hidden
behaviour: on the executor side only, files ending in `.tar.gz` or `.tgz`
are untarred automatically. I think it makes sense to deprecate this
behaviour and encourage users to handle archives explicitly via
`spark.archives` or `SparkContext.addArchive`.
Also, I believe it's good practice to avoid relying on external programs
such as `tar` anyway.
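
For reference, the implicit behaviour being deprecated boils down to a
suffix check on the file name. Below is a minimal, self-contained sketch of
that check, based only on the conditions visible in the diff above;
`ArchiveSuffix` and `tarFlags` are hypothetical names used for illustration
and are not part of Spark.

```scala
// Hypothetical helper, not part of Spark: it only mirrors the suffix
// checks from the removed lines in the diff above.
object ArchiveSuffix {
  /**
   * Returns the flags the old code path passed to the external `tar`
   * program, or None when the file would have been left untouched.
   */
  def tarFlags(fileName: String): Option[String] =
    if (fileName.endsWith(".tar.gz") || fileName.endsWith(".tgz")) {
      Some("-xzf") // gzip-compressed tarball
    } else if (fileName.endsWith(".tar")) {
      Some("-xf") // plain tarball
    } else {
      None // not an archive the old code recognised
    }
}
```

Any file for which this returns a value is one that was silently unpacked on
executors, and is therefore a candidate for moving to `spark.archives` or
`SparkContext.addArchive`, as the new warning suggests.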