[ https://issues.apache.org/jira/browse/SPARK-37958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37958:
------------------------------------

    Assignee:     (was: Apache Spark)

> PySpark SparkContext.addFile() does not respect spark.files.overwrite
> ---------------------------------------------------------------------
>
>                 Key: SPARK-37958
>                 URL: https://issues.apache.org/jira/browse/SPARK-37958
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, Input/Output, Java API
>    Affects Versions: 3.1.1
>            Reporter: taylor schneider
>            Priority: Major
>
> I am currently running Apache Spark 3.1.1 on Kubernetes.
> When I try to re-add a file that has already been added, the updated file is
> not actually loaded into the cluster, and I see the following warning from
> addFile():
> {code:java}
> 22/01/18 19:05:50 WARN SparkContext: The path http://15.4.12.12:80/demo_data.csv has been added already. Overwriting of added paths is not supported in the current version.
> {code}
> When I display the DataFrame that was loaded, I see the old data. If I log
> into the worker pods and delete the file, the same result is observed.
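> To make the sequence concrete, here is a minimal sketch of the calls
> involved (the URL matches the warning above; the second addFile() is the
> one that silently keeps the stale copy):
> {code:python}
> from pyspark import SparkFiles
>
> url = "http://15.4.12.12:80/demo_data.csv"
> sc.addFile(url)                         # first add: file is fetched
> # ... demo_data.csv changes on the HTTP server ...
> sc.addFile(url)                         # second add: only logs the WARN above
> print(SparkFiles.get("demo_data.csv"))  # local path still holds the old data
> {code}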
> My SparkConf has the following configuration:
> {code:java}
> ('spark.master', 'k8s://https://15.4.7.11:6443')
> ('spark.app.name', 'spark-jupyter-mlib')
> ('spark.submit.deploy.mode', 'cluster')
> ('spark.kubernetes.container.image', 'tschneider/apache-spark-k8:v7')
> ('spark.kubernetes.namespace', 'spark')
> ('spark.kubernetes.pyspark.pythonVersion', '3')
> ('spark.kubernetes.authenticate.driver.serviceAccountName', 'spark-sa')
> ('spark.kubernetes.authenticate.serviceAccountName', 'spark-sa')
> ('spark.executor.instances', '3')
> ('spark.executor.cores', '2')
> ('spark.executor.memory', '4096m')
> ('spark.executor.memoryOverhead', '1024m')
> ('spark.driver.memory', '1024m')
> ('spark.driver.host', '15.4.12.12')
> ('spark.files.overwrite', 'true')
> ('spark.files.useFetchCache', 'false')
> {code}
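> For reference, a conf like the above can be applied roughly as follows (a
> sketch; only a few of the pairs listed above are repeated here):
> {code:python}
> from pyspark import SparkConf
> from pyspark.sql import SparkSession
>
> conf = SparkConf().setAll([
>     ('spark.master', 'k8s://https://15.4.7.11:6443'),
>     ('spark.files.overwrite', 'true'),
>     ('spark.files.useFetchCache', 'false'),
>     # ... remaining pairs from the list above ...
> ])
> spark_session = SparkSession.builder.config(conf=conf).getOrCreate()
> sc = spark_session.sparkContext
> {code}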
> According to the documentation for 3.1.1, the spark.files.overwrite
> parameter should cause updated files to be reloaded. The documentation can
> be found here:
> [https://spark.apache.org/docs/3.1.1/configuration.html]
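> A quick way to confirm the setting actually reached the running context
> (assuming the session built above):
> {code:python}
> # Both print the values set in the conf, so the setting is applied.
> print(sc.getConf().get('spark.files.overwrite'))      # 'true'
> print(sc.getConf().get('spark.files.useFetchCache'))  # 'false'
> {code}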
> The only workaround I have found is to use a Python function to manually
> delete and re-download the file on each worker. Calling addFile() still
> shows the warning in this case. My code for the delete and re-download is
> as follows:
> {code:python}
> def os_remove(file_path):
>     # Runs on a worker: delete the stale local copy of the file if present.
>     import os
>     import socket
>
>     hostname = socket.gethostname()
>     action = None
>     if os.path.exists(file_path):
>         action = "delete"
>         os.remove(file_path)
>     return (hostname, action)
>
> def download_updated_file(file_url):
>     # Runs on a worker: re-download the file to the local filesystem root.
>     import os
>     import urllib.parse as parse
>     import urllib.request as request
>
>     file_name = os.path.basename(parse.urlparse(file_url).path)
>     local_file_path = "/{0}".format(file_name)
>     request.urlretrieve(file_url, local_file_path)
>
> # Plain local path (os.path.exists does not understand the file:// scheme).
> worker_file_path = "/{0}".format(csv_file_name)
> worker_count = int(spark_session.conf.get('spark.executor.instances'))
>
> # Spread a dummy RDD across the workers so each deletes its local copy ...
> rdd = sc.parallelize(range(worker_count)).map(lambda var: os_remove(worker_file_path))
> rdd.collect()
>
> # ... and then re-downloads the updated file.
> rdd = sc.parallelize(range(worker_count)).map(lambda var: download_updated_file(csv_file_url))
> rdd.collect()
> {code}
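> With the file refreshed on every worker, re-reading it picks up the new
> contents (csv_file_name as above; the header option is an assumption about
> the CSV):
> {code:python}
> df = spark_session.read.csv("file:///{0}".format(csv_file_name), header=True)
> df.show()
> {code}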
> I believe this is either a bug or a documentation mistake. Perhaps the 
> configuration parameter has a misleading description?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
