[ 
https://issues.apache.org/jira/browse/SPARK-42198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699044#comment-17699044
 ] 

Sean R. Owen commented on SPARK-42198:
--------------------------------------

You would not add /dbfs on Databricks in this case; that is neither relevant 
nor the issue.
What if you escape the path as if in a URL?
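
As a minimal sketch of what "escaping as in a URL" could look like, the 
snippet below percent-encodes a path with Python's standard 
urllib.parse.quote while leaving the '/' separators intact. The helper name 
escape_path is hypothetical, and whether Spark's reader on this platform 
accepts the escaped form is an assumption to test, not a confirmed fix.

```python
from urllib.parse import quote


def escape_path(path: str) -> str:
    """Percent-encode non-ASCII and reserved characters, keeping '/' as is.

    quote() encodes text as UTF-8 by default, so 'ñ' becomes '%C3%B1'.
    """
    return quote(path, safe="/")


# Directory and file name taken from the issue description.
base = "/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/"
name = "Amalia471_Magaña874_3912696a-0aef-492e-83ef-468262b82966.xml"

escaped = escape_path(base + name)
print(escaped)

# Assumption: passing the escaped path to the reader is the experiment to try.
# It needs a live Spark session, so it is left commented out here.
# df = (
#     spark.read.format("xml")
#     .option("rowTag", "ClinicalDocument")
#     .load(escaped)
# )
```

Only the accented character changes in this example; the ASCII letters, 
digits, '_', '-', '.' and '/' pass through unmodified.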

> spark.read fails to read filenames with accented characters
> -----------------------------------------------------------
>
>                 Key: SPARK-42198
>                 URL: https://issues.apache.org/jira/browse/SPARK-42198
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Tarique Anwer
>            Priority: Major
>
> Unable to read files with accented characters in the filename.
> *Sample error:*
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 43 in stage 1.0 failed 4 times, most recent failure: Lost task 43.3 in stage 1.0 (TID 105) (10.139.64.5 executor 0): java.io.FileNotFoundException: /4842022074360943/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/Amalia471_Magaña874_3912696a-0aef-492e-83ef-468262b82966.xml{code}
>  
> *Steps to reproduce error:*
> {code:java}
> %sh
> mkdir -p /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass
> wget https://synthetichealth.github.io/synthea-sample-data/downloads/synthea_sample_data_ccda_sep2019.zip \
>   -O ./synthea_sample_data_ccda_sep2019.zip
> unzip ./synthea_sample_data_ccda_sep2019.zip \
>   -d /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/
> {code}
>  
> {code:java}
> spark.conf.set("spark.sql.caseSensitive", "true")
> df = (
>   spark.read.format('xml')
>     .option("rowTag", "ClinicalDocument")
>     .load('/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/')
> ){code}
> Is there a way to deal with this situation where I don't have control over 
> the file names for some reason?



