[
https://issues.apache.org/jira/browse/SPARK-42198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17681771#comment-17681771
]
Narek Karapetian commented on SPARK-42198:
------------------------------------------
I think you missed the `dbfs` part of the path.
{code:java}
spark.conf.set("spark.sql.caseSensitive", "true")
df = (
    spark.read.format('xml')
    .option("rowTag", "ClinicalDocument")
    .load('/dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/José_Emilio366_Macías944_1e740307-8780-4542-abeb-7037a2557a0e.xml')
)
{code}
It works for me.
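If it helps, here is a minimal sketch of how the prefix could be added and checked before calling spark.read. The helper names `with_dbfs_prefix` and `check_readable` and the `DBFS_MOUNT` constant are hypothetical, not part of Spark or any Databricks API, and the sketch assumes the files are visible on the local filesystem under `/dbfs`, as in the reproduction steps.

```python
import os

# Assumption: "/dbfs" is the local mount point, as used in the paths in this thread.
DBFS_MOUNT = "/dbfs"


def with_dbfs_prefix(path: str, mount: str = DBFS_MOUNT) -> str:
    """Return `path` with the mount prefix added if it is missing."""
    if path == mount or path.startswith(mount + "/"):
        return path  # already prefixed, leave unchanged
    return mount + "/" + path.lstrip("/")


def check_readable(path: str) -> str:
    """Return the prefixed path, or raise early if it is not visible locally."""
    full = with_dbfs_prefix(path)
    if not os.path.exists(full):
        raise FileNotFoundError(f"not found on local filesystem: {full}")
    return full
```

The returned string could then be passed to `.load(...)` in the snippet above. Checking with `os.path.exists` first only separates a wrong path from a problem with the accented characters themselves; it does not change how Spark resolves the path.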
> spark.read fails to read filenames with accented characters
> -----------------------------------------------------------
>
> Key: SPARK-42198
> URL: https://issues.apache.org/jira/browse/SPARK-42198
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.1
> Reporter: Tarique Anwer
> Priority: Major
>
> Unable to read files with accented characters in the filename.
> *Sample error:*
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 43 in
> stage 1.0 failed 4 times, most recent failure: Lost task 43.3 in stage 1.0
> (TID 105) (10.139.64.5 executor 0): java.io.FileNotFoundException:
> /4842022074360943/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/Amalia471_Magaña874_3912696a-0aef-492e-83ef-468262b82966.xml{code}
>
> *{{Steps to reproduce error:}}*
> {code:java}
> %sh
> mkdir -p /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass
> wget https://synthetichealth.github.io/synthea-sample-data/downloads/synthea_sample_data_ccda_sep2019.zip -O ./synthea_sample_data_ccda_sep2019.zip
> unzip ./synthea_sample_data_ccda_sep2019.zip -d /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/
> {code}
>
> {code:java}
> spark.conf.set("spark.sql.caseSensitive", "true")
> df = (
>     spark.read.format('xml')
>     .option("rowTag", "ClinicalDocument")
>     .load('/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/José_Emilio366_Macías944_1e740307-8780-4542-abeb-7037a2557a0e.xml')
> )
> {code}
> Is there a way to deal with this situation where I don't have control over
> the file names for some reason?