[
https://issues.apache.org/jira/browse/SPARK-28733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-28733.
----------------------------------
Resolution: Cannot Reproduce
Resolving this. It would be nice if somebody could identify the JIRA that fixed
this issue so we can see whether it can be backported.
> DataFrameReader of Spark not able to recognize the very first quote
> character, while custom unicode quote character is used
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-28733
> URL: https://issues.apache.org/jira/browse/SPARK-28733
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.2
> Reporter: Mrinal Bhattacherjee
> Priority: Major
> Labels: DataFrameReader, SparkCSV, dataframe
>
> I have encountered a strange behaviour recently, while reading a CSV file
> using DataFrameReader of org.apache.spark.sql package (Spark version 2.3.2).
> Here is my spark read code snippet.
> {code}
> val sepChar = "\u00C7"    // Ç
> val quoteChar = "\u1E1C"  // Ḝ
> val escapeChar = "\u1E1D" // ḝ
> val inputCsvFile = "<some-local-windows-path>\\input_ab.csv"
>
> val readDF = sparkSession.read.option("sep", sepChar)
>   .option("encoding", encoding.toUpperCase)
>   .option("quote", quoteChar)
>   .option("escape", escapeChar)
>   .option("header", "false")
>   .option("multiLine", "true")
>   .csv(inputCsvFile)
> readDF.cache()
> readDF.show(20, false)
> {code}
> Due to some problematic data, I'm forced to use Unicode characters as the
> separator, quote, and escape characters instead of the default ones. Below
> is my input sample data.
> {code}
> Ḝ1ḜÇḜsmithḜÇḜ5Ḝ
> Ḝ2ḜÇḜdousonḜÇḜ6Ḝ
> Ḝ3ḜÇḜsr,tendulkarḜÇḜ10Ḝ
> {code}
> Here Ç is field separator, Ḝ is quote character and all the fields values are
> wrapped with this custom quote character.
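> As a sanity check independent of Spark, the same sample data parses cleanly
> with Python's standard csv module using these custom delimiter and quote
> characters (a minimal sketch; the Ç/Ḝ characters are the ones from the sample
> above, and the inline data stands in for the actual file):

```python
import csv
import io

# Sample records using Ç as field separator and Ḝ as quote character.
data = (
    "Ḝ1ḜÇḜsmithḜÇḜ5Ḝ\n"
    "Ḝ2ḜÇḜdousonḜÇḜ6Ḝ\n"
    "Ḝ3ḜÇḜsr,tendulkarḜÇḜ10Ḝ\n"
)

# csv.reader accepts any single-character delimiter and quotechar.
rows = list(csv.reader(io.StringIO(data), delimiter="Ç", quotechar="Ḝ"))
for row in rows:
    print(row)
```

> This is the parse one would expect from Spark as well: three fields per
> record, quotes stripped, with the embedded comma in "sr,tendulkar" preserved.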
> The problem I'm getting is that the first occurrence of the quote character
> is somehow not recognized by Spark. I tried characters other than Unicode,
> like ` ~ X (the letter X just as a testing scenario), and even the default
> quote (") as well. It works fine in all scenarios except when a Unicode
> character is used as the quote character. The first occurrence of the Unicode
> quote character comes through as a non-printable character ��, and hence the
> closing quote character of the first field of the first record gets included
> in the data.
> Here is the output of df show.
> {code}
> +---+------------+-----+
> |id |name |class|
> +---+------------+-----+
> |��1Ḝ |smith |5 |
> |2 |douson |6 |
> |3 |sr,tendulkar|10 |
> +---+------------+-----+
> {code}
> It happens only for the first field of the very first record. The other quote
> characters in this file are read as expected without any issues. When I keep
> an extra empty record at the top of the file, i.e., simply a newline (\n) as
> the very first line, the issue doesn't occur, and that empty row is not
> counted as an empty record in the DataFrame either; thus my problem gets
> solved. But this manipulation cannot be done in production, so it remains an
> issue worth addressing.
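> A symptom that affects only the very first character of a file is sometimes
> caused by a UTF-8 byte-order mark (BOM) at the start of the file. This is
> speculation, not a confirmed diagnosis of this issue, but a hedged
> pre-processing sketch that strips a leading BOM before handing the file to
> Spark (strip_bom is a hypothetical helper, not a Spark API) could rule it out:

```python
import codecs

def strip_bom(path_in, path_out):
    # Hypothetical helper: copy a file, dropping a leading UTF-8 BOM if present.
    with open(path_in, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    with open(path_out, "wb") as f:
        f.write(data)
```

> If the garbled leading character disappears after such a pass, the input file
> most likely carries a BOM that the reader is folding into the first field.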
> I feel this is a bug. If it is not, kindly let me know how to process the
> same data without hitting this issue; otherwise, kindly provide a fix at the
> earliest. Thanks in advance.
> Best Regards,
> Mrinal
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]