[ https://issues.apache.org/jira/browse/SPARK-28733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-28733:
---------------------------------
    Description: 
I have recently encountered some strange behaviour while reading a CSV file using 
the DataFrameReader of the org.apache.spark.sql package (Spark version 2.3.2). Here 
is my Spark read code snippet.

{code:scala}
val sepChar = "\u00C7"    // Ç
val quoteChar = "\u1E1C"  // Ḝ
val escapeChar = "\u1E1D" // ḝ
val inputCsvFile = "<some-local-windows-path>\\input_ab.csv"

val readDF = sparkSession.read.option("sep", sepChar)
  .option("encoding", encoding.toUpperCase)  // `encoding` is defined elsewhere in my code
  .option("quote", quoteChar)
  .option("escape", escapeChar)
  .option("header", "false")
  .option("multiLine", "true")
  .csv(inputCsvFile)
readDF.cache()
readDF.show(20, false)
{code}

Due to messy data, I'm forced to use Unicode characters as the separator, quote, 
and escape characters instead of the defaults. Below is my sample input data.

{code}
Ḝ1ḜÇḜsmithḜÇḜ5Ḝ
Ḝ2ḜÇḜdousonḜÇḜ6Ḝ
Ḝ3ḜÇḜsr,tendulkarḜÇḜ10Ḝ
{code}

Here Ç is the field separator, Ḝ is the quote character, and every field value is 
wrapped in this custom quote character.
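
For reproduction, the sample file can be generated as below (a minimal sketch, 
assuming a local output path and UTF-8 encoding; the real file lives at the 
Windows path shown above):

{code:scala}
// Minimal sketch: write the three sample records to a local CSV file.
// The output path and the UTF-8 charset are assumptions.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

val sampleLines = Seq(
  "Ḝ1ḜÇḜsmithḜÇḜ5Ḝ",
  "Ḝ2ḜÇḜdousonḜÇḜ6Ḝ",
  "Ḝ3ḜÇḜsr,tendulkarḜÇḜ10Ḝ"
)
Files.write(
  Paths.get("input_ab.csv"),
  sampleLines.mkString("\n").getBytes(StandardCharsets.UTF_8)
)
{code}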

The problem I'm seeing is that the very first occurrence of the quote character is 
somehow not recognized by Spark. I tried characters other than Unicode ones, such 
as ` ~ X (the letter X just as a test scenario), and even the default quote (") as 
well. All of those scenarios work fine; it fails only when a Unicode character is 
used as the quote character. The first occurrence of the Unicode quote character 
comes through as a non-printable character (��), so the closing quote of the first 
field of the first record gets included in the data.

Here is the output of the DataFrame show:

{code}
+----+------------+-----+
|id  |name        |class|
+----+------------+-----+
|��1Ḝ|smith       |5    |
|2   |douson      |6    |
|3   |sr,tendulkar|10   |
+----+------------+-----+
{code}
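
For comparison, this is what I'd expect the same show to print, with the opening 
quote of the first field consumed properly:

{code}
+---+------------+-----+
|id |name        |class|
+---+------------+-----+
|1  |smith       |5    |
|2  |douson      |6    |
|3  |sr,tendulkar|10   |
+---+------------+-----+
{code}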

It happens only for the first field of the very first record; all other quote 
characters in the file are read as expected, without any issues. When I keep an 
extra empty line at the top of the file, i.e., simply a newline (\n) as the very 
first line, the issue doesn't occur, and that empty line isn't even treated as an 
empty record in the DataFrame, so my problem gets solved. But this manipulation 
cannot be done in production, so it remains an issue that needs attention.
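
Incidentally, replacement characters (��) at the very start of a file are often 
the signature of an unconsumed byte order mark (BOM) or an encoding mismatch, 
rather than bad data. A minimal check, assuming local access to the file and 
reusing `inputCsvFile` from the snippet above, is to dump its first few bytes; a 
UTF-8 BOM would show up as EF BB BF:

{code:scala}
// Minimal sketch: print the first bytes of the input file as hex.
// A UTF-8 BOM would appear as: EF BB BF
import java.nio.file.{Files, Paths}

val head = Files.readAllBytes(Paths.get(inputCsvFile)).take(8)
println(head.map(b => f"${b & 0xFF}%02X").mkString(" "))
{code}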

I feel this is a bug. If it is not, kindly let me know how to process such a file 
without running into this issue; otherwise, kindly provide a fix at the earliest. 
Thanks in advance.

Best Regards,
Mrinal

> DataFrameReader of Spark not able to recognize the very first quote 
> character, while custom unicode quote character is used
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28733
>                 URL: https://issues.apache.org/jira/browse/SPARK-28733
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: Mrinal Bhattacherjee
>            Priority: Major
>              Labels: DataFrameReader, SparkCSV, dataframe
>


