Hyukjin Kwon updated SPARK-28733:
    Priority: Major  (was: Critical)

> DataFrameReader of Spark not able to recognize the very first quote 
> character, while custom unicode quote character is used
> ---------------------------------------------------------------------------------------------------------------------------
>                 Key: SPARK-28733
>                 URL: https://issues.apache.org/jira/browse/SPARK-28733
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: Mrinal Bhattacherjee
>            Priority: Major
>              Labels: DataFrameReader, SparkCSV, dataframe
> I have encountered a strange behaviour recently, while reading a CSV file 
> using DataFrameReader of org.apache.spark.sql package (Spark version 2.3.2). 
> Here is my spark read code snippet.
> {code:scala}
> val sepChar = "\u00C7"    // Ç
> val quoteChar = "\u1E1C"  // Ḝ
> val escapeChar = "\u1E1D" // ḝ
> val inputCsvFile = "<some-local-windows-path>\\input_ab.csv"
> val readDF = sparkSession.read.option("sep", sepChar)
>   .option("encoding", encoding.toUpperCase)
>   .option("quote", quoteChar)
>   .option("escape", escapeChar)
>   .option("header", "false")
>   .option("multiLine", "true")
>   .csv(inputCsvFile)
> readDF.cache()
> readDF.show(20, false)
> {code}
> Due to some awful data, I'm forced to use Unicode characters as the 
> separator, quote, and escape characters instead of the default ones. Below 
> is my sample input data.
> {noformat}
> Ḝ1ḜÇḜsmithḜÇḜ5Ḝ
> Ḝ2ḜÇḜdousonḜÇḜ6Ḝ
> Ḝ3ḜÇḜsr,tendulkarḜÇḜ10Ḝ
> {noformat}
> Here Ç is the field separator, Ḝ is the quote character, and every field 
> value is wrapped in this custom quote character.
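> As a minimal sketch (no Spark needed; the object and method names below are 
> mine, not from Spark) of what correct tokenization of one sample line should 
> yield with this custom quote (Ḝ) and separator (Ç):

```scala
object SampleParse {
  val sep: Char = '\u00C7' // Ç, field separator
  val quote = "\u1E1C"     // Ḝ, quote character

  // Split on the separator, then strip the wrapping quote characters.
  def parseLine(line: String): Seq[String] =
    line.split(sep).map(_.stripPrefix(quote).stripSuffix(quote)).toSeq
}
```

> Feeding it the first sample record (Ḝ1ḜÇḜsmithḜÇḜ5Ḝ) yields the three fields 
> 1, smith, 5 — which is what Spark should produce for every record, including 
> the first.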
> The problem is that the first occurrence of the quote character is somehow 
> not recognized by Spark. I tried characters other than Unicode, such as 
> ` ~ X (the letter X just as a test), and even the default quote ("). All of 
> those work fine; the issue appears only when a Unicode character is used as 
> the quote character. The first occurrence of the Unicode quote character 
> comes through as a non-printable character (��), so the closing quote of the 
> first field of the first record ends up included in the data.
> Here is the output of df.show:
> {noformat}
> +---+------------+-----+
> |id |name        |class|
> +---+------------+-----+
> |��1Ḝ|smith       |5    |
> |2  |douson      |6    |
> |3  |sr,tendulkar|10   |
> +---+------------+-----+
> {noformat}
> It happens only for the first field of the very first record; all other 
> quote characters in the file are read as expected. When I keep an extra 
> empty record at the top of the file, i.e., simply a newline (\n) as the very 
> first line, the issue does not occur, and that empty line is not even 
> counted as an empty record in the DataFrame. That works around my problem, 
> but this kind of manipulation cannot be done in production, so the 
> underlying issue still needs to be addressed.
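> Since only the very first character of the file is affected, one thing worth 
> checking is whether the file starts with a byte-order mark or is being 
> decoded with the wrong charset. A small diagnostic sketch (the helper name 
> is mine, and it assumes the file is UTF-8) that dumps the first few code 
> points of the file:

```scala
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

object InspectHead {
  // Return the first n code points of the file as U+XXXX strings,
  // assuming the file content is UTF-8 encoded.
  def headCodePoints(path: String, n: Int): Seq[String] = {
    val text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
    text.codePoints().limit(n).toArray.toSeq.map(cp => f"U+$cp%04X")
  }
}
```

> If the first reported code point is U+FEFF (a BOM) rather than U+1E1C, the 
> mangled first quote would point at a BOM/encoding problem rather than the 
> parser itself.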
> I feel this is a bug. If it is not, kindly let me know how to process such a 
> file without hitting this issue; otherwise, kindly provide a fix at the 
> earliest. Thanks in advance.
> Best Regards,
> Mrinal

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org