[ 
https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-14194:
---------------------------------
    Description: 
We have CSV content like the following:

Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
"1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF character), 
Municipality,....","USA", "1234567"

Because there is a '\n\r' sequence in the middle of the row (specifically, in 
the Address column), the Spark code below creates a DataFrame with two rows 
(excluding the header row), which is wrong. Since we have specified the double 
quote (") as the quote character, why is a line break inside a quoted field 
treated as a record separator? This causes problems when we process the 
resulting DataFrame.

    // delim, quote, and escape are variables set elsewhere; quote is ".
    DataFrame df = sqlContextManager.getSqlContext().read()
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .option("inferSchema", "true")
            .option("delimiter", delim)
            .option("quote", quote)
            .option("escape", escape)
            .load(sourceFile);
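
The difference between the observed and the expected row count can be reproduced outside Spark. The following is a minimal sketch in plain Java, not the spark-csv implementation: the class name QuotedNewlineSketch, both method names, and the shortened sample data are hypothetical and chosen only for illustration. It contrasts splitting the input on every line break with a scan that ignores line breaks inside a quoted field. Escape characters are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedNewlineSketch {

    // Naive approach: every line break ends a record, even inside quotes.
    public static int countRecordsByLine(String csv) {
        int count = 0;
        for (String line : csv.split("\\R")) {
            if (!line.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Quote-aware approach: a line break ends a record only when the
    // scanner is outside a quoted field.
    public static List<String> splitRecords(String csv, char quote) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == quote) {
                // A doubled quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if ((c == '\n' || c == '\r') && !inQuotes) {
                // Record boundary; skip the empty pieces between \n and \r.
                if (current.length() > 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the sample in this report:
        // a header plus one data row whose last field contains a line break.
        String csv = "Sl.NO,Employee_Name,Address\n\r"
                + "\"1\",\"ABCD\",\"XZ Street \n\rMunicipality\"\n\r";

        // Header + 2 fragments: the data row is cut in half.
        System.out.println("line-based records:  " + countRecordsByLine(csv));
        // Header + 1 data row: the quoted line break stays inside the field.
        System.out.println("quote-aware records: " + splitRecords(csv, '"').size());
    }
}
```

With this input the line-based count is 3 (the header plus two fragments), which matches the two-row DataFrame reported here, while the quote-aware scan yields 2 (the header plus one data row), which is the expected result.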

   


> spark csv reader not working properly if CSV content contains CRLF character 
> (newline) in the intermediate cell
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14194
>                 URL: https://issues.apache.org/jira/browse/SPARK-14194
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Kumaresh C R
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
