[ 
https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26259:
---------------------------------
    Component/s:     (was: Spark Core)
                 SQL

> RecordSeparator other than newline discovers incorrect schema
> -------------------------------------------------------------
>
>                 Key: SPARK-26259
>                 URL: https://issues.apache.org/jira/browse/SPARK-26259
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: PoojaMurarka
>            Priority: Major
>
> Although https://issues.apache.org/jira/browse/SPARK-21289 (fixed in Spark 2.3) allows record separators other than newline, this does not work when no schema is specified, i.e. when the schema is inferred.
>  Let me explain with the data and scenarios below.
> Input data (input_data.csv) as shown below, *+where the record separator is "\t"+*:
> {noformat}
> "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed"
>     "2012-01-01","0","0","0","0","1","9","9.1","66","0"    
> "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat}
> *Case 1: Schema defined:* The Spark code below, with an explicit *schema*, reads the data correctly:
> {code:java}
> StructType customSchema = new StructType()
>         .add("dteday", DataTypes.DateType)
>         .add("hr", DataTypes.IntegerType)
>         .add("holiday", DataTypes.IntegerType)
>         .add("weekday", DataTypes.IntegerType)
>         .add("workingday", DataTypes.IntegerType)
>         .add("weathersit", DataTypes.IntegerType)
>         .add("temp", DataTypes.IntegerType)
>         .add("atemp", DataTypes.DoubleType)
>         .add("hum", DataTypes.IntegerType)
>         .add("windspeed", DataTypes.IntegerType);
> Dataset<Row> ds = executionContext.getSparkSession().read().format("csv")
>           .option("header", true)
>           .schema(customSchema)  // note: .schema(), not .option("schema", ...) -- options only take string values
>           .option("sep", ",")
>           .load("input_data.csv");
> {code}
> *Case 2: Schema not defined (inferSchema used):* The data is parsed incorrectly, i.e. the entire file is read as column names.
> {code:java}
> Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" )
>           .option( "header", true )
>           .option( "inferSchema", true)
>           .option( "sep", "," )
>           .load( "input_data.csv" );
> {code}
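The failure mode can be reproduced without a Spark cluster. The sketch below is plain Python (not Spark), using an assumed reconstruction of input_data.csv trimmed to two columns: a newline-oriented parser sees the whole "\t"-separated file as a single line, so every value lands in the header row, which matches what schema inference reports above; splitting on the real record separator first recovers the table.

```python
import csv
import io

# Assumed reconstruction of input_data.csv, trimmed to two columns:
# records are separated by "\t", fields by ",".
data = '"dteday","hr"\t"2012-01-01","0"\t"2012-01-01","1"'

# A newline-oriented parser sees exactly one "line" -- the whole file --
# which is what schema inference then treats as the header row.
naive_rows = list(csv.reader(io.StringIO(data)))
assert len(naive_rows) == 1

# Splitting on the actual record separator first recovers the structure.
rows = [next(csv.reader([record])) for record in data.split("\t")]
header, body = rows[0], rows[1:]
print(header)  # ['dteday', 'hr']
print(body)    # [['2012-01-01', '0'], ['2012-01-01', '1']]
```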



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
