PoojaMurarka created SPARK-26259:
------------------------------------
Summary: RecordSeparator other than newline discovers incorrect
schema
Key: SPARK-26259
URL: https://issues.apache.org/jira/browse/SPARK-26259
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.0
Reporter: PoojaMurarka
Fix For: 2.4.1
JIRA https://issues.apache.org/jira/browse/SPARK-21289, fixed in Spark 2.3,
allows record separators other than newline. However, this does not work when
the schema is not specified, i.e. when the schema is inferred.
The data and scenarios below illustrate the problem:
Input Data - (input_data.csv) as shown below: *+where recordSeparator is "\t"+*
{noformat}
"dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed"
"2012-01-01","0","0","0","0","1","9","9.1","66","0"
"2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat}
*Case 1: Schema defined:* The Spark code below, with an explicitly defined
*schema*, reads the data correctly:
{code:java}
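import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Explicit schema for the ten columns of input_data.csv
StructType customSchema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("dteday", DataTypes.DateType, true),
    DataTypes.createStructField("hr", DataTypes.IntegerType, true),
    DataTypes.createStructField("holiday", DataTypes.IntegerType, true),
    DataTypes.createStructField("weekday", DataTypes.IntegerType, true),
    // workingday holds 0/1 flags in the sample data, so IntegerType rather than DateType
    DataTypes.createStructField("workingday", DataTypes.IntegerType, true),
    DataTypes.createStructField("weathersit", DataTypes.IntegerType, true),
    DataTypes.createStructField("temp", DataTypes.IntegerType, true),
    DataTypes.createStructField("atemp", DataTypes.DoubleType, true),
    DataTypes.createStructField("hum", DataTypes.IntegerType, true),
    DataTypes.createStructField("windspeed", DataTypes.IntegerType, true)
});
// The schema is passed through schema(...), not as an option
Dataset<Row> ds = executionContext.getSparkSession().read().format("csv")
    .option("header", true)
    .schema(customSchema)
    .option("sep", ",")
    .load("input_data.csv");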
{code}
*Case 2: Schema not defined (inferSchema is used):* The data is parsed
incorrectly, i.e. the entire file content is read as column names.
{code:java}
Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" )
.option( "header", true )
.option( "inferSchema", true)
.option( "sep", "," )
.load( "input_data.csv" );
{code}
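One way to picture the symptom in Case 2 is with a minimal plain-Java sketch
that does not use Spark at all. The class `RecordSeparatorSketch` and the method
`splitRecords` are hypothetical names introduced only for illustration, and the
sketch is not Spark's parsing code; it only assumes that a reader which still
splits records on newline would see the tab-separated sample as a single record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark: mimics record and field splitting only.
class RecordSeparatorSketch {

    // The sample from input_data.csv, with "\t" between records.
    static final String SAMPLE = String.join("\t",
        "\"dteday\",\"hr\",\"holiday\",\"weekday\",\"workingday\",\"weathersit\",\"temp\",\"atemp\",\"hum\",\"windspeed\"",
        "\"2012-01-01\",\"0\",\"0\",\"0\",\"0\",\"1\",\"9\",\"9.1\",\"66\",\"0\"",
        "\"2012-01-01\",\"1\",\"0\",\"0\",\"0\",\"1\",\"9\",\"7.2\",\"66\",\"9\"");

    // Split raw text into records on a literal separator, then each record into fields on ",".
    static List<String[]> splitRecords(String raw, String recordSeparator) {
        List<String[]> records = new ArrayList<>();
        for (String record : raw.split(Pattern.quote(recordSeparator))) {
            if (!record.isEmpty()) {
                records.add(record.split(","));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Splitting on the intended separator: 3 records of 10 fields each.
        List<String[]> byTab = splitRecords(SAMPLE, "\t");
        System.out.println(byTab.size() + " records, " + byTab.get(0).length + " fields in the first");

        // Splitting on newline: 1 record of 28 fields, i.e. everything looks like a header.
        List<String[]> byNewline = splitRecords(SAMPLE, "\n");
        System.out.println(byNewline.size() + " record, " + byNewline.get(0).length + " fields in the first");
    }
}
```

With the tab separator the header keeps its 10 column names; with newline the
single record carries 28 comma-separated fields, which resembles the reported
behaviour where the whole file ends up as column names.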