Thomas Diesler created SPARK-29068:
--------------------------------------
Summary: CSV read reports incorrect row count
Key: SPARK-29068
URL: https://issues.apache.org/jira/browse/SPARK-29068
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.4
Reporter: Thomas Diesler
Reading the [SFNY example
data|https://github.com/jadeyee/r2d3-part-1-data/blob/master/part_1_data.csv]
in Java like this ...
{code:java}
import static org.apache.spark.sql.types.DataTypes.DoubleType;
import static org.apache.spark.sql.types.DataTypes.IntegerType;

import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

Path srcdir = Paths.get("src/test/resources");
Path inpath = srcdir.resolve("part_1_data.csv");

SparkSession session = getOrCreateSession();
Dataset<Row> dataset = session.read()
        //.option("header", true)
        .option("mode", "DROPMALFORMED")
        .schema(new StructType()
                .add("insf", IntegerType, false)
                .add("beds", DoubleType, false)
                .add("baths", DoubleType, false)
                .add("price", IntegerType, false)
                .add("year", IntegerType, false)
                .add("sqft", IntegerType, false)
                .add("prcsqft", IntegerType, false)
                .add("elevation", IntegerType, false))
        .csv(inpath.toString());
{code}
... incorrectly reports 495 rows instead of 492. It seems that the three
header rows are included in the count, even though DROPMALFORMED should have
discarded them.
Also, without DROPMALFORMED the dataset contains 495 rows, three of which
consist entirely of null values. This also seems incorrect, because the
schema explicitly declares all fields as non-nullable.
The same code works fine with Spark 2.1.0.
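To illustrate the behaviour I would expect from DROPMALFORMED, here is a plain-Java sketch (no Spark dependency) using synthetic stand-in lines for the top of part_1_data.csv; the file contents and header text below are made up for illustration. A line counts as a row only if every column parses to the non-nullable type declared in the schema, so the header lines must be dropped:

{code:java}
import java.util.List;

public class DropMalformedSketch {

    // Synthetic stand-in for the first lines of part_1_data.csv:
    // three header/comment lines followed by two data rows.
    static final List<String> LINES = List.of(
            "Data on SF and NY homes",                                        // header line
            "in_sf,beds,bath,price,year_built,sqft,price_per_sqft,elevation", // column names
            "source: ...",                                                    // header line
            "0,2.0,1.0,999000,1960,1000,999,10",
            "1,1.0,1.0,550000,2007,700,785,25");

    // Mimic DROPMALFORMED semantics for the schema in the report:
    // int, double, double, int, int, int, int, int -- all non-nullable.
    static boolean parses(String line) {
        String[] f = line.split(",");
        if (f.length != 8) {
            return false;
        }
        try {
            Integer.parseInt(f[0]);
            Double.parseDouble(f[1]);
            Double.parseDouble(f[2]);
            Integer.parseInt(f[3]);
            Integer.parseInt(f[4]);
            Integer.parseInt(f[5]);
            Integer.parseInt(f[6]);
            Integer.parseInt(f[7]);
            return true;
        } catch (NumberFormatException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The three header lines are malformed with respect to the schema,
        // so only the two data rows should be counted.
        long count = LINES.stream().filter(DropMalformedSketch::parses).count();
        System.out.println(count); // prints 2
    }
}
{code}

Under these semantics the header rows can never satisfy the schema, which is why I would expect 492 rather than 495 from the Spark code above.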
--
This message was sent by Atlassian Jira
(v8.3.2#803003)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]