[ https://issues.apache.org/jira/browse/SPARK-24147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24147.
----------------------------------
Resolution: Duplicate
> .count() reports wrong size of dataframe when filtering dataframe on corrupt record field
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-24147
> URL: https://issues.apache.org/jira/browse/SPARK-24147
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 2.2.1
> Environment: Spark version 2.2.1
> Pyspark
> Python version 3.6.4
> Reporter: Rich Smith
> Priority: Major
>
> Spark reports the wrong size of a dataframe via .count() after filtering on the
> corrupt record field.
> Example file that shows the problem:
>
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StringType, StructType, StructField, DoubleType
> from pyspark.sql.functions import col, lit
> spark = SparkSession.builder.master("local[3]").appName("pyspark-unittest").getOrCreate()
> spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
> SCHEMA = StructType([
>     StructField("headerDouble", DoubleType(), False),
>     StructField("ErrorField", StringType(), False)
> ])
> dataframe = (
>     spark.read
>     .option("header", "true")
>     .option("mode", "PERMISSIVE")
>     .option("columnNameOfCorruptRecord", "ErrorField")
>     .schema(SCHEMA)
>     .csv("./x.csv")
> )
> total_row_count = dataframe.count()
> print("total_row_count = " + str(total_row_count))
> errors = dataframe.filter(col("ErrorField").isNotNull())
> errors.show()
> error_count = errors.count()
> print("errors count = " + str(error_count))
> {code}
>
>
> Using input file x.csv:
>
> {code}
> headerDouble
> wrong
> {code}
>
>
> Output text. As shown, the dataframe contains a row, but .count() reports 0.
>
> {code}
> total_row_count = 1
> +------------+----------+
> |headerDouble|ErrorField|
> +------------+----------+
> | null| wrong|
> +------------+----------+
> errors count = 0
> {code}
>
>
> Also discussed briefly on StackOverflow:
> [https://stackoverflow.com/questions/50121899/how-can-sparks-count-function-be-different-to-the-contents-of-the-dataframe]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)