[
https://issues.apache.org/jira/browse/SPARK-22580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-22580.
----------------------------------
Resolution: Duplicate
> Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord)
> always 0
> ---------------------------------------------------------------------------------
>
> Key: SPARK-22580
> URL: https://issues.apache.org/jira/browse/SPARK-22580
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.2.0
> Environment: Same behavior on Debian and MS Windows (8.1) system. JRE
> 1.8
> Reporter: Florian Kaspar
>
> It seems that filtering on the parser-created columnNameOfCorruptRecord
> column and then counting does not recognize any invalid row that was put
> into this special column.
> Filtering for members of the actualSchema works fine and yields correct
> counts.
> Input CSV example:
> {noformat}
> val1, cat1, 1.337
> val2, cat1, 1.337
> val3, cat2, 42.0
> some, invalid, line
> {noformat}
> Code snippet:
> {code:java}
> StructType schema = new StructType(new StructField[] {
>     new StructField("s1", DataTypes.StringType, true,
>         Metadata.empty()),
>     new StructField("s2", DataTypes.StringType, true,
>         Metadata.empty()),
>     new StructField("d1", DataTypes.DoubleType, true,
>         Metadata.empty()),
>     new StructField("FALLBACK", DataTypes.StringType, true,
>         Metadata.empty())});
> Dataset<Row> csv = sqlContext.read()
>     .option("header", "false")
>     .option("parserLib", "univocity")
>     .option("mode", "PERMISSIVE")
>     .option("maxCharsPerColumn", 10000000)
>     .option("ignoreLeadingWhiteSpace", "false")
>     .option("ignoreTrailingWhiteSpace", "false")
>     .option("comment", null)
>     .option("header", "false")
>     .option("columnNameOfCorruptRecord", "FALLBACK")
>     .schema(schema)
>     .csv("path/to/csv/file");
> long validCount = csv.filter("FALLBACK IS NULL").count();
> long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
> {code}
> Expected:
> validCount is 3
> invalidCount is 1
> Actual:
> validCount is 4
> invalidCount is 0
> Caching the Dataset (csv.cache()) after loading works around the problem and
> yields the correct counts.
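For reference, the expected counts can be reproduced outside Spark. The following is only a minimal plain-Java sketch of the behavior the report expects from PERMISSIVE mode, not Spark's actual parser: the class name PermissiveCsvModel and its helper methods are hypothetical, and it assumes a row is invalid when it does not have exactly three columns or when its third column cannot be parsed as a double.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone model of the expected behavior; this is NOT Spark code.
class PermissiveCsvModel {

    /** Returns true when the line fits the schema (s1: string, s2: string, d1: double). */
    static boolean matchesSchema(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 3) {
            return false; // wrong number of columns
        }
        try {
            Double.parseDouble(fields[2].trim());
            return true;
        } catch (NumberFormatException e) {
            return false; // third column is not a double
        }
    }

    /** Models the FALLBACK column: null for a valid row, the raw line otherwise. */
    static String fallback(String line) {
        return matchesSchema(line) ? null : line;
    }

    /** Equivalent of filter("FALLBACK IS NULL").count() in this model. */
    static long countValid(List<String> lines) {
        return lines.stream().filter(l -> fallback(l) == null).count();
    }

    /** Equivalent of filter("FALLBACK IS NOT NULL").count() in this model. */
    static long countInvalid(List<String> lines) {
        return lines.stream().filter(l -> fallback(l) != null).count();
    }

    public static void main(String[] args) {
        // The input CSV example from the report.
        List<String> lines = Arrays.asList(
            "val1, cat1, 1.337",
            "val2, cat1, 1.337",
            "val3, cat2, 42.0",
            "some, invalid, line");
        System.out.println("validCount = " + countValid(lines));     // validCount = 3
        System.out.println("invalidCount = " + countInvalid(lines)); // invalidCount = 1
    }
}
```

With the sample input, only the last line fails the double conversion, which gives the 3/1 split listed under "Expected".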