Andrew Gross created SPARK-21768:
------------------------------------
Summary: spark.csv.read Empty String Parsed as NULL when nullValue
is Set
Key: SPARK-21768
URL: https://issues.apache.org/jira/browse/SPARK-21768
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 2.2.0, 2.0.2
Environment: AWS EMR Spark 2.2.0 (also Spark 2.0.2)
PySpark
Reporter: Andrew Gross
In a CSV file with quoted fields, an empty string ("") is read as NULL even
when nullValue is explicitly set to a different string.
Example CSV with quoted fields, delimiter | and nullValue XXNULLXX:
{{"XXNULLXX"|""|"XXNULLXX"|"foo"}}
PySpark Script to load the file (from S3):
{code:title=load.py|borderStyle=solid}
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("test_csv").getOrCreate()

# All four columns are nullable strings.
schema = StructType([
    StructField("First Null Field", StringType(), True),
    StructField("Empty String Field", StringType(), True),
    StructField("Second Null Field", StringType(), True),
    StructField("Non Empty String Field", StringType(), True),
])
keys = ['s3://mybucket/test/demo.csv']

# Only the literal XXNULLXX is expected to be read as NULL.
bad_data = spark.read.csv(keys, timestampFormat="yyyy-MM-dd HH:mm:ss",
                          mode="FAILFAST", sep="|", nullValue="XXNULLXX",
                          schema=schema)
bad_data.show()
{code}
Actual Output:
{noformat}
+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|              null|             null|                   foo|
+----------------+------------------+-----------------+----------------------+
{noformat}
Expected Output:
{noformat}
+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|                  |             null|                   foo|
+----------------+------------------+-----------------+----------------------+
{noformat}
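For comparison, the expected semantics can be sketched outside Spark with Python's standard csv module. This only illustrates the result the report asks for and is not Spark's parser; the names parse_line and NULL_TOKEN are hypothetical and introduced here for illustration.

```python
import csv
import io

# The nullValue token from the report; the constant name is hypothetical.
NULL_TOKEN = "XXNULLXX"


def parse_line(text, sep="|", null_token=NULL_TOKEN):
    """Parse quoted CSV text, mapping only null_token to None.

    An empty quoted field ("") stays an empty string.
    """
    reader = csv.reader(io.StringIO(text), delimiter=sep, quotechar='"')
    return [
        [None if field == null_token else field for field in row]
        for row in reader
    ]


# Same sample row as in the report.
sample = '"XXNULLXX"|""|"XXNULLXX"|"foo"\n'
rows = parse_line(sample)
print(rows)  # [[None, '', None, 'foo']]
```

Here the second field comes back as an empty string rather than None, which matches the Expected Output table above.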