Tarique Anwer created SPARK-40584:
-------------------------------------
Summary: Incorrect Count when reading CSV file
Key: SPARK-40584
URL: https://issues.apache.org/jira/browse/SPARK-40584
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.1.2
Reporter: Tarique Anwer
I'm trying to read the data below from a CSV file and I end up with a wrong
count, even though the dataframe contains all four records.
df_inputfile.count() prints 3 although it should print 4.
{code:java}
B1123451020-502,"","{""m"": {""difference"": 60}}","","","",2022-02-12T15:40:00.783Z
B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
{code}
Here's the code:
{code:python}
df_inputfile = (spark.read.format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "false")
                .option("quote", '"')     # note: "quotedstring" in the original is not a
                                          # recognized CSV option; the standard key is "quote"
                .option("escape", '"')
                .option("multiLine", "true")
                .option("delimiter", ",")
                .load('<path to csv>'))
print(df_inputfile.count())             # Prints 3
print(df_inputfile.distinct().count())  # Prints 4 {code}
Adding a cache() call before the count produces the correct result. Removing
the escape option also gives the correct count:
{noformat}
option("escape", '"'){noformat}
It looks like this is happening because of the single comma in the 4th column
of the 3rd row. Can someone please explain what's going on?
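For reference, a plain RFC 4180-style parser reads the same four lines as four records, which suggests the lone-comma field itself is valid CSV and the miscount comes from how the escape='"' setting interacts with multiline parsing. This is a minimal sketch using Python's stdlib csv module (which, like RFC 4180, treats a doubled quote "" inside a quoted field as an escaped quote):

```python
import csv
import io

# The four records from the report, one per line; "" inside quoted
# fields is a doubled (escaped) quote per RFC 4180.
data = (
    'B1123451020-502,"","{""m"": {""difference"": 60}}","","","",2022-02-12T15:40:00.783Z\n'
    'B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z\n'
    'B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z\n'
    'B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z\n'
)

rows = list(csv.reader(io.StringIO(data)))
print(len(rows))   # 4 -- all four records parse
print(rows[2][3])  # the lone-comma field of the 3rd row parses as ","
```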
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]