[
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
anju updated SPARK-36277:
-------------------------
Description:
When reading a DataFrame in "dropmalformed" mode, I am not getting the right
record count. dataframe.count() returns the record count of the actual file,
including malformed records, even though the DataFrame is read in
"dropmalformed" mode. Is there a way to overcome this in PySpark?
Here is a high-level overview of what I am doing. I am trying to read two
DataFrames from one file, one with a predefined schema and one without. The
issue is that when I read a DataFrame with a predefined schema and with mode
"dropmalformed", the record count does not exclude the malformed records. The
count is the same as in the actual file, whereas I expect a lower count because
the file contains a few malformed records. But when I select and display the
records in the DataFrame, the malformed records are not shown, so the display
is correct. The output is attached in the attachment.
Code:
{{import json
import boto3
from pyspark.sql.types import StructType

# Load the schema definition (JSON) from S3.
s3_obj = boto3.client('s3')
s3_clientobj = s3_obj.get_object(Bucket='xyz',
                                 Key='data/test_files/schema_xyz.json')
s3_clientobj
s3_clientdata = s3_clientobj['Body'].read().decode('utf-8')
# print(s3_clientdata)
schemaSource = json.loads(s3_clientdata)
schemaFromJson = StructType.fromJson(json.loads(s3_clientdata))

# Read the same file with the schema (DROPMALFORMED) and without it (permissive).
# `spark` is the already-running SparkSession.
extract_with_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",",
                                        schema=schemaFromJson, mode="DROPMALFORMED")
extract_without_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",",
                                           mode="permissive")

extract_with_schema_df.select("col1", "col2").show()
cnt1 = extract_with_schema_df.select("col1", "col2").count()
print("count of the records with schema " + str(cnt1))
cnt2 = extract_without_schema_df.select("col1", "col2").count()
print("count of the records without schema " + str(cnt2))
extract_without_schema_df.select("col1", "col2").show()}}
was:
I am trying to read two DataFrames from one file, one with a predefined schema
and one without. The issue is that when I read a DataFrame with a predefined
schema and with mode "dropmalformed", the record count does not exclude the
malformed records. The count is the same as in the actual file, whereas I
expect a lower count because the file contains a few malformed records. But
when I select and display the records in the DataFrame, the malformed records
are not shown, so the display is correct. The output is attached in the
attachment.
Code:
{{import json
import boto3
from pyspark.sql.types import StructType

# Load the schema definition (JSON) from S3.
s3_obj = boto3.client('s3')
s3_clientobj = s3_obj.get_object(Bucket='xyz',
                                 Key='data/test_files/schema_xyz.json')
s3_clientobj
s3_clientdata = s3_clientobj['Body'].read().decode('utf-8')
# print(s3_clientdata)
schemaSource = json.loads(s3_clientdata)
schemaFromJson = StructType.fromJson(json.loads(s3_clientdata))

# Read the same file with the schema (DROPMALFORMED) and without it (permissive).
# `spark` is the already-running SparkSession.
extract_with_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",",
                                        schema=schemaFromJson, mode="DROPMALFORMED")
extract_without_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",",
                                           mode="permissive")

extract_with_schema_df.select("col1", "col2").show()
cnt1 = extract_with_schema_df.select("col1", "col2").count()
print("count of the records with schema " + str(cnt1))
cnt2 = extract_without_schema_df.select("col1", "col2").count()
print("count of the records without schema " + str(cnt2))
extract_without_schema_df.select("col1", "col2").show()}}
> Issue with record count of data frame while reading in DropMalformed mode
> -------------------------------------------------------------------------
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: anju
> Priority: Major
> Attachments: 111.PNG
>
>
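One possible explanation for the behaviour reported above, offered as a hedged
sketch rather than a confirmed diagnosis: the Spark 2.4 migration guide states
that a CSV row is treated as malformed only when the column values a query
actually requests are malformed. A bare count() needs no column values, so it
may check nothing, while show() on selected columns does check them. The toy
model below is plain Python and is NOT Spark's implementation; the schema, the
rows, and the function names are all hypothetical and exist only to illustrate
that reasoning.

```python
# Toy model (not Spark code): a row is dropped only when one of the columns
# the query actually needs fails to convert to its declared type.

SCHEMA = {"col1": int, "col2": int}      # hypothetical schema
ROWS = [                                  # hypothetical file contents
    {"col1": "1", "col2": "10"},
    {"col1": "oops", "col2": "20"},       # malformed: col1 is not an int
    {"col1": "3", "col2": "30"},
]


def is_malformed(row, required):
    """Return True if any required column is missing or fails to convert."""
    for name in required:
        try:
            SCHEMA[name](row[name])
        except (KeyError, ValueError):
            return True
    return False


def read_drop_malformed(rows, required):
    """Keep only the rows whose required columns convert cleanly."""
    return [row for row in rows if not is_malformed(row, required)]


# Displaying col1 and col2 needs both columns, so the bad row is dropped.
shown = read_drop_malformed(ROWS, required=["col1", "col2"])

# A bare count needs no column values, so no row is checked or dropped.
counted = len(read_drop_malformed(ROWS, required=[]))

print(len(shown), counted)
```

In this model the displayed rows exclude the bad record while the count still
includes it, which matches the symptom in the report. The same migration guide
says that setting spark.sql.csv.parser.columnPruning.enabled to false restores
the pre-2.4 behaviour; whether that resolves this particular case on 2.4.3 is
an assumption that would need to be verified against the actual file.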