[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

anju (Jira) Fri, 23 Jul 2021 13:45:06 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


anju updated SPARK-36277:
-------------------------
    Description: 
While reading the dataframe in malformed mode ,I am not getting right record 
count. dataframe.count() is giving me the record count of actual file including 
malformed records, eventhough data frame is read in "dropmalformed" mode. Is 
there a way to overcome this in pyspark
 here is the high level overview of what i am doing I am trying to read the two 
dataframes from one file using with/without predefined schema. Issue is when i 
read a DF with a predefined schema and with mode as "dropmalformed", the record 
count in  df is not dropping the records. The record count is same as actual 
file where i am expecting less record count,as there are few malformed records 
. But when i try to select and display the records in df ,it is not showing 
malformed records. So display is correct. output is attached in the aattchment

code 
  
 {{s3_obj =boto3.client('s3')
 s3_clientobj = s3_obj.get_object(Bucket='xyz', 
Key='data/test_files/schema_xyz.json')
 s3_clientobj
 s3_clientdata = 
s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
 schemaFromJson =StructType.fromJson(json.loads(s3_clientdata))

extract_with_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",schema=schemaFromJson,mode="DROPMALFORMED")

extract_without_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",mode="permissive")

extract_with_schema_df.select("col1","col2").show()
 cnt1=extract_with_schema_df.select("col1","col2").count()print("count of the 
records with schema "+ str(cnt1))
 cnt2=extract_without_schema_df.select("col1","col2").count()print("count of 
the records without schema "+str(cnt2))

cnt2=extract_without_schema_df.select("col1","col2").show()}}

 

  was:
 I am trying to read the two dataframes from one file using with/without 
predefined schema. Issue is when i read a DF with a predefined schema and with 
mode as "dropmalformed", the record count in  df is not dropping the records. 
The record count is same as actual file where i am expecting less record 
count,as there are few malformed records . But when i try to select and display 
the records in df ,it is not showing malformed records. So display is correct. 
output is attached in the aattchement

code 
 
{{s3_obj =boto3.client('s3')
s3_clientobj = s3_obj.get_object(Bucket='xyz', 
Key='data/test_files/schema_xyz.json')
s3_clientobj
s3_clientdata = 
s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
schemaFromJson =StructType.fromJson(json.loads(s3_clientdata))

extract_with_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",schema=schemaFromJson,mode="DROPMALFORMED")
       
extract_without_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",mode="permissive")

extract_with_schema_df.select("col1","col2").show()
cnt1=extract_with_schema_df.select("col1","col2").count()print("count of the 
records with schema "+ str(cnt1))
cnt2=extract_without_schema_df.select("col1","col2").count()print("count of the 
records without schema "+str(cnt2))

cnt2=extract_without_schema_df.select("col1","col2").show()}}

 


> Issue with record count of data frame while reading in DropMalformed mode
> -------------------------------------------------------------------------
>
>                 Key: SPARK-36277
>                 URL: https://issues.apache.org/jira/browse/SPARK-36277
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: anju
>            Priority: Major
>         Attachments: 111.PNG
>
>
> While reading the dataframe in malformed mode ,I am not getting right record 
> count. dataframe.count() is giving me the record count of actual file 
> including malformed records, eventhough data frame is read in "dropmalformed" 
> mode. Is there a way to overcome this in pyspark
>  here is the high level overview of what i am doing I am trying to read the 
> two dataframes from one file using with/without predefined schema. Issue is 
> when i read a DF with a predefined schema and with mode as "dropmalformed", 
> the record count in  df is not dropping the records. The record count is same 
> as actual file where i am expecting less record count,as there are few 
> malformed records . But when i try to select and display the records in df 
> ,it is not showing malformed records. So display is correct. output is 
> attached in the aattchment
> code 
>   
>  {{s3_obj =boto3.client('s3')
>  s3_clientobj = s3_obj.get_object(Bucket='xyz', 
> Key='data/test_files/schema_xyz.json')
>  s3_clientobj
>  s3_clientdata = 
> s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
>  schemaFromJson =StructType.fromJson(json.loads(s3_clientdata))
> extract_with_schema_df = 
> spark.read.csv("s3:few_columns.csv",header=True,sep=",",schema=schemaFromJson,mode="DROPMALFORMED")
> extract_without_schema_df = 
> spark.read.csv("s3:few_columns.csv",header=True,sep=",",mode="permissive")
> extract_with_schema_df.select("col1","col2").show()
>  cnt1=extract_with_schema_df.select("col1","col2").count()print("count of the 
> records with schema "+ str(cnt1))
>  cnt2=extract_without_schema_df.select("col1","col2").count()print("count of 
> the records without schema "+str(cnt2))
> cnt2=extract_without_schema_df.select("col1","col2").show()}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

Reply via email to