[
https://issues.apache.org/jira/browse/SPARK-39605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562330#comment-17562330
]
Hyukjin Kwon commented on SPARK-39605:
--------------------------------------
The exception comes from MongoDB, so I suspect this is a problem in that connector.
It would be great to pinpoint where and what the issue is on the Apache Spark side.
> PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4
> LTS
> --------------------------------------------------------------------------------
>
> Key: SPARK-39605
> URL: https://issues.apache.org/jira/browse/SPARK-39605
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.1
> Reporter: Manoj Chandrashekar
> Priority: Major
> Attachments: image-2022-06-27-11-00-50-119.png
>
>
> I have a job that infers the schema from MongoDB and performs operations such
> as flattening and unwinding, because there are nested fields. After the
> various transformations, when count() is finally performed, it works
> perfectly fine in Databricks Runtime 7.3 LTS but fails in 10.4 LTS.
> *Below is the image that shows successful run in 7.3 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=630,height=75!
> *Below is the image that shows failure in 10.4 LTS:*
> !image-2022-06-27-11-00-50-119.png|width=624,height=64!
> And I have validated that no field in our schema has NullType. In fact, when
> the schema was inferred, there were null- and void-type fields, which were
> converted to string using my UDF. This issue persists even when I infer the
> schema on the complete dataset, that is, when samplePoolSize covers the full
> data set.
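The null/void-to-string conversion the reporter describes could be sketched as
follows. This is an assumption, since the original UDF is not shown: it
rewrites the JSON form of an inferred Spark schema (obtained via the real
PySpark APIs `StructType.jsonValue()` / `StructType.fromJson`) so that
NullType fields become StringType before the data is re-read. The helper name
`replace_null_types` is hypothetical. It runs as plain Python, with no Spark
session required:

```python
# Hypothetical sketch: replace NullType ("null"/"void") entries in the JSON
# representation of a Spark schema with "string", recursing through nested
# structs, arrays, and maps. Field names are left untouched -- only the
# type-bearing keys are rewritten.

def replace_null_types(node):
    """Return a copy of a schema JSON node with null/void types as string."""
    if isinstance(node, str):
        # Leaf type name, e.g. "string", "long", or "null"/"void".
        return "string" if node in ("null", "void") else node
    if isinstance(node, dict):
        out = dict(node)
        # Recurse only into the keys that carry type information, so a
        # field literally named "null" is not rewritten.
        for key in ("type", "elementType", "keyType", "valueType", "fields"):
            if key in out:
                out[key] = replace_null_types(out[key])
        return out
    if isinstance(node, list):
        return [replace_null_types(v) for v in node]
    return node

# Small demo schema in Spark's JSON schema layout (names are illustrative).
demo = {
    "type": "struct",
    "fields": [
        {"name": "a", "type": "null", "nullable": True, "metadata": {}},
        {"name": "b",
         "type": {"type": "array", "elementType": "void",
                  "containsNull": True},
         "nullable": True, "metadata": {}},
    ],
}
cleaned = replace_null_types(demo)
```

With a live session one would then apply it as, roughly,
`StructType.fromJson(replace_null_types(inferred.jsonValue()))` and pass the
result to the reader's `.schema(...)` option.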
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]