[
https://issues.apache.org/jira/browse/SPARK-39605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562330#comment-17562330
]
Hyukjin Kwon commented on SPARK-39605:
--------------------------------------
The exception comes from MongoDB, so I suspect this is a problem in that connector.
It would be great to pinpoint where and what the issue is on the Apache Spark side.
> PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4
> LTS
> --------------------------------------------------------------------------------
>
> Key: SPARK-39605
> URL: https://issues.apache.org/jira/browse/SPARK-39605
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.1
> Reporter: Manoj Chandrashekar
> Priority: Major
> Attachments: image-2022-06-27-11-00-50-119.png
>
>
> I have a job that infers the schema from MongoDB and performs operations such
> as flattening and unwinding, because there are nested fields. After the
> various transformations, when count() is finally performed, it works
> perfectly fine in Databricks Runtime 7.3 LTS but fails in 10.4 LTS.
> *Below is the image that shows successful run in 7.3 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=630,height=75!
> *Below is the image that shows failure in 10.4 LTS:*
> !image-2022-06-27-11-00-50-119.png|width=624,height=64!
> And I have validated that no field in our schema has NullType. In fact, when
> the schema was inferred, there were null- and void-type fields, which were
> converted to string using my UDF. This issue persists even when I infer the
> schema on the complete dataset, that is, when samplePoolSize covers the full
> data set.
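The null/void-to-string conversion the reporter describes could be sketched as
follows. This is an assumption, since the original UDF is not shown: it
rewrites the JSON form of an inferred Spark schema (obtained via the real
PySpark APIs `StructType.jsonValue()` / `StructType.fromJson`) so that
NullType fields become StringType before the data is re-read. The helper name
`replace_null_types` is hypothetical. It runs as plain Python, with no Spark
session required:

```python
# Hypothetical sketch: replace NullType ("null"/"void") entries in the JSON
# representation of a Spark schema with "string", recursing through nested
# structs, arrays, and maps. Field names are left untouched -- only the
# type-bearing keys are rewritten.

def replace_null_types(node):
    """Return a copy of a schema JSON node with null/void types as string."""
    if isinstance(node, str):
        # Leaf type name, e.g. "string", "long", or "null"/"void".
        return "string" if node in ("null", "void") else node
    if isinstance(node, dict):
        out = dict(node)
        # Recurse only into the keys that carry type information, so a
        # field literally named "null" is not rewritten.
        for key in ("type", "elementType", "keyType", "valueType", "fields"):
            if key in out:
                out[key] = replace_null_types(out[key])
        return out
    if isinstance(node, list):
        return [replace_null_types(v) for v in node]
    return node

# Small demo schema in Spark's JSON schema layout (names are illustrative).
demo = {
    "type": "struct",
    "fields": [
        {"name": "a", "type": "null", "nullable": True, "metadata": {}},
        {"name": "b",
         "type": {"type": "array", "elementType": "void",
                  "containsNull": True},
         "nullable": True, "metadata": {}},
    ],
}
cleaned = replace_null_types(demo)
```

With a live session one would then apply it as, roughly,
`StructType.fromJson(replace_null_types(inferred.jsonValue()))` and pass the
result to the reader's `.schema(...)` option.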
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]