[
https://issues.apache.org/jira/browse/SPARK-23899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620230#comment-16620230
]
Georg Heiler commented on SPARK-23899:
--------------------------------------
What about repartitioning by complex types, e.g. by the size of an array?
[https://stackoverflow.com/questions/46240688/how-to-equally-partition-array-data-in-spark-dataframe]
Assume the number of records n in a data frame is almost constant, but the
m observations in each record's array define the real computational
complexity. A regular repartition then only ensures roughly equal numbers of
records per partition and does not consider the size of the array.
Ideally, I would want to make sure that arrays with many elements in
particular do not end up in the same partition, in order to prevent data skew.
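To make the request concrete, here is a minimal sketch in plain Python (no
Spark involved) of one possible strategy: assign the largest arrays first,
each to the partition that currently holds the fewest array elements. The
names assign_partitions and partition_loads and the sample sizes are
hypothetical, chosen only for illustration. This is not an existing Spark
API, and how such an assignment would be wired into a repartition call is
left open.

```python
import heapq


def assign_partitions(array_sizes, num_partitions):
    """Greedy balancing: largest arrays first, each to the lightest partition.

    Hypothetical helper for illustration, not part of Spark.
    Returns a list giving the partition id chosen for each record.
    """
    # Min-heap of (total elements assigned so far, partition id).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = [None] * len(array_sizes)
    # Visit records in order of decreasing array size.
    order = sorted(range(len(array_sizes)),
                   key=lambda i: array_sizes[i], reverse=True)
    for i in order:
        load, p = heapq.heappop(heap)  # currently lightest partition
        assignment[i] = p
        heapq.heappush(heap, (load + array_sizes[i], p))
    return assignment


def partition_loads(array_sizes, assignment, num_partitions):
    """Total number of array elements that ends up in each partition."""
    loads = [0] * num_partitions
    for size, p in zip(array_sizes, assignment):
        loads[p] += size
    return loads


# Made-up example: two very large arrays and several small ones.
sizes = [1000, 900, 10, 10, 10, 10, 5, 5]
assignment = assign_partitions(sizes, 2)
loads = partition_loads(sizes, assignment, 2)
```

With these sample sizes the two large arrays land in different partitions,
whereas balancing on record count alone gives no such guarantee.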
> Built-in SQL Function Improvement
> ---------------------------------
>
> Key: SPARK-23899
> URL: https://issues.apache.org/jira/browse/SPARK-23899
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Xiao Li
> Priority: Major
> Fix For: 2.4.0
>
>
> This umbrella JIRA is to improve compatibility with the other data processing
> systems, including Hive, Teradata, Presto, Postgres, MySQL, DB2, Oracle, and
> MS SQL Server.