[ 
https://issues.apache.org/jira/browse/SPARK-23899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620230#comment-16620230
 ] 

Georg Heiler commented on SPARK-23899:
--------------------------------------

What about repartitioning by a property of a complex type, e.g. the size of an array column?
[https://stackoverflow.com/questions/46240688/how-to-equally-partition-array-data-in-spark-dataframe]
 

Assume the number of records n in a data frame is almost constant, but the
number of observations m (array elements) per record determines the real
computational cost. A regular repartition then only ensures roughly equal
numbers of records per partition; it does not consider the size of the arrays.

 

Ideally, I would want to ensure that arrays with many elements, in particular,
do not end up in the same partition, in order to prevent data skew.
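
To make the request concrete, here is a minimal sketch in plain Python (not a
Spark API) of one possible way to balance partitions by array size rather than
by record count. It uses a greedy largest-first assignment, which is my own
choice for illustration; the function names `assign_partitions` and
`partition_loads` and the sample sizes are hypothetical.

```python
import heapq


def assign_partitions(array_sizes, num_partitions):
    """Greedy largest-first assignment of records to partitions.

    array_sizes[i] is the number of elements in record i's array.
    Returns a list where entry i is the partition index chosen for record i.
    This is an illustrative heuristic, not how Spark repartitions data.
    """
    # Min-heap of (total elements assigned so far, partition index).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = [None] * len(array_sizes)
    # Visit records from largest to smallest array so that the big arrays
    # are spread out before the small ones fill the gaps.
    for i in sorted(range(len(array_sizes)), key=lambda i: -array_sizes[i]):
        load, p = heapq.heappop(heap)  # currently lightest partition
        assignment[i] = p
        heapq.heappush(heap, (load + array_sizes[i], p))
    return assignment


def partition_loads(array_sizes, assignment, num_partitions):
    """Total number of array elements that land in each partition."""
    loads = [0] * num_partitions
    for size, p in zip(array_sizes, assignment):
        loads[p] += size
    return loads


# Hypothetical data: two very large arrays and four tiny ones.
sizes = [1000, 900, 5, 5, 5, 5]
assignment = assign_partitions(sizes, 2)
loads = partition_loads(sizes, assignment, 2)
```

With these sample sizes the two large arrays are placed in different
partitions, which is the behaviour asked for above. In Spark, an assignment
like this would still have to be turned into a partitioning key or a custom
partitioner, which this sketch does not attempt.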

> Built-in SQL Function Improvement
> ---------------------------------
>
>                 Key: SPARK-23899
>                 URL: https://issues.apache.org/jira/browse/SPARK-23899
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Xiao Li
>            Priority: Major
>             Fix For: 2.4.0
>
>
> This umbrella JIRA is to improve compatibility with the other data processing 
> systems, including Hive, Teradata, Presto, Postgres, MySQL, DB2, Oracle, and 
> MS SQL Server.



