[ https://issues.apache.org/jira/browse/SPARK-23899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620230#comment-16620230 ]
Georg Heiler commented on SPARK-23899:
--------------------------------------

What about repartitioning by complex types, e.g. by the size of an array?
[https://stackoverflow.com/questions/46240688/how-to-equally-partition-array-data-in-spark-dataframe]

Assume the number of records n in a data frame is almost constant, but the number of observations m in each record's array determines the real computational complexity. A regular repartition then only ensures roughly equal numbers of records per partition and does not consider the size of the arrays. Ideally, I would want to make sure that arrays with many elements in particular do not end up in the same partition, in order to prevent data skew.

> Built-in SQL Function Improvement
> ---------------------------------
>
> Key: SPARK-23899
> URL: https://issues.apache.org/jira/browse/SPARK-23899
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Xiao Li
> Priority: Major
> Fix For: 2.4.0
>
>
> This umbrella JIRA is to improve compatibility with the other data processing
> systems, including Hive, Teradata, Presto, Postgres, MySQL, DB2, Oracle, and
> MS SQL Server.
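
To make the balancing concern in the comment concrete, here is a minimal plain-Python sketch (not Spark code and not an existing Spark API) of a greedy "largest first" assignment: each record is weighted by its array length, and the heaviest records are placed one at a time on whichever partition currently carries the least total weight. The names `assign_partitions` and `partition_loads` are hypothetical and exist only for this illustration.

```python
import heapq


def assign_partitions(sizes, num_partitions):
    """Return a partition id for each record, given each record's array size.

    Hypothetical helper for illustration: records are visited from largest
    to smallest and each goes to the currently lightest partition.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")

    # Min-heap of (current total array elements, partition id).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)

    assignment = [0] * len(sizes)
    # Visit record indices in order of decreasing array size.
    for idx in sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True):
        load, partition = heapq.heappop(heap)
        assignment[idx] = partition
        heapq.heappush(heap, (load + sizes[idx], partition))
    return assignment


def partition_loads(sizes, assignment, num_partitions):
    """Sum the array sizes that ended up in each partition."""
    loads = [0] * num_partitions
    for size, partition in zip(sizes, assignment):
        loads[partition] += size
    return loads


# Two very large arrays and several small ones, split over two partitions.
sizes = [100, 90, 5, 5, 3, 2]
assignment = assign_partitions(sizes, 2)
loads = partition_loads(sizes, assignment, 2)
```

With these sample sizes the two large arrays land in different partitions and the per-partition element totals are close to equal. In Spark, the per-record weight could presumably be taken from the built-in `size` function on the array column; how such an assignment would be wired into a repartition is left open here.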