Fei Wang created SPARK-8968:
-------------------------------
Summary: Shuffle by the partition columns when dynamic
partitioning to reduce the memory overhead
Key: SPARK-8968
URL: https://issues.apache.org/jira/browse/SPARK-8968
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.4.0
Reporter: Fei Wang
Dynamic partitioning currently performs poorly on large data sets because of
GC/memory overhead. Each task opens a separate writer for every partition it
encounters, which produces many small files and high GC pressure. We can
shuffle the data by the partition columns so that each partition is written as
only one file, which also reduces the GC overhead.
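
The effect described above can be illustrated with a small, Spark-free
simulation. This is only a sketch under stated assumptions, not the proposed
patch: the column name `dt`, the helper names, and the hash-based routing are
all hypothetical and stand in for whatever the real implementation would use.
It counts how many writers (and therefore output files) would be opened before
and after rows are shuffled by the partition column.

```python
def count_open_writers(tasks, key="dt"):
    """Count one writer per (task, distinct partition value) pair."""
    return sum(len({row[key] for row in rows}) for rows in tasks)


def shuffle_by_partition_column(tasks, num_tasks, key="dt"):
    """Route every row with the same partition value to the same task.

    Hash routing is an assumption made for this sketch.
    """
    shuffled = [[] for _ in range(num_tasks)]
    for rows in tasks:
        for row in rows:
            shuffled[hash(row[key]) % num_tasks].append(row)
    return shuffled


# Hypothetical input: 4 tasks, each holding rows for the same 3 partition values.
dates = ["2015-07-01", "2015-07-02", "2015-07-03"]
tasks = [[{"dt": d, "value": i} for d in dates] for i in range(4)]

# Without a shuffle, every task opens a writer for every partition it sees.
writers_before = count_open_writers(tasks)

# After shuffling by the partition column, each partition value lives in
# exactly one task, so it is written by exactly one writer.
writers_after = count_open_writers(shuffle_by_partition_column(tasks, 4))

print(writers_before, writers_after)
```

In this toy setup the writer count drops from tasks times partitions to the
number of distinct partition values, at the cost of one extra shuffle.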