[
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Armbrust updated SPARK-15420:
-------------------------------------
Target Version/s: 2.3.0 (was: 2.2.0)
> Repartition and sort before Parquet writes
> ------------------------------------------
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1
> Reporter: Ryan Blue
>
> Parquet requires buffering data in memory before writing a group of rows
> organized by column. This causes significant memory pressure when writing
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted
> correctly. For example, a global sort will cause two sorts to happen, even if
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually
> add a repartition step. That step is also ignored by the sort within the
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}.
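The memory-pressure argument above can be sketched as a toy model (plain Python, not Spark code; the function and names are illustrative only). With unsorted input, a one-pass writer must keep a buffer open for every partition value it has seen; with input sorted by the partition key, it can close each file as soon as the key changes, so at most one stays open:

```python
def peak_open_files(rows, key, input_sorted):
    """Peak number of simultaneously-open files for a one-pass writer."""
    open_now, peak, prev = set(), 0, object()  # sentinel matches no key
    for row in rows:
        k = key(row)
        if input_sorted and k != prev and prev in open_now:
            open_now.discard(prev)  # key changed: safe to close previous file
        open_now.add(k)
        peak = max(peak, len(open_now))
        prev = k
    return peak

rows = [("2016-05-01", 1), ("2016-05-02", 2), ("2016-05-03", 3),
        ("2016-05-01", 4), ("2016-05-02", 5)]
date = lambda r: r[0]

peak_open_files(rows, date, input_sorted=False)                   # 3 open at once
peak_open_files(sorted(rows, key=date), date, input_sorted=True)  # only 1
```

With many partition values, each open Parquet file buffers a full row group in memory, which is why the sorted path matters.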
> The sort in {{WriterContainer}} makes sense to prevent problems, but should
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should
> also expose the ability to repartition data before the write stage, and the
> query planner should expose an option to automatically insert repartition
> operations.
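The skip-if-already-sorted idea could look roughly like the following toy (again plain Python, not Spark's API; in Spark the planner would track the child's output ordering as a plan property rather than scanning rows, and `write_partitioned` is a hypothetical name):

```python
def write_partitioned(rows, key):
    """Sort rows by partition key only if they are not already ordered."""
    already_sorted = all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))
    if not already_sorted:
        rows = sorted(rows, key=key)  # today, this sort always runs
    return rows  # hand ordered rows to the per-partition file writer

# Unsorted input still gets sorted; input that a global sort already
# ordered passes through without a redundant second sort.
write_partitioned([("b", 2), ("a", 1)], key=lambda r: r[0])
```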
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]