Ryan Blue created SPARK-15420:
---------------------------------
Summary: Repartition and sort before Parquet writes
Key: SPARK-15420
URL: https://issues.apache.org/jira/browse/SPARK-15420
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.1
Reporter: Ryan Blue
Parquet requires buffering data in memory before writing a group of rows
organized by column. This causes significant memory pressure when writing
partitioned output because each open file must buffer rows.
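To illustrate the memory pressure, here is a toy model in plain Python (not Spark or Parquet code). The function name and the example partition values are hypothetical. It counts how many partition writers, each with its own row buffer, are open at once for a given arrival order of rows.

```python
def peak_open_writers(partition_keys, close_on_key_change):
    """Return the largest number of partition writers open at once.

    partition_keys: the partition value of each incoming row, in arrival order.
    close_on_key_change: if True, the current file is closed whenever the key
    changes, which is only safe when the input is sorted by partition key.
    """
    open_writers = set()
    peak = 0
    previous = None
    for key in partition_keys:
        if close_on_key_change and previous is not None and key != previous:
            open_writers.discard(previous)  # flush and close the finished file
        open_writers.add(key)  # each open writer holds its own row buffer
        peak = max(peak, len(open_writers))
        previous = key
    return peak


# Hypothetical partition values, interleaved as they might arrive unsorted.
rows = ["2016-05-01", "2016-05-02", "2016-05-03"] * 4

unsorted_peak = peak_open_writers(rows, close_on_key_change=False)
sorted_peak = peak_open_writers(sorted(rows), close_on_key_change=True)
```

In this model the unsorted input keeps one writer open per distinct partition value, while the sorted input needs only one writer at a time.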
Currently, Spark sorts data in the {{WriterContainer}}, spilling if necessary,
to avoid keeping many files open at once. This isn't a full solution, for a few
reasons:
* The final sort is always performed, even if incoming data is already sorted
correctly. For example, a global sort will cause two sorts to happen, even if
the global sort correctly prepares the data.
* To prevent a large number of small output files, users must manually
add a repartition step. That step is also ignored by the sort within the writer.
* Hive does not currently support {{DataFrameWriter#sortBy}}.
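The small-files point in the list above can be sketched with another toy model in plain Python (again not Spark code). It assumes each task writes one file per distinct partition value it holds. The helper names are hypothetical, and round-robin key assignment stands in for hash partitioning.

```python
def count_output_files(tasks):
    """Each task writes one file per distinct partition value it holds."""
    return sum(len(set(task)) for task in tasks)


def repartition_by_key(tasks, num_tasks):
    """Toy shuffle: send every row with the same key to the same task."""
    keys = sorted({key for task in tasks for key in task})
    # Round-robin assignment of keys to tasks stands in for hash partitioning.
    owner = {key: i % num_tasks for i, key in enumerate(keys)}
    shuffled = [[] for _ in range(num_tasks)]
    for task in tasks:
        for key in task:
            shuffled[owner[key]].append(key)
    return shuffled


keys = ["a", "b", "c", "d"]
tasks = [list(keys) for _ in range(8)]  # 8 tasks, each holding all 4 keys

files_before = count_output_files(tasks)
files_after = count_output_files(repartition_by_key(tasks, 8))
```

Under these assumptions, writing without repartitioning yields tasks × keys files (32 here), while repartitioning by the partition key first yields one file per key (4 here).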
The sort in {{WriterContainer}} makes sense to prevent problems, but should
detect if the incoming data is already sorted. The {{DataFrameWriter}} should
also expose the ability to repartition data before the write stage, and the
query planner should expose an option to automatically insert repartition
operations.
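As a rough sketch of the "detect if already sorted" idea, the following plain Python checks ordering in a single pass and sorts only when needed. This is an illustration at the row level only. An actual fix would more likely compare the required ordering against ordering information from the query plan than scan the data, and the function names here are hypothetical.

```python
def is_sorted_by(rows, key):
    """Single pass: True if rows are in non-decreasing order of key(row)."""
    previous = None
    first = True
    for row in rows:
        current = key(row)
        if not first and current < previous:
            return False  # found a row that is out of order
        previous, first = current, False
    return True


def rows_for_writer(rows, key):
    """Return rows ordered by key, skipping the sort when it is redundant."""
    rows = list(rows)
    return rows if is_sorted_by(rows, key) else sorted(rows, key=key)


# Hypothetical (partition value, payload) rows.
already_sorted = [("a", 1), ("a", 2), ("b", 0)]
out_of_order = [("b", 0), ("a", 1)]
partition_key = lambda row: row[0]

prepared = rows_for_writer(out_of_order, partition_key)
```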