Ryan Blue created SPARK-15420:
---------------------------------

             Summary: Repartition and sort before Parquet writes
                 Key: SPARK-15420
                 URL: https://issues.apache.org/jira/browse/SPARK-15420
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1
            Reporter: Ryan Blue


Parquet requires buffering data in memory before writing a group of rows 
organized by column. This causes significant memory pressure when writing 
partitioned output because each open file must buffer rows.
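
As a rough illustration of that pressure, here is a toy model in plain Python (not Spark or Parquet code; {{peak_open_writers}} and the sample date keys are made up for this sketch). It counts how many per-partition writers, each of which would hold its own row buffer, are open at once when the input is unsorted versus sorted by partition key.

```python
def peak_open_writers(partition_keys, close_on_key_change):
    """Return the peak number of simultaneously open per-partition writers.

    Toy model: one writer (and one row buffer) per partition key. If
    close_on_key_change is True, the previous writer is closed as soon as
    the key changes, which is only safe when the input is sorted by key.
    """
    open_writers = set()
    peak = 0
    last_key = object()  # sentinel that never equals a real key
    for key in partition_keys:
        if close_on_key_change and key != last_key:
            open_writers.discard(last_key)
        open_writers.add(key)
        peak = max(peak, len(open_writers))
        last_key = key
    return peak


# Hypothetical partition keys for 12 incoming rows, interleaved by date.
rows = ["2016-05-01", "2016-05-02", "2016-05-03"] * 4

unsorted_peak = peak_open_writers(rows, close_on_key_change=False)
sorted_peak = peak_open_writers(sorted(rows), close_on_key_change=True)
```

In this model the unsorted input keeps one writer open per distinct key, while the sorted input never needs more than one.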

Currently, Spark sorts data, spilling if necessary, in the 
{{WriterContainer}} to avoid keeping many files open at once. However, this 
is not a complete solution, for a few reasons:
* The final sort is always performed, even if incoming data is already sorted 
correctly. For example, a global sort will cause two sorts to happen, even if 
the global sort correctly prepares the data.
* To prevent a large number of small output files, users must manually 
add a repartition step. That step is also ignored by the sort within the writer.
* Hive does not currently support {{DataFrameWriter#sortBy}}.
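
To make the small-files point concrete, here is another plain-Python toy model (the helper names {{count_output_files}} and {{repartition_by_key}} are hypothetical, and this is not how Spark shuffles data). It assumes each task writes one file per distinct partition key it sees, and compares the file count before and after rows are redistributed by key.

```python
def count_output_files(tasks):
    """Each task writes one file per distinct partition key it contains."""
    return sum(len(set(task)) for task in tasks)


def repartition_by_key(tasks, num_tasks):
    """Send every row with the same key to the same task (toy hash shuffle)."""
    shuffled = [[] for _ in range(num_tasks)]
    for task in tasks:
        for key in task:
            shuffled[hash(key) % num_tasks].append(key)
    return shuffled


# Hypothetical input: 4 tasks, each holding rows for the same 3 dates.
tasks = [["2016-05-01", "2016-05-02", "2016-05-03"] for _ in range(4)]

files_before = count_output_files(tasks)
files_after = count_output_files(repartition_by_key(tasks, num_tasks=4))
```

Without the repartition step every task writes a file for every key; after it, each key is written by exactly one task.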

The sort in {{WriterContainer}} makes sense as a safeguard, but it should 
detect whether the incoming data is already sorted. The {{DataFrameWriter}} 
should also expose the ability to repartition data before the write stage, and 
the query planner should expose an option to automatically insert repartition 
operations.
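
A minimal sketch of the "skip the sort when the data is already sorted" idea, again in plain Python with hypothetical names ({{is_sorted_by}}, {{prepare_for_write}}). It scans the rows to keep the example self-contained; an actual fix in Spark would presumably rely on ordering information from the query plan rather than inspecting the data.

```python
from operator import itemgetter


def is_sorted_by(rows, key):
    """Return True if rows are already in non-decreasing order of key(row)."""
    keys = [key(row) for row in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def prepare_for_write(rows, key):
    """Sort rows by key only when they are not already sorted."""
    if is_sorted_by(rows, key):
        return rows  # already sorted: no second sort
    return sorted(rows, key=key)


partition_key = itemgetter(0)  # hypothetical rows: (partition value, payload)

already_sorted = [("a", 1), ("a", 2), ("b", 3)]
shuffled = [("b", 3), ("a", 1), ("a", 2)]

untouched = prepare_for_write(already_sorted, partition_key)
resorted = prepare_for_write(shuffled, partition_key)
```

Here the sorted input is returned as-is, and only the shuffled input pays for a sort.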



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)