[
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679204#comment-15679204
]
Ran Haim edited comment on SPARK-17436 at 11/19/16 12:44 PM:
-------------------------------------------------------------
Hi,
When you want to write your data to ORC or Parquet files, even if the
dataframe is already partitioned correctly, you have to tell the writer how
to partition the data.
This means that when you want to write your data into partitioned folders you
lose the sorting, which is unacceptable considering read performance and
on-disk data size.
I have already changed the code locally and it works as expected - but I have
no permissions to create a PR, and I do not know how to get them.
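To see why the sort is lost, here is a minimal pure-Python simulation (no
Spark involved; the row layout and the way tie order is mimicked are
illustrative assumptions, not Spark's actual internals). A sorter that orders
rows only by the partition key makes no promise about the relative order of
rows sharing a key, so a pre-existing secondary sort can be scrambled:

```python
# Pure-Python simulation of why sorting rows by the partition key alone
# can destroy a pre-existing secondary sort. Row layout is illustrative,
# not Spark's actual row format.

rows = [
    {"date": "2016-09-01", "ts": 1},
    {"date": "2016-09-02", "ts": 2},
    {"date": "2016-09-01", "ts": 3},
    {"date": "2016-09-02", "ts": 4},
]  # already sorted by ts

# An external sorter ordering only by the partition key guarantees
# nothing about tie order; mimic that by feeding rows to a stable sort
# in an arbitrary (here: reversed) order.
arbitrary = list(reversed(rows))
spilled = sorted(arbitrary, key=lambda r: r["date"])

per_partition = {}
for r in spilled:
    per_partition.setdefault(r["date"], []).append(r["ts"])

print(per_partition)
# The ts order inside each partition is now descending: the secondary
# sort was lost even though the partition-key order is correct.
```

The point of the simulation is only that sorting by the partition key
constrains the order *between* partitions, not *within* them.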
> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.6.1, 1.6.2, 2.0.0
> Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes scramble an ordered
> dataframe.
> The problem originates in
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are open (configurable), it
> starts inserting rows into an UnsafeKVExternalSorter, then reads all the
> rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter sorts the rows by the partition key only,
> and that can scramble the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and instead put
> the rows in a map, using the partition key as the key and an ArrayList of
> rows as the value, then walk through the keys and write each list in its
> original order - this will probably also be faster, as no sorting is
> needed.
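The map-based fix proposed above can be sketched in pure Python as follows
(a hedged sketch under stated assumptions: write_partitioned and its return
shape are hypothetical names for illustration, not Spark's API; real code
would stream rows to per-partition files rather than return a dict):

```python
# Sketch of the proposed fix: bucket rows into a map keyed by partition
# key, appending in arrival order, then write each bucket out unchanged.
from collections import defaultdict

def write_partitioned(rows, partition_key):
    # partition key -> rows, kept in original arrival order
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[partition_key]].append(row)
    # One "file" per partition key; each keeps the incoming row order,
    # so any pre-existing sort within a partition survives.
    return {key: [r["ts"] for r in bucket] for key, bucket in buckets.items()}

rows = [
    {"date": "2016-09-01", "ts": 1},
    {"date": "2016-09-02", "ts": 2},
    {"date": "2016-09-01", "ts": 3},
]  # sorted by ts
print(write_partitioned(rows, "date"))
# each partition keeps the original (sorted) row order
```

The trade-off is that buckets are held in memory instead of being spilled
through a sorter, so this sketch assumes the open-file batch fits in memory.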
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)