[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting

Ran Haim (JIRA) Mon, 14 Nov 2016 07:46:22 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664233#comment-15664233
 ]


Ran Haim edited comment on SPARK-17436 at 11/14/16 3:45 PM:
------------------------------------------------------------

So...
Can you help?
Or at least point me to someone who can?

BTW On spark-core I get test errors like these:
org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite)
  Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class 
org.apache....
  Run 2: JavaAPISuite.tearDown:92 NullPointer

I think it has something to do with this error that I also see:
Error while locating file spark-version-info.properties


was (Author: [email protected]):
So...
Can you help?
Or at least point me to someone who can?

BTW On spark-core I get test errors like these:
org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite)
  Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class 
org.apache....
  Run 2: JavaAPISuite.tearDown:92 NullPointer

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partition by,  datawriter can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method when too many files are opened (configurable), it 
> starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
> again from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort if 
> you will).
> I think the best way to fix it is to stop using a sorter, and just put the 
> rows in a map using key as partition key and value as an arraylist, and then 
> just walk through all the keys and write it in the original order - this will 
> probably be faster as there no need for ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting

Reply via email to