[jira] [Commented] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow

Tejas Patil (JIRA) Tue, 01 Aug 2017 17:41:41 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110053#comment-16110053
 ]


Tejas Patil commented on SPARK-21595:
-------------------------------------

This config was introduced by me in SPARK-13450. The reason why 4096 was used 
is because before the change it was using 4096 as threshold to switch to 
`UnsafeExternalSorter` (see WindowExec.scala in 
https://github.com/apache/spark/pull/16909/files). I don't have real workloads 
which use WINDOW operator so would defer from proposing a value but I am open 
to change the default value to something that works well for everyone. After 
you bumped up the config, how many files were generated ? I want to know what 
value would effectively create the same number of files as spark 2.1 did.

> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 
> breaks existing workflow
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21595
>                 URL: https://issues.apache.org/jira/browse/SPARK-21595
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, PySpark
>    Affects Versions: 2.2.0
>         Environment: pyspark on linux
>            Reporter: Stephan Reiling
>            Priority: Minor
>              Labels: documentation, regression
>
> My pyspark code has the following statement:
> {code:java}
> # assign row key for tracking
> df = df.withColumn(
>         'association_idx',
>         sqlf.row_number().over(
>             Window.orderBy('uid1', 'uid2')
>         )
>     )
> {code}
> where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates 
> one large window for the whole dataframe to sort over.
> In spark 2.1 this works without problem, in spark 2.2 this fails either with 
> out of memory exception or too many open files exception, depending on memory 
> settings (which is what I tried first to fix this).
> Monitoring the blockmgr, I see that spark 2.1 creates 152 files, spark 2.2 
> creates >110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (0  time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (1  time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I 
> had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
> threshold of 4096 rows, switching to 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> Which allowed me to track down the issue. 
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold   2097152
> {code}
> I got it to work again and with the same performance as spark 2.1.
> I have workflows where I use windowing functions that do not fail, but took a 
> performance hit due to the excessive spilling when using the default of 4096.
> I think to make it easier to track down these issues this config variable 
> should be included in the configuration documentation. 
> Maybe 4096 is too small of a default value?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow

Reply via email to