[
https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110053#comment-16110053
]
Tejas Patil commented on SPARK-21595:
-------------------------------------
I introduced this config in SPARK-13450. The default of 4096 was chosen
because, before that change, the code was already using 4096 as the threshold
for switching to `UnsafeExternalSorter` (see WindowExec.scala in
https://github.com/apache/spark/pull/16909/files). I don't have real workloads
that use the WINDOW operator, so I will refrain from proposing a value, but I
am open to changing the default to something that works well for everyone.
After you bumped up the config, how many files were generated? I want to know
what value would effectively create the same number of files as Spark 2.1 did.
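
As a rough sanity check on the numbers in the report quoted below, the spill
count can be estimated from the row count and the threshold. This is only a
back-of-the-envelope sketch: it assumes a single window partition and exactly
one spill file each time the row threshold is crossed, which is what the
quoted log lines suggest but is not confirmed here. The helper names
`estimated_spills` and `threshold_for_target_files` are hypothetical and not
part of Spark.

```python
import math


def estimated_spills(total_rows: int, threshold: int) -> int:
    """Estimate spill files, assuming one spill per `threshold` rows (assumption)."""
    return math.ceil(total_rows / threshold)


def threshold_for_target_files(total_rows: int, target_files: int) -> int:
    """Smallest threshold that keeps the estimate at or below `target_files`."""
    return math.ceil(total_rows / target_files)


ROWS = 450_000_000  # row count given in the report

# Default threshold versus the value the reporter switched to.
for threshold in (4096, 2_097_152):
    print(threshold, estimated_spills(ROWS, threshold))

# Threshold that would match the 152 files observed with Spark 2.1,
# under the same one-spill-per-threshold assumption.
print(threshold_for_target_files(ROWS, 152))
```

Under these assumptions the default of 4096 gives roughly 110,000 spills for
450M rows, in line with the file count in the report. The last figure is only
indicative, because the 152 files seen with Spark 2.1 were not necessarily the
result of a row-count threshold.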
> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2
> breaks existing workflow
> -------------------------------------------------------------------------------------------------
>
> Key: SPARK-21595
> URL: https://issues.apache.org/jira/browse/SPARK-21595
> Project: Spark
> Issue Type: Bug
> Components: Documentation, PySpark
> Affects Versions: 2.2.0
> Environment: pyspark on linux
> Reporter: Stephan Reiling
> Priority: Minor
> Labels: documentation, regression
>
> My pyspark code has the following statement:
> {code:java}
> # assign row key for tracking
> df = df.withColumn(
>     'association_idx',
>     sqlf.row_number().over(
>         Window.orderBy('uid1', 'uid2')
>     )
> )
> {code}
> where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates
> one large window for the whole dataframe to sort over.
> In Spark 2.1 this works without problems; in Spark 2.2 it fails with either
> an out-of-memory exception or a "too many open files" exception, depending
> on the memory settings (which is what I tried adjusting first to fix this).
> Monitoring the blockmgr, I see that Spark 2.1 creates 152 files, while
> Spark 2.2 creates >110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of
> 64.1 MB to disk (0 time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of
> 64.1 MB to disk (1 time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I
> had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill
> threshold of 4096 rows, switching to
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> Which allowed me to track down the issue.
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold 2097152
> {code}
> I got it to work again, with the same performance as Spark 2.1.
> I also have workflows using windowing functions that do not fail but take a
> performance hit due to the excessive spilling with the default of 4096.
> To make these issues easier to track down, I think this config variable
> should be included in the configuration documentation.
> Maybe 4096 is too small a default value?