srowen opened a new pull request #24226: [SPARK-26660][FOLLOWUP] Raise task serialized size warning threshold to 1000 KiB
URL: https://github.com/apache/spark/pull/24226

## What changes were proposed in this pull request?

Raise the serialized task size threshold at which a warning is generated from 100 KiB to 1000 KiB.

As several people have noted, the original change for this JIRA highlighted that this threshold is too low. Test output regularly shows:

```
- sorting on StringType with nullable=false, sortOrder=List('a DESC NULLS LAST)
22:47:53.320 WARN org.apache.spark.scheduler.TaskSetManager: Stage 80 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
22:47:53.348 WARN org.apache.spark.scheduler.TaskSetManager: Stage 81 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
22:47:53.417 WARN org.apache.spark.scheduler.TaskSetManager: Stage 83 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
22:47:53.444 WARN org.apache.spark.scheduler.TaskSetManager: Stage 84 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
...
- SPARK-20688: correctly check analysis for scalar sub-queries
22:49:10.314 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.8 KiB
- SPARK-21835: Join in correlated subquery should be duplicateResolved: case 1
22:49:10.595 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB
22:49:10.744 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB
22:49:10.894 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB
- SPARK-21835: Join in correlated subquery should be duplicateResolved: case 2
- SPARK-21835: Join in correlated subquery should be duplicateResolved: case 3
- SPARK-23316: AnalysisException after max iteration reached for IN query
22:49:11.559 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 154.2 KiB
```

It seems that a larger threshold of about 1MB is more suitable.

## How was this patch tested?

Existing tests.
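
For illustration only, the effect of the threshold change on the sizes quoted in the test output above can be sketched as follows. This is not Spark's code: the constant names, the `should_warn` helper, and the strict greater-than comparison are assumptions made for the example.

```python
# Illustrative sketch only; names and comparison are hypothetical, not Spark's.
OLD_THRESHOLD_KIB = 100    # threshold before this change
NEW_THRESHOLD_KIB = 1000   # threshold proposed in this change


def should_warn(size_kib: float, threshold_kib: int) -> bool:
    """Return True when a serialized size exceeds the warning threshold.

    Assumes a strict greater-than comparison.
    """
    return size_kib > threshold_kib


# Sizes (in KiB) taken from the warnings in the test output quoted above.
observed_sizes_kib = [755, 150.8, 150.7, 154.2]

old_warnings = [s for s in observed_sizes_kib if should_warn(s, OLD_THRESHOLD_KIB)]
new_warnings = [s for s in observed_sizes_kib if should_warn(s, NEW_THRESHOLD_KIB)]

print(f"warnings at {OLD_THRESHOLD_KIB} KiB: {len(old_warnings)}")
print(f"warnings at {NEW_THRESHOLD_KIB} KiB: {len(new_warnings)}")
```

Under these assumptions, every size in the sample exceeds the old threshold and none exceeds the new one, which matches the intent of silencing these routine warnings in test output.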
