Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22219#discussion_r212664399

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
    @@ -348,30 +350,30 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ
             // Otherwise, interpolate the number of partitions we need to try, but overestimate
             // it by 50%. We also cap the estimation in the end.
             val limitScaleUpFactor = Math.max(sqlContext.conf.limitScaleUpFactor, 2)
    -        if (buf.isEmpty) {
    +        if (scannedRowCount == 0) {
               numPartsToTry = partsScanned * limitScaleUpFactor
             } else {
    -          val left = n - buf.size
    +          val left = n - scannedRowCount
               // As left > 0, numPartsToTry is always >= 1
    -          numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.size).toInt
    +          numPartsToTry = Math.ceil(1.5 * left * partsScanned / scannedRowCount).toInt
    --- End diff --

    nit: Would it be better to define `numPartsToTry` as a `val`?
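For illustration, here is a minimal, self-contained Scala sketch of what the nit might look like: the estimate is computed once as a `val` expression instead of reassigning a `var`. The wrapper function `estimateNumPartsToTry` and its parameter list are hypothetical, not part of the actual patch; the names in the body mirror the diff above.

```scala
// Hypothetical sketch (not the actual Spark patch): compute the next
// partition count to try as a single `val` expression. Parameter names
// mirror the diff above.
def estimateNumPartsToTry(
    n: Int,                       // total rows requested by the limit
    partsScanned: Long,           // partitions scanned so far
    scannedRowCount: Int,         // rows collected so far
    confLimitScaleUpFactor: Int   // stand-in for sqlContext.conf.limitScaleUpFactor
  ): Long = {
  val limitScaleUpFactor = Math.max(confLimitScaleUpFactor, 2)
  if (scannedRowCount == 0) {
    // Nothing found yet: scale the partition count up aggressively.
    partsScanned * limitScaleUpFactor
  } else {
    val left = n - scannedRowCount
    // As left > 0, the result is always >= 1; overestimate by 50%.
    Math.ceil(1.5 * left * partsScanned / scannedRowCount).toLong
  }
}

// e.g. estimateNumPartsToTry(n = 100, partsScanned = 4,
//   scannedRowCount = 10, confLimitScaleUpFactor = 4)
// => ceil(1.5 * 90 * 4 / 10) = 54 partitions to try next
```

Expressing the branch as a single `val` avoids the reassignment and makes it clearer that the two arms are alternative ways of producing the same estimate.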