zhengruifeng commented on a change in pull request #35250:
URL: https://github.com/apache/spark/pull/35250#discussion_r787726536
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -1344,7 +1360,13 @@ case class Sample(
s"Sampling fraction ($fraction) must be on interval [0, 1] without replacement")
}
- override def maxRows: Option[Long] = child.maxRows
Review comment:
I am not sure whether this is wrong. In principle, sampling with replacement
should not generate more rows than the input dataset.
But we cannot implement strict sampling with replacement, so `PoissonSampler`
is used instead, and it cannot guarantee this property. For example:
```
scala> val df = spark.range(0, 1000)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.count
res0: Long = 1000
scala> df.sample(true, 0.999999, 10).count
res1: Long = 1004
```
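To illustrate why the count can exceed the input size, here is a rough standalone sketch. It is not Spark's `PoissonSampler`; `PoissonSampleSketch`, `poisson`, and `sampledCount` are hypothetical names used only for illustration. It assumes each input row is emitted `k` times, with `k` drawn independently from Poisson(fraction). Under that assumption the total is random, with a mean of about `n * fraction`, so it can land above `n`.

```scala
import scala.util.Random

// Hypothetical illustration only; this is not Spark's implementation.
object PoissonSampleSketch {
  // Knuth's method for drawing from Poisson(lambda); adequate for small lambda.
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Each of the n rows is emitted poisson(fraction) times; return the total.
  def sampledCount(n: Int, fraction: Double, seed: Long): Long = {
    val rng = new Random(seed)
    (0 until n).map(_ => poisson(fraction, rng).toLong).sum
  }

  def main(args: Array[String]): Unit = {
    val n = 1000
    val counts = (0L until 20L).map(seed => sampledCount(n, 0.999999, seed))
    println(s"input rows: $n, sampled counts: ${counts.mkString(", ")}")
    println(s"runs exceeding input size: ${counts.count(_ > n)}")
  }
}
```

With a fraction close to 1, the total should be centered near `n`, so a noticeable share of runs would be expected to exceed the input size, which matches the `1004` seen above.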
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]