zhengruifeng commented on a change in pull request #35250:
URL: https://github.com/apache/spark/pull/35250#discussion_r787726536



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -1344,7 +1360,13 @@ case class Sample(
       s"Sampling fraction ($fraction) must be on interval [0, 1] without replacement")
   }
 
-  override def maxRows: Option[Long] = child.maxRows

Review comment:
   I am not sure whether this is wrong. In principle, sampling with replacement should not generate more rows than the input dataset.
   
   However, we cannot implement strict sampling with replacement, so `PoissonSampler` is used instead, and it cannot guarantee this property.
   
   ```
   scala> val df = spark.range(0, 1000)
   df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
   
   scala> df.count
   res0: Long = 1000
   
   scala> df.sample(true, 0.999999, 10).count
   res1: Long = 1004
   ```
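
To illustrate why the count can exceed the input size, here is a minimal sketch of Poisson-style sampling in plain Scala. It is a simplified model and not Spark's actual `PoissonSampler` code: the object `PoissonSampleSketch` and the helpers `poisson` and `sampleWithReplacement` are hypothetical names introduced for this example. The assumption is that each input row is emitted an independent Poisson(`fraction`) number of times, so the total row count is itself Poisson-distributed with mean `n * fraction` and has no upper bound of `n`.

```scala
import scala.util.Random

object PoissonSampleSketch {
  // Knuth's algorithm: draw one Poisson(lambda) variate.
  // Adequate for small lambda such as a sampling fraction near 1.
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Emit each input row an independent Poisson(fraction) number of times.
  // Nothing here caps the output at rows.size.
  def sampleWithReplacement[T](rows: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    rows.flatMap(row => Seq.fill(poisson(fraction, rng))(row))
  }

  def main(args: Array[String]): Unit = {
    val rows = 0L until 1000L
    // Print the sampled row count for a few seeds; the counts scatter around 1000.
    val counts = (0L until 20L).map(seed => sampleWithReplacement(rows, 0.999999, seed).size)
    println(counts.mkString(", "))
  }
}
```

With `fraction` close to 1, the expected count is about 1000 and the standard deviation is about 32, so under this model roughly half of the seeds would be expected to yield more than 1000 rows. That is consistent with the 1004 observed above, and suggests that `child.maxRows` is not a safe upper bound when `withReplacement` is true.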




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


