gaoyangxiaozhu commented on issue #5315:
URL: 
https://github.com/apache/incubator-gluten/issues/5315#issuecomment-2102615417

   Hey @zhouyuan / @FelixYBW / @rui-mo,
   
   I'm starting to work on this feature. Here is a quick draft of my idea, 
along with one basic but crucial question that I need your help confirming 
before I start the detailed design doc and the code implementation.
   
   Velox has no sample node, but it does support random sampling pushdown 
through Jimmy's PR [Table sampling push 
down](https://github.com/facebookincubator/velox/commit/3d9cf528b065052e3d7ff6d0603035f5b56ebfc3#diff-58b64d1e01b72d7e092a092708f13e1a2785f0151709bce18b1c736a0c8d28ee).
 It accelerates random sampling based on Bernoulli trials by pushing the 
sampling operation down to the table scan.
   
   So my current idea is to leverage the existing sample filter pushdown 
logic: transform the vanilla Spark `SampleExec` node into a filter node, with 
the sample operation expressed as a sample filter expression, and ultimately 
push the random sampling filter down into the scan filter.
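
To make the rewrite concrete, here is a toy Python sketch of the idea (a sample node replaced by a filter that the scan applies while reading). It is only an illustration: `make_sample_filter` and `scan` are hypothetical names, not Gluten, Spark, or Velox APIs, and Python's `random.Random` stands in for whatever generator the engine uses.

```python
import random

def make_sample_filter(fraction, seed):
    """Hypothetical helper: a stateful row predicate that keeps a row with probability `fraction`."""
    rng = random.Random(seed)
    return lambda _row: rng.random() < fraction

def scan(rows, pushed_filters):
    """Toy table scan that applies pushed-down filters while reading rows."""
    for row in rows:
        if all(f(row) for f in pushed_filters):
            yield row

# Plan before the rewrite: Sample(fraction, seed) <- Scan
# Plan after the rewrite:  Scan(filters=[sample_filter])
rows = range(1000)
sampled = list(scan(rows, [make_sample_filter(0.1, seed=7)]))
```

The point of the sketch is only that the sample becomes an ordinary predicate from the scan's point of view, so no separate sample operator is needed.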
   
   The issue is that Spark uses the 
[XORShiftRandom](https://github.com/apache/spark/blob/207d675110e6fa699a434e81296f6f050eb0304b/core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala#L57C1-L58C49)
 pseudorandom number generator and a [Bernoulli trials based 
sampler](https://github.com/apache/spark/blob/207d675110e6fa699a434e81296f6f050eb0304b/core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala#L116C1-L123C6)
 for sampling, while Velox uses the `__gnu_cxx::sfmt19937` random number 
generator with a geometric distribution for sampling (see 
[RandomUtil](https://github.com/facebookincubator/velox/blob/main/velox/common/base/RandomUtil.h#L93C1-L95C33)).
 Thus, even with the same `fraction` and `seed` (both of which the user can 
specify), the rows sampled by Velox will differ from those sampled by vanilla 
Spark.
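
A rough Python sketch of why the results diverge, assuming the two strategies work as described above. This is not Spark's or Velox's actual code: `random.Random` stands in for both XORShiftRandom and sfmt19937, and both function names are made up for illustration. Even with an identical generator, seed, and fraction, per-row Bernoulli trials and geometric skip sampling consume random numbers differently, so they select different rows while keeping roughly the same fraction.

```python
import math
import random

def bernoulli_sample(n_rows, fraction, seed):
    """Keep each row independently with probability `fraction` (one draw per row)."""
    rng = random.Random(seed)
    return [i for i in range(n_rows) if rng.random() < fraction]

def geometric_skip_sample(n_rows, fraction, seed):
    """Draw the gap to the next kept row from a geometric distribution (one draw per kept row)."""
    rng = random.Random(seed)
    kept = []
    i = -1
    while True:
        u = 1.0 - rng.random()  # u is in (0, 1], which avoids log(0)
        gap = int(math.log(u) / math.log(1.0 - fraction))  # rows skipped, >= 0
        i += gap + 1
        if i >= n_rows:
            return kept
        kept.append(i)

a = bernoulli_sample(100_000, 0.1, seed=42)
b = geometric_skip_sample(100_000, 0.1, seed=42)
print(len(a), len(b), a == b)
```

Both samples are statistically valid for the requested fraction (assumed here to be strictly between 0 and 1); what cannot be reproduced is the exact set of rows.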
   
   So my basic but crucial question is whether it is acceptable for this kind 
of correctness difference to exist in the sample scenario when it is offloaded 
to Velox.
   
   If it is acceptable, I'll start by drafting a design document in the 
Gluten channel and then proceed with the code implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

