gaoyangxiaozhu commented on issue #5315: URL: https://github.com/apache/incubator-gluten/issues/5315#issuecomment-2102615417
Hey @zhouyuan / @FelixYBW / @rui-mo, I'm starting to work on this feature. Here's a quick draft of my idea, plus one basic but crucial question I need your help to confirm before I start the detailed design doc and code implementation.

Velox has no sample node, but it does support random sampling pushdown through Jimmy's PR [Table sampling push down](https://github.com/facebookincubator/velox/commit/3d9cf528b065052e3d7ff6d0603035f5b56ebfc3#diff-58b64d1e01b72d7e092a092708f13e1a2785f0151709bce18b1c736a0c8d28ee), which accelerates random sampling based on Bernoulli trials by pushing the sampling operation down to the table scan.

So my current idea is to leverage the existing sample filter pushdown logic: transform the vanilla Spark `SampleExec` node into a filter node, with the sample operation expressed as a sample filter expression, and ultimately push that random sampling filter down to the scan filter.

The issue is that Spark uses the [XORShiftRandom](https://github.com/apache/spark/blob/207d675110e6fa699a434e81296f6f050eb0304b/core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala#L57C1-L58C49) pseudorandom number generator with a [Bernoulli-trial-based sampler](https://github.com/apache/spark/blob/207d675110e6fa699a434e81296f6f050eb0304b/core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala#L116C1-L123C6), while Velox uses the `__gnu_cxx::sfmt19937` random number generator with a geometric distribution; see [RandomUtil](https://github.com/facebookincubator/velox/blob/main/velox/common/base/RandomUtil.h#L93C1-L95C33). Thus, even with the same `fraction` and `seed` (which the user can specify), the rows sampled by Velox will differ from those sampled by vanilla Spark.

So my basic but crucial question is whether it's acceptable for such a correctness difference to exist in the sample scenario when offloading to Velox.
If it is acceptable, I'll then start by drafting a design document in the gluten channel and then proceed with the code implementation.
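
To make the mismatch concrete, here is a minimal Python sketch of the two sampling styles mentioned above. It is only an illustration under stated assumptions: the function names are hypothetical, both functions use Python's built-in `random.Random` rather than XORShiftRandom or `sfmt19937`, and neither reproduces the actual Spark or Velox code. Even with the same generator, `seed`, and `fraction`, a per-row Bernoulli trial and a geometric "skip to the next kept row" approach select different rows, while the sample size stays close to `fraction * n_rows` in both cases. Using different generators, as Spark and Velox do, adds a second source of divergence.

```python
import math
import random


def bernoulli_sample(n_rows, fraction, seed):
    """Per-row Bernoulli trial: keep row i when a uniform draw is below fraction.

    Hypothetical helper for illustration; not Spark's implementation.
    """
    rng = random.Random(seed)
    return [i for i in range(n_rows) if rng.random() < fraction]


def geometric_skip_sample(n_rows, fraction, seed):
    """Skip-based sampling: draw the gap to the next kept row from a
    geometric distribution instead of testing every row.

    Hypothetical helper for illustration; not Velox's implementation.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    log_q = math.log(1.0 - fraction)
    kept = []
    i = -1
    while True:
        u = 1.0 - rng.random()          # uniform in (0, 1], avoids log(0)
        gap = int(math.log(u) / log_q)  # rows skipped before the next kept row
        i += gap + 1
        if i >= n_rows:
            return kept
        kept.append(i)


if __name__ == "__main__":
    n_rows, fraction, seed = 100_000, 0.1, 42
    a = bernoulli_sample(n_rows, fraction, seed)
    b = geometric_skip_sample(n_rows, fraction, seed)
    print("bernoulli sample size:", len(a))
    print("geometric sample size:", len(b))
    print("identical row sets:", a == b)
    print("rows selected by both:", len(set(a) & set(b)))
```

Each function is deterministic for a given seed, so the difference between them is not run-to-run noise: the two approaches consume the random stream differently, so the same seed maps to different selected rows.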
