Github user colorant commented on the pull request:
https://github.com/apache/spark/pull/916#issuecomment-45981871
@dorx Do you think this works for an extremely large data set with a really small
sample size? E.g. n = 1.0x10^11 while sample = 1? In that case, the final
adjusted fraction comes out to around 1.2x10^-9; in theory, that still gives a
99.99% chance of getting the sample. But since Double also has precision issues,
do you think that is enough to guarantee the 99.99% chance under this extreme
condition? The reason I ask is that, in this very case, the original code,
(3 x (1+1)) / total, gives a fraction of around 6x10^-10, just about half that
of the new code, and under that fraction value it keeps looping forever and
never gets a chance to return that one sample.
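
For reference, here is a minimal sketch of the "by theory" probability math above, assuming a simple per-element Bernoulli sampling model (each of the n elements is kept independently with probability p, so P(at least one sample) = 1 - (1 - p)^n). The two fraction values are the ones quoted above, and `pAtLeastOne` is just a hypothetical helper name, not anything from the PR:

```scala
object FractionCheck {
  def main(args: Array[String]): Unit = {
    val n = 1.0e11      // total data set size from the example above

    // Fractions quoted above: the new adjusted fraction, and the one
    // produced by the original (3 x (1+1)) / total formula.
    val newFraction = 1.2e-9
    val oldFraction = 6.0e-10

    // Assumed model: per-element Bernoulli sampling with probability p,
    // so P(at least one sample) = 1 - (1 - p)^n.
    def pAtLeastOne(p: Double, n: Double): Double =
      1.0 - math.pow(1.0 - p, n)

    println(f"new fraction: $newFraction%.1e -> P(>=1) = ${pAtLeastOne(newFraction, n)}%.10f")
    println(f"old fraction: $oldFraction%.1e -> P(>=1) = ${pAtLeastOne(oldFraction, n)}%.10f")
  }
}
```

Note that under this idealized model both fractions should yield at least one sample with near certainty, which is exactly why the observed infinite loop makes me suspect something outside the theory, such as Double precision.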