Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh
Yes, in this way we don't change semantics of `miniBatchFraction`. But is
the way it is defined now actually correct? As I mentioned above, in the
`upstram/master` the number of non-empty documents in the mini-batch is
asymptotically normally distributed. Hence, the size of RDD fed to
`treeAggregate` differs from one batch to another. While in this PR (filtering
before sampling) all the batches have the same length.
But then again, for large datasets this should make no difference. If
nobody thinks this disparity of batch sizes is an issue, I won't object against
sampling before filtering.
@WeichenXu123, I believe, you were advocating for filter before sample. Do
you still have the preference?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]