GitHub user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/21498
@viirya sorry, I somehow missed your updated benchmark. Yes, it makes sense.
In the case without any shuffle needed after the union we have about a 2%
performance regression. I am not sure about the reliability of the tests with
`sample` as they may return a different number of rows IIUC. Can we remove the
two sample operations and leave just the filter?
Moreover, I think it would also be interesting to understand how much of the
time is spent in collecting, for instance. If the time to collect the data to
the driver is very high, then the actual performance regression would be much
higher in percentage than the one we measure. Honestly, though, I am not sure
how to estimate it properly. Do you have any idea about this?
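
To make the concern concrete, here is a rough sketch of the arithmetic. It assumes a simple model that is not taken from the benchmark itself: the run is split into a fixed part the change does not touch (e.g. collecting to the driver) and the remaining work, and only the remaining work gets slower. The function name and the sample fractions are hypothetical, chosen for illustration.

```python
def regression_excluding_fixed_cost(measured_regression, fixed_fraction):
    """Estimate the slowdown of the part of the run the change can affect.

    Assumed model: baseline time T = fixed + work, with fixed = f * T.
    If only `work` slows down by a factor (1 + x), the measured slowdown
    of the whole run is r = (1 - f) * x, hence x = r / (1 - f).

    measured_regression: relative slowdown of the whole run (0.02 for 2%).
    fixed_fraction: share f of the baseline time spent in the untouched
        part, e.g. collecting results to the driver (0 <= f < 1).
    """
    if not 0.0 <= fixed_fraction < 1.0:
        raise ValueError("fixed_fraction must be in [0, 1)")
    return measured_regression / (1.0 - fixed_fraction)


if __name__ == "__main__":
    # Hypothetical fractions of time spent collecting; not measured values.
    for fraction in (0.0, 0.5, 0.9):
        estimate = regression_excluding_fixed_cost(0.02, fraction)
        print(f"collect share {fraction:.0%}: regression ~{estimate:.1%}")
```

Under this model, a measured 2% regression would correspond to roughly 4% if half of the time is spent collecting, and roughly 20% if collecting takes 90% of the time. The hard part remains measuring the collect share reliably.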
@cloud-fan @kiszk what do you think?