LucaCanali commented on issue #26953: URL: https://github.com/apache/spark/pull/26953#issuecomment-617369628
I would not worry much about the performance impact of this additional instrumentation, since it hooks into a path that is already slow: JVM-Python serialization/deserialization. Moreover, the instrumentation mostly just records timing values, and does so once per batch of serialized rows rather than per row, which should further reduce the impact on total throughput. So far I have only tested this manually and did not observe any noticeable impact. If we have a Python UDF benchmark, I could test further with that.
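
To illustrate the per-batch argument, here is a minimal sketch. It is not the code from this PR: `process_batches` is a hypothetical helper, and `pickle` only stands in for the JVM-Python serialization step. It shows that timing once per batch costs two clock reads per batch, regardless of how many rows the batch contains.

```python
import pickle
import time


def process_batches(rows, batch_size):
    """Round-trip rows through pickle in batches, timing each batch.

    Hypothetical stand-in for JVM-Python serialization; the point is that
    the timer is read twice per batch, not twice per row.
    """
    total_ns = 0
    batches = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        t0 = time.perf_counter_ns()          # one clock read before the batch
        payload = pickle.dumps(batch)        # "serialize" the whole batch
        pickle.loads(payload)                # "deserialize" it again
        total_ns += time.perf_counter_ns() - t0  # one clock read after
        batches += 1
    return total_ns, batches


rows = [(i, str(i)) for i in range(10_000)]
total_ns, batches = process_batches(rows, batch_size=100)
clock_reads = 2 * batches  # timing overhead scales with batches, not rows
print(f"{len(rows)} rows, {batches} batches, {clock_reads} clock reads, "
      f"{total_ns / 1e6:.3f} ms in serialization")
```

With 10,000 rows and a batch size of 100 (both arbitrary values chosen for the sketch), the timer is read 200 times instead of 20,000. An actual Python UDF benchmark would still be needed to quantify the impact in Spark itself.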
