Victsm commented on issue #24515: [SPARK-14083][WIP] Basic bytecode analyzer to speed up Datasets
URL: https://github.com/apache/spark/pull/24515#issuecomment-491447769

We also have a reasonable collection of Dataset API use cases at LinkedIn, centered mainly on offline feature engineering pipelines that rely on complex transformation logic that is not straightforward to express with DataFrame operations. We are working on a similar prototype to address the Dataset performance issue.

We are trying to strike a balance between gaining the benefits of bytecode analysis and managing its complexity. Instead of trying to fully convert the lambda function into a Catalyst expression, which might run into many corner cases, we focus on identifying which fields of the domain objects are accessed and use that information in the column pruning optimization to cut down serde and I/O overhead.

We would like to join this discussion, offer our two cents, and work with the community to push this Dataset performance effort forward.
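To make the field-access idea concrete, here is a toy sketch. It is only an analogy: it inspects Python bytecode with the standard `dis` module, not JVM bytecode, and it is not the prototype mentioned above or anything in Spark. The names `accessed_fields` and `required_columns` are hypothetical. The point it illustrates is the conservative fallback: whenever the analysis cannot prove which fields are read, it keeps every column.

```python
import dis


def accessed_fields(fn):
    """Return the attribute names read directly off fn's single parameter,
    or None when the analysis cannot be sure (hypothetical helper)."""
    code = fn.__code__
    # Only handle one-argument functions whose parameter is not captured
    # by a nested closure; anything else is treated as "unknown".
    if code.co_argcount != 1:
        return None
    param = code.co_varnames[0]
    if param in code.co_cellvars:
        return None

    instructions = list(dis.get_instructions(fn))
    fields = set()
    for i, ins in enumerate(instructions):
        # Covers LOAD_FAST and its variants in newer CPython versions.
        if not ins.opname.startswith("LOAD_FAST"):
            continue
        loaded = ins.argval if isinstance(ins.argval, tuple) else (ins.argval,)
        if param not in loaded:
            continue
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if len(loaded) == 1 and nxt is not None and nxt.opname in ("LOAD_ATTR", "LOAD_METHOD"):
            fields.add(nxt.argval)
        else:
            # The whole object escapes (passed to a call, returned, ...),
            # so any field might be read later.
            return None
    return fields


def required_columns(schema, fn):
    """Columns that must be read to evaluate fn; falls back to the full
    schema when the analysis is inconclusive (hypothetical helper)."""
    fields = accessed_fields(fn)
    if fields is None or not fields.issubset(schema):
        return list(schema)
    return [c for c in schema if c in fields]


schema = ["id", "age", "country", "bio"]
print(required_columns(schema, lambda m: m.age > 30 and m.country == "US"))
print(required_columns(schema, lambda m: str(m)))
```

In this sketch the first lambda only touches `age` and `country`, so the other columns could be pruned before deserialization. The second passes the whole object to `str`, so the full schema is kept.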
