Victsm commented on issue #24515: [SPARK-14083][WIP] Basic bytecode analyzer to 
speed up Datasets
URL: https://github.com/apache/spark/pull/24515#issuecomment-491447769
 
 
   We also have a fair number of Dataset API use cases at LinkedIn, 
especially around offline feature engineering pipelines, which rely on 
complex transformation logic that is not straightforward to express with 
DataFrame operations. We are also working on a similar prototype to address 
Dataset performance issues. We are trying to strike a balance between gaining 
the benefits of bytecode analysis and managing its complexity. Instead of 
trying to fully convert the lambda function into a Catalyst expression, which 
might run into many corner cases, we focus on identifying which fields of the 
domain objects are accessed and use that information in column pruning to cut 
down serde and I/O overhead. We would like to join this discussion, offer our 
2 cents, and work with the community to push this Dataset performance effort 
forward.
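
To make the field-access idea concrete, here is a rough sketch of the general technique. It is not the LinkedIn prototype or anything from this PR: it uses Python's standard `dis` module as a stand-in for JVM bytecode analysis, and the `Member` type, its field names, and the `accessed_fields` and `prune` helpers are all hypothetical. It scans a lambda's bytecode for attribute loads, keeps the names that match the schema, and only materializes those columns.

```python
import dis
from collections import namedtuple

# Hypothetical domain object; the field names are illustrative only.
Member = namedtuple("Member", ["member_id", "age", "country", "profile_blob"])


def accessed_fields(func, schema_fields):
    """Return the schema fields named in attribute loads in func's bytecode.

    Simplification: this does not verify that the attribute is read from the
    lambda's parameter, and it does not descend into nested functions.
    """
    attr_ops = {"LOAD_ATTR", "LOAD_METHOD"}  # LOAD_METHOD exists on older CPython
    return {
        ins.argval
        for ins in dis.get_instructions(func)
        if ins.opname in attr_ops and ins.argval in schema_fields
    }


def prune(rows, fields):
    """Keep only the requested columns (stand-in for pruning at the scan)."""
    return [{f: row[f] for f in fields} for row in rows]


# The user lambda only touches two of the four fields.
is_adult_in_us = lambda m: m.age >= 18 and m.country == "US"

needed = accessed_fields(is_adult_in_us, set(Member._fields))

rows = [
    {"member_id": 1, "age": 34, "country": "US", "profile_blob": "x" * 1000},
    {"member_id": 2, "age": 16, "country": "US", "profile_blob": "y" * 1000},
]

# Deserialize only the needed columns into a narrower object, then run the lambda.
Pruned = namedtuple("Pruned", sorted(needed))
pruned_rows = [Pruned(**r) for r in prune(rows, needed)]
matches = [r for r in pruned_rows if is_adult_in_us(r)]
```

The lambda itself is never translated into an expression here; it still runs as opaque code, just against a narrower object. A real implementation would need to be conservative and fall back to reading every field whenever the analysis cannot account for how the object is used.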

