rxin commented on issue #24515: [SPARK-14083][WIP] Basic bytecode analyzer to 
speed up Datasets
URL: https://github.com/apache/spark/pull/24515#issuecomment-489378015
 
 
   As the person who initially filed the ticket, I actually no longer believe 
in it, for the following reasons:
   
   1. Usage of the typed Dataset API is very small. On Databricks, which covers 
thousands of organizations, roughly 1% of workloads use the typed Dataset 
API. We didn't particularly encourage users to go one way or the other; they 
just ended up mostly using the untyped DataFrame API. So the number of users 
this would benefit would be small.
   
   2. It is really difficult to get this working well. Collectively I think we 
have sunk over two person-years into this with some pretty strong engineers, and 
the prototype we had was still bad enough that we decided not to ship it in 
production and eventually deleted all the code from our codebase. It is very 
easy to make a couple of simple programs work to demo this feature, but users get 
confused when they hit a performance cliff because a very simple addition to 
their program now breaks the optimization.
   
   3. There's significant maintenance overhead, maybe the highest in all of 
Spark. This part alone, if you take it to the extreme, would be like building a 
brand new JVM. The number of people who would be able to understand the 
codebase would be tiny.
   
   
   

