revans2 commented on pull request #29067:
URL: https://github.com/apache/spark/pull/29067#issuecomment-657544856


   @maropu 
   
   > You cannot use the data source V2 interface for your purpose?
   
   We want to produce a transparent replacement for `.cache`, `.persist`, and 
the SQL `CACHE` operator that uses GPUs for acceleration.  Caching data right 
now is slow on the CPU.  `.cache` is a common enough operation that we would 
like to support it in our plugin, and we feel it is something we can really 
accelerate.  In theory, we could reuse the data source V2 API, but it would 
require a lot more refactoring to make it fit into the cache operator, and 
possibly refactoring of the data source V2 API as well. If that is what you 
think we need to do, I can work on it, but it will be a much bigger change.
   
   > The current cache structure is tightly coupled to the Spark internal one, 
so I'm not sure that we can directly expose it to 3rd-party developers.
   
   Yes, many of the APIs used by this code are internal to Spark and are 
subject to change at any moment. That is no different from many other plugin 
APIs, like the ones we use to enable our GPU-accelerated dataframe plugin or 
the callback APIs currently used for metrics. If you would like me to do more 
to document that these APIs, and the APIs they depend on, are unstable and can 
change, I am happy to do it.
   
   > If `SPARK_CACHE_SERIALIZER` specified, does the current approach replace 
all the existing caching operations with custom ones?
   
   `SPARK_CACHE_SERIALIZER` is a static conf, so if you specify it, it will be 
used for all cache operations within that session. The cache is not shared 
between sessions, so I felt that was reasonable because there is no regression 
in functionality.
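   
   As a rough sketch of what that looks like in practice, the serializer would 
be set once when the application starts, for example in `spark-defaults.conf`. 
The key `spark.sql.cache.serializer` is an assumption about how 
`SPARK_CACHE_SERIALIZER` is registered, and 
`com.example.GpuCachedBatchSerializer` is a hypothetical class name used only 
for illustration:
   
   ```
   # spark-defaults.conf (sketch, not taken from the PR)
   # Assumption: SPARK_CACHE_SERIALIZER maps to this key.
   # Hypothetical class: stands in for whatever serializer a plugin provides.
   spark.sql.cache.serializer    com.example.GpuCachedBatchSerializer
   ```
   
   Because the conf is static, it has to be in place before the session is 
created; it cannot be changed afterwards with a runtime `SET`.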
   
   > Users cannot select which cache structure (default or custom) is used on 
runtime?
   
   A user cannot switch between modes within the same session, but each Spark 
session can have a separate setting. The goal was to prevent users from 
changing the setting after caching data. In theory, the code as written 
should be able to handle the setting changing at any point in time, except for 
the code that loads the plugin as a singleton. That use case is not one I am 
currently concerned about, so I didn't want to add tests for it or commit to 
supporting it. If you really want this use case to be supported, I can make 
the needed changes and test/document it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
