revans2 commented on pull request #29067:
URL: https://github.com/apache/spark/pull/29067#issuecomment-657090408


   @HyukjinKwon sure, I am really proposing one API and one helper that let someone reuse some of the existing code.
   
   1. `CachedBatch` is a trait used to tag data that is intended to be cached. It has a few APIs that let us keep the compression/serialization of the data separate from the metrics about it.
   2. `CachedBatchSerializer` provides the APIs that must be implemented to cache data (see the sketch after this list).
       * `convertForCache` is an API that runs a cached Spark plan and turns its result into an `RDD[CachedBatch]`. The actual caching is done outside of this API.
       * `buildFilter` is an API that takes a set of predicates and builds a filter function that can be used to filter the `RDD[CachedBatch]` returned by `convertForCache`.
       * `decompressColumnar` decompresses an `RDD[CachedBatch]` into an `RDD[ColumnarBatch]`. This is only used for a limited set of data types. These data types may expand in the future; if they do, we can add a new API with a default value that says which data types this serializer supports.
       * `decompressToRows` decompresses an `RDD[CachedBatch]` into an `RDD[InternalRow]`. Like `decompressColumnar`, it decompresses the data in a `CachedBatch`, but it turns the result into `InternalRow`s, typically using code generation for performance reasons.
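
   Here is a rough Scala sketch of what those two pieces might look like, using the names from this description; the exact signatures in the PR may differ.

   ```scala
   // Sketch only: illustrative signatures, not necessarily what the PR ends up with.
   import org.apache.spark.rdd.RDD
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
   import org.apache.spark.sql.execution.SparkPlan
   import org.apache.spark.sql.internal.SQLConf
   import org.apache.spark.sql.vectorized.ColumnarBatch

   // Tags data intended for caching; keeps the compression/serialization of the
   // data separate from the metrics about it.
   trait CachedBatch {
     def numRows: Int
     def sizeInBytes: Long
   }

   // The APIs that must be implemented to plug a new serializer into caching.
   trait CachedBatchSerializer extends Serializable {
     // Run the cached Spark plan and turn its result into an RDD[CachedBatch].
     // The actual caching of that RDD is done outside of this API.
     def convertForCache(cachedPlan: SparkPlan): RDD[CachedBatch]

     // Build a filter function from the predicates that can drop CachedBatches
     // which cannot possibly match, before they are decompressed.
     def buildFilter(
         predicates: Seq[Expression],
         cachedAttributes: Seq[Attribute]): (Int, Iterator[CachedBatch]) => Iterator[CachedBatch]

     // Decompress into columnar batches; only used for a limited set of types.
     def decompressColumnar(
         input: RDD[CachedBatch],
         cacheAttributes: Seq[Attribute],
         selectedAttributes: Seq[Attribute],
         conf: SQLConf): RDD[ColumnarBatch]

     // Decompress into rows, typically using code generation for performance.
     def decompressToRows(
         input: RDD[CachedBatch],
         cacheAttributes: Seq[Attribute],
         selectedAttributes: Seq[Attribute],
         conf: SQLConf): RDD[InternalRow]
   }
   ```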
   
   There are also two helpers, `SimpleMetricsCachedBatch` and `SimpleMetricsCachedBatchSerializer`, that let you reuse the current filtering based on min/max values.
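
   For example, a serializer that only wants to change how the bytes are stored could have its batches carry a stats row and extend these helpers, so the existing min/max pruning keeps working. The names `MyCompressedBatch` and `MyCachedBatchSerializer` below are made up for illustration.

   ```scala
   import org.apache.spark.sql.catalyst.InternalRow

   // Illustrative batch: carries a row of per-column stats (min/max/null count)
   // so the stats-based buildFilter provided by the helper can be reused as-is.
   // sizeInBytes is expected to be derived from the stats row by the helper.
   case class MyCompressedBatch(
       buffers: Array[Array[Byte]],  // one compressed buffer per column
       numRows: Int,
       stats: InternalRow)
     extends SimpleMetricsCachedBatch

   // buildFilter is inherited from the helper; only convertForCache and the
   // two decompress APIs are left to implement.
   abstract class MyCachedBatchSerializer extends SimpleMetricsCachedBatchSerializer
   ```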
   
   These are set up so that a user can explore other types of compression or indexing and their impact on performance. One could look at adding a bloom filter in addition to the min/max values currently supported for filtering. One could also look at adding compression for some of the data types not currently supported, like arrays or structs, if that is an important use case for someone.
   
   The use case we have right now is in connection with 
https://github.com/NVIDIA/spark-rapids where we would like to provide 
compression that is better suited for both the CPU and the GPU.
   
   The config right now is a static conf allowing one and only one cache serializer per session. That can change in the future, but I didn't want to add the extra code to track the serializer along with the RDD until there was a use case for it, and theoretically, if something is a clear win it should just be incorporated into the default serializer instead.
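
   Picking the serializer for a session would then look roughly like this; both the conf key and the class name are assumptions for illustration.

   ```scala
   import org.apache.spark.sql.SparkSession

   // Static conf, so it must be set before the session is created. The key name
   // and the class name here are assumed for the sake of the example.
   val spark = SparkSession.builder()
     .appName("pluggable-cache-serializer")
     .config("spark.sql.cache.serializer", "com.example.MyCachedBatchSerializer")
     .getOrCreate()

   // For this session, df.cache() / CACHE TABLE now goes through the configured
   // serializer instead of the built-in one.
   ```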
   
   
   

