metasim commented on pull request #31461:
URL: https://github.com/apache/spark/pull/31461#issuecomment-775280219


   > @jnh5y @viirya I got a good question from @marmbrus - can you support 
user-defined types by just defining an Encoder for it? so that it can work in a 
Dataset?
   
   @srowen I'll jump in here, as I've contended with this quite a lot. If I understand the question, the answer is "no", at least not for `TileUDT`, and I doubt it for `AbstractGeometryUDT` (though the use cases are quite different). RasterFrames has [a host of `Encoder`s](https://github.com/locationtech/rasterframes/tree/develop/core/src/main/scala/org/locationtech/rasterframes/encoders), and defaults to `Encoder`s whenever possible. But the computational dynamics of spatiotemporal raster data require further optimizations that only UDTs provide.
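   To make the `Encoder` side of the question concrete, here is a minimal sketch of what "just defining an Encoder" buys you, assuming Spark is on the classpath (`CellGrid` is a hypothetical stand-in, not a RasterFrames type):

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical flat raster representation; NOT RasterFrames' actual Tile.
case class CellGrid(cols: Int, rows: Int, cells: Array[Double])

// A derived product Encoder is enough to make Dataset[CellGrid] work...
val gridEncoder: Encoder[CellGrid] = Encoders.product[CellGrid]

// ...but the schema it derives is fixed by the case class shape, which is
// exactly what makes lazy loading and structural evolution hard with
// Encoders alone:
// spark.createDataset(Seq(CellGrid(2, 2, Array(1, 2, 3, 4))))(gridEncoder)
```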
   
   For `TileUDT`, there are a couple of things going on:
    
   1. Geospatial raster data is what I call "heavy data" (there's not just many 
rows, but those rows are very thick). Raster data is relatively expensive to 
read, consumes a lot of memory, and comes in many encodings and arbitrary 
projections. Often in an analysis a lot of reasoning can be done over the 
metadata of the raster, filtering out rows or providing opportunities for 
logical plan optimization before the raster data is even read into memory. We 
defer reading of the pixels/cells until absolutely necessary. Therefore, behind 
`TileUDT` we are implementing old-school OO information hiding and polymorphism 
to enable lazy evaluation of raster data, 
[implementing](https://github.com/locationtech/rasterframes/blob/0da9aa936fc4dc0e27a91ca5809bc0416b0c4174/core/src/main/scala/org/apache/spark/sql/rf/TileUDT.scala#L83-L89)
 essentially two specializations: logically, a "RasterRefTile" (lazy `Tile`) and a "RealizedTile" (eager `Tile`). Once the plan identifies the need for pixels/cells, the "RasterRefTile" is converted to a "RealizedTile".
   2. The structure of a `TileUDT` has evolved and is still evolving. We need information hiding to keep users from depending on its internal structure, and to give ourselves backwards-compatible migration paths that `Encoder`s alone cannot provide.
   3. Serialization to/from Parquet is UDT-aware, so reading a RasterFrames Parquet file maintains type coherence in a way `Encoder`s alone would not.
   4. Serialization to/from PySpark, while extremely expensive (because Spark's 
Arrow encoder doesn't support UDTs), does allow for nice user-facing ergonomics 
whereby we can have a first-class `Tile` Python class mirroring the JVM 
instance that, again, hides the nasty details of the raster structure while 
providing NumPy and Pandas interoperability. I don't think we'd be able to provide the same ergonomics if we relied on `Encoder`s alone.
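   The lazy/eager split described in point 1 can be sketched roughly as follows (names are illustrative only; RasterFrames' actual hierarchy lives behind `TileUDT` and is more involved):

```scala
// Illustrative sketch, NOT RasterFrames' actual API: a sealed hierarchy
// hides whether the pixel data has been read yet.
sealed trait Tile {
  def cells: Array[Double]
}

// Eager specialization: pixels are already in memory.
final case class RealizedTile(cells: Array[Double]) extends Tile

// Lazy specialization: holds only a thunk (e.g. wrapping a URI and band
// index) and defers the expensive read until cells are first demanded.
final class RasterRefTile(read: () => Array[Double]) extends Tile {
  lazy val cells: Array[Double] = read()
  def realize: RealizedTile = RealizedTile(cells)
}
```

   Because both shapes sit behind one `Tile` type, a plan can filter on metadata and only `realize` the tiles it actually needs.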
   
   I'll note the PySpark use case is also very pertinent to 
[interoperability](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py)
 between GeoMesa's `AbstractGeometryUDT` and Python's Shapely library.
   
   I hope all this helps, and doesn't just confuse things further. In short, we tried to avoid UDTs for quite a while in RasterFrames, but with one of our primary goals being usability by data scientists and analysts (not software engineers), we found UDTs to be the perfect mechanism for hiding the sausage making while still letting us provide reliability and stability in the execution engine.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


