metasim commented on pull request #31461:
URL: https://github.com/apache/spark/pull/31461#issuecomment-775280219


   > @jnh5y @viirya I got a good question from @marmbrus - can you support 
user-defined types by just defining an Encoder for it? so that it can work in a 
Dataset?
   
   @srowen I'll jump in here, as I've contended with this quite a lot. If I understand the question, the answer is "no", at least not for `TileUDT`, and I doubt it for `AbstractGeometryUDT` (though the use cases are quite different). RasterFrames has [a host of `Encoder`s](https://github.com/locationtech/rasterframes/tree/develop/core/src/main/scala/org/locationtech/rasterframes/encoders), and defaults to `Encoder`s whenever possible. But the computational dynamics of spatiotemporal raster data require further optimizations that only UDTs provide.
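   To make the `Encoder` side of the question concrete, here is a minimal sketch of what "just defining an Encoder" buys you, assuming Spark is on the classpath (`CellGrid` is a hypothetical stand-in, not a RasterFrames type):

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical flat raster representation; NOT RasterFrames' actual Tile.
case class CellGrid(cols: Int, rows: Int, cells: Array[Double])

// A derived product Encoder is enough to make Dataset[CellGrid] work...
val gridEncoder: Encoder[CellGrid] = Encoders.product[CellGrid]

// ...but the schema it derives is fixed by the case class shape, which is
// exactly what makes lazy loading and structural evolution hard with
// Encoders alone:
// spark.createDataset(Seq(CellGrid(2, 2, Array(1, 2, 3, 4))))(gridEncoder)
```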
   
   For `TileUDT`, there are a couple of things going on:
    
   1. Geospatial raster data is what I call "heavy data" (there's not just many 
rows, but those rows are very thick). Raster data is relatively expensive to 
read, consumes a lot of memory, and comes in many encodings and arbitrary 
projections. Often in an analysis a lot of reasoning can be done over the 
metadata of the raster, filtering out rows or providing opportunities for 
logical plan optimization before the raster data is even read into memory. We 
defer reading of the pixels/cells until absolutely necessary. Therefore, behind 
`TileUDT` we are implementing old-school OO information hiding and polymorphism 
to enable lazy evaluation of raster data, 
[implementing](https://github.com/locationtech/rasterframes/blob/0da9aa936fc4dc0e27a91ca5809bc0416b0c4174/core/src/main/scala/org/apache/spark/sql/rf/TileUDT.scala#L83-L89)
 essentially two specializations: logically, a "RasterRefTile" (lazy `Tile`) and a "RealizedTile" (eager `Tile`). Once the plan identifies the need for pixels/cells, the "RasterRefTile" is converted to a "RealizedTile".
   2. The structure of a `TileUDT` has evolved and is still evolving. We need information hiding to keep users from depending on its internal structure, and to give ourselves backwards-compatible migration paths that `Encoder`s alone cannot provide.
   3. Serialization to/from Parquet is UDT-aware, so reading a RasterFrames Parquet file maintains type coherence in a way `Encoder`s alone would not.
   4. Serialization to/from PySpark, while extremely expensive (because Spark's 
Arrow encoder doesn't support UDTs), does allow for nice user-facing ergonomics 
whereby we can have a first-class `Tile` Python class mirroring the JVM 
instance that, again, hides the nasty details of the raster structure while 
providing NumPy and Pandas interoperability. I don't think we'd be able to provide the same ergonomics if we relied on `Encoder`s alone.
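   The lazy/eager split described in point 1 can be sketched roughly as follows (names are illustrative only; RasterFrames' actual hierarchy lives behind `TileUDT` and is more involved):

```scala
// Illustrative sketch, NOT RasterFrames' actual API: a sealed hierarchy
// hides whether the pixel data has been read yet.
sealed trait Tile {
  def cells: Array[Double]
}

// Eager specialization: pixels are already in memory.
final case class RealizedTile(cells: Array[Double]) extends Tile

// Lazy specialization: holds only a thunk (e.g. wrapping a URI and band
// index) and defers the expensive read until cells are first demanded.
final class RasterRefTile(read: () => Array[Double]) extends Tile {
  lazy val cells: Array[Double] = read()
  def realize: RealizedTile = RealizedTile(cells)
}
```

   Because both shapes sit behind one `Tile` type, a plan can filter on metadata and only `realize` the tiles it actually needs.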
   
   I'll note the PySpark use case is also very pertinent to 
[interoperability](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py)
 between GeoMesa's `AbstractGeometryUDT` and Python's Shapely library.
   
   I hope all this helps, and doesn't just confuse things further. In short, we tried to avoid UDTs for quite a while in RasterFrames, but with one of our primary goals being usability by data scientists and analysts (not software engineers), we found UDTs to be the perfect mechanism for hiding the sausage making while still letting us provide reliability and stability in the execution engine.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


