Kristin Cowalcijk created SEDONA-318:
----------------------------------------
Summary: SerDe for RasterUDT performs poorly
Key: SEDONA-318
URL: https://issues.apache.org/jira/browse/SEDONA-318
Project: Apache Sedona
Issue Type: Bug
Reporter: Kristin Cowalcijk
Attachments: image-2023-07-05-23-06-50-328.png
The SerDe for RasterUDT is barely usable. This is not a big problem for simple
queries such as {{RS_Envelope(RS_FromGeoTiff(content))}}, since the serde-aware
expressions eliminate all the serialization. However, we run into problems with
queries that do involve raster serialization, such as a join:
{code:python}
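from pyspark.sql.functions import col
# Joining moves raster rows between executors, so each raster is serialized.
df_geotiff.alias("a").join(df_geotiff2.alias("b"), col("a.id") == col("b.id")).show()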
{code}
Or simply collect a raster dataset:
{code:java}
dfGeoTiff.collect()
{code}
Each time we run such a query, the executors spawn several new threads. The job
may hang or raise strange exceptions when processing large raster datasets.
This is a thread dump captured on Spark UI after running several such queries:
!image-2023-07-05-23-06-50-328.png!
These threads were created by {{SerializableRenderedImage}}. When a
{{SerializableRenderedImage}} object is serialized, it launches a TCP server in
a newly spawned thread, and the deserialized copy connects to that server to
fetch the raster data. This avoids copying the raster data when serializing the
{{GridCoverage2D}} object, but it is the worst way to implement raster
serialization when we have to process a large number of rasters in batches.
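To make the mechanism concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the JAI source: the class {{SocketBackedPayload}}, its {{fetch()}} method, and the {{payload-server-}} thread name prefix are hypothetical, and the real {{SerializableRenderedImage}} is considerably more involved. The sketch only shows why every serialization leaves a server thread behind.

```java
import java.io.*;
import java.net.*;

// Hypothetical illustration of socket-backed serialization; NOT the JAI code.
class SocketBackedSerializationDemo {

    /** Payload that writes only a host/port when serialized, not its data. */
    static class SocketBackedPayload implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient byte[] data; // present only on the sending side
        private String host;           // where the deserialized copy fetches data
        private int port;

        SocketBackedPayload(byte[] data) { this.data = data; }

        // Serialization hook: start a server thread, then write host and port.
        private void writeObject(ObjectOutputStream out) throws IOException {
            ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
            host = server.getInetAddress().getHostAddress();
            port = server.getLocalPort();
            byte[] snapshot = data;
            Thread t = new Thread(() -> serve(server, snapshot), "payload-server-" + port);
            t.setDaemon(true);
            t.start();
            out.defaultWriteObject();
        }

        // Runs until the server socket fails; nothing in this sketch stops it,
        // so one thread per serialization stays alive.
        private static void serve(ServerSocket server, byte[] bytes) {
            try (ServerSocket s = server) {
                while (true) {
                    try (Socket client = s.accept();
                         DataOutputStream out = new DataOutputStream(client.getOutputStream())) {
                        out.writeInt(bytes.length);
                        out.write(bytes);
                    }
                }
            } catch (IOException e) {
                // Server socket closed or broken: the thread ends here.
            }
        }

        /** On the deserialized copy, connect back to the server to get the data. */
        byte[] fetch() throws IOException {
            if (data != null) return data;
            try (Socket socket = new Socket(host, port);
                 DataInputStream in = new DataInputStream(socket.getInputStream())) {
                byte[] bytes = new byte[in.readInt()];
                in.readFully(bytes);
                data = bytes;
                return data;
            }
        }
    }

    /** Serialize and deserialize in memory, as a shuffle or collect would. */
    static SocketBackedPayload roundTrip(SocketBackedPayload p)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(p);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            return (SocketBackedPayload) in.readObject();
        }
    }

    /** Count the live server threads started by writeObject. */
    static long serverThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith("payload-server-"))
                .count();
    }

    public static void main(String[] args) throws Exception {
        long before = serverThreads();
        for (int i = 0; i < 5; i++) {
            SocketBackedPayload copy = roundTrip(new SocketBackedPayload(new byte[1024]));
            copy.fetch();
        }
        System.out.println("server threads left behind: " + (serverThreads() - before));
    }
}
```

In this sketch the serialized form stays tiny, but the cost moves to one listening socket and one thread per serialized object, which is what hurts when rasters are processed in batches.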
{{SerializableRenderedImage}} is also buggy. It tracks the reference count of
serialized objects in {{remoteReferenceCount}}. However, the reference counting
mechanism is not correctly implemented, so it leaks memory.
We may want to create {{SerializableRenderedImage}} objects with
{{useDeepCopy = true}} to avoid these problems, but this introduces a new
problem: the finalizer of {{SerializableRenderedImage}} always connects to the
server to decrement the remote reference count, even though there is no
"server" in deep copy mode. The finalizer raises a large number of exceptions,
which is quite annoying:
{code}
INFO: IOException occurs when open the streams of the socket.
javax.media.jai.util.ImagingException: IOException occurs when open the streams of the socket.
    at javax.media.jai.remote.SerializableRenderedImage.closeClient(SerializableRenderedImage.java:1117)
    at javax.media.jai.remote.SerializableRenderedImage.dispose(SerializableRenderedImage.java:1314)
    at javax.media.jai.remote.SerializableRenderedImage.finalize(SerializableRenderedImage.java:1259)
    at java.base/java.lang.System$2.invokeFinalize(System.java:2125)
    at java.base/java.lang.ref.Finalizer.runFinalizer(Finalizer.java:87)
    at java.base/java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:171)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at java.base/java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2893)
    at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2909)
    at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3406)
    at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:932)
    at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:375)
    at javax.media.jai.remote.SerializableRenderedImage.closeClient(SerializableRenderedImage.java:1115)
    ... 5 more
Caused by:
java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at java.base/java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2893)
    at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2909)
    at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3406)
    at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:932)
    at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:375)
    at javax.media.jai.remote.SerializableRenderedImage.closeClient(SerializableRenderedImage.java:1115)
    at javax.media.jai.remote.SerializableRenderedImage.dispose(SerializableRenderedImage.java:1314)
    at javax.media.jai.remote.SerializableRenderedImage.finalize(SerializableRenderedImage.java:1259)
    at java.base/java.lang.System$2.invokeFinalize(System.java:2125)
    at java.base/java.lang.ref.Finalizer.runFinalizer(Finalizer.java:87)
    at java.base/java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:171)
{code}
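One possible direction, sketched here as an assumption rather than as the fix Sedona adopted, is to copy the pixel data eagerly into a plain byte array, so that no socket, server thread, or finalizer is involved. The class {{DeepCopyRasterSerdeDemo}} and its wire layout (width, height, band count, then every sample as a double) are hypothetical. The sketch uses only {{java.awt.image}} and ignores everything else a {{GridCoverage2D}} carries, such as the CRS and the grid-to-world transform.

```java
import java.awt.image.BandedSampleModel;
import java.awt.image.DataBuffer;
import java.awt.image.Raster;
import java.awt.image.WritableRaster;
import java.io.*;

// Hypothetical sketch of eager (deep copy) raster serialization.
class DeepCopyRasterSerdeDemo {

    /** Copy all samples out of the raster into a self-contained byte array. */
    static byte[] serialize(Raster raster) throws IOException {
        int w = raster.getWidth();
        int h = raster.getHeight();
        int bands = raster.getNumBands();
        // Samples come back pixel by pixel, with all bands of a pixel together.
        double[] samples =
            raster.getPixels(raster.getMinX(), raster.getMinY(), w, h, (double[]) null);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(w);
            out.writeInt(h);
            out.writeInt(bands);
            for (double s : samples) {
                out.writeDouble(s);
            }
        }
        return buf.toByteArray();
    }

    /** Rebuild a raster at origin (0, 0); the original offset is not kept. */
    static WritableRaster deserialize(byte[] bytes) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int w = in.readInt();
            int h = in.readInt();
            int bands = in.readInt();
            double[] samples = new double[w * h * bands];
            for (int i = 0; i < samples.length; i++) {
                samples[i] = in.readDouble();
            }
            // Assumption: store every band as double, whatever the source type was.
            WritableRaster raster = Raster.createWritableRaster(
                new BandedSampleModel(DataBuffer.TYPE_DOUBLE, w, h, bands), null);
            raster.setPixels(0, 0, w, h, samples);
            return raster;
        }
    }

    public static void main(String[] args) throws IOException {
        WritableRaster src = Raster.createWritableRaster(
            new BandedSampleModel(DataBuffer.TYPE_DOUBLE, 4, 3, 2), null);
        src.setSample(2, 1, 0, 12.5);
        Raster copy = deserialize(serialize(src));
        System.out.println("sample after round trip: " + copy.getSampleDouble(2, 1, 0));
    }
}
```

The trade-off is the one named above: the data is copied on every serialization, but the serialized form is complete on its own and the deserialized raster needs no cleanup hook.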