Martin Andersson created SEDONA-205:
---------------------------------------
Summary: Use BinaryType in GeometryUDT in Sedona Spark
Key: SEDONA-205
URL: https://issues.apache.org/jira/browse/SEDONA-205
Project: Apache Sedona
Issue Type: Improvement
Reporter: Martin Andersson
GeometryUDT currently uses ArrayType(ByteType()) as the serialized data type
for geometries. An ArrayType in Spark is an array of objects, not of
primitive values: every byte is boxed into a Byte object, and the array
stores a reference to that object. This adds significant overhead. The more
specialized BinaryType is backed by an array of primitive bytes.
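
As a rough illustration of the boxed-versus-primitive difference described
above, here is a minimal Python sketch. It is only an analogy on the Python
side, not a measurement of Spark's JVM internals: a list of ints stands in
for the ArrayType(ByteType()) representation, and a bytearray stands in for
BinaryType. The payload is made up for the example and is not real
serialized geometry data; the sizes assume CPython.

```python
import sys

# Hypothetical stand-in for a serialized geometry (not real Sedona output).
payload = bytes(range(256)) * 4  # 1024 bytes

# Boxed-style layout: the list stores one object reference per element.
# Values are shifted into the signed -128..127 range that ByteType uses.
boxed = [b - 256 if b >= 128 else b for b in payload]

# Primitive layout: one contiguous buffer, a single byte per element.
primitive = bytearray(payload)

# sys.getsizeof reports only the container itself, so the list figure
# does not even include the int objects that the references point to.
boxed_container = sys.getsizeof(boxed)
primitive_container = sys.getsizeof(primitive)

print("list of ints, container only:", boxed_container, "bytes")
print("bytearray:", primitive_container, "bytes")
```

Both containers hold the same 1024 values, but the list needs a full
pointer per element where the bytearray needs one byte.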
I ran a quick benchmark that chained several ST_ functions, with no joins.
With BinaryType, performance improved by roughly 30%.
The old Apache commons-codec bundled with sernetcdf needs to be fixed first.
Otherwise, Spark fails when calling encodeHexString(), as seen in
https://github.com/apache/incubator-sedona/pull/704