[ https://issues.apache.org/jira/browse/SEDONA-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653520#comment-17653520 ]

Kristin Cowalcijk commented on SEDONA-227:
------------------------------------------

It significantly impacts the performance of processing large datasets using 
Python UDF. I did some end-to-end tests for this use case. The UDF I'm testing 
is a user-defined centroid function and the queries are as follows:
{code:python}
# Python version
from pyspark.sql.functions import col, count, udf
from sedona.sql.types import GeometryType

geom_centroid = udf(lambda g: g.centroid, GeometryType())
df_geom.select(count(geom_centroid(col("geom")).alias("geom2"))).show()
{code}
{code:scala}
// Scala version
import org.apache.spark.sql.functions.{count, udf}
import org.locationtech.jts.geom.Geometry

val geom_centroid = udf((x: Geometry) => x.getCentroid().asInstanceOf[Geometry])
df_geom.select(count(geom_centroid($"geom").as("geom2"))).show()
{code}
This query was tested on various Sedona versions using two large datasets (OSM (partial) for points and MS Buildings (partial) for polygons); here are the results:
||Sedona version||Test case||Time taken (Points)||Time taken (Polygons)||
|1.3.0-incubating|Scala UDF|13s|8s|
|master (with new serde)|Scala UDF|8s|4s|
|1.3.0-incubating|Python UDF|15min|6.5min|
|[PR734|https://github.com/apache/incubator-sedona/pull/734] [1]|Python UDF|13min|3.9min|
|master (with new serde)|Python UDF|19min|8.9min|

We can see that the Python UDF is painfully slow compared to the Scala UDF, and the 
new serde made it even slower. This also suggests that Python UDFs are not 
suitable for computation-intensive tasks running in production, though they are 
useful for running experiments on small to medium-sized datasets. After doing 
some profiling, we found that only a small portion of the running time (10%) was 
spent in libgeos even when using the WKB serde; the overhead surrounding the 
actual geometry computation was enormous. We expect a significant improvement 
after re-implementing the geometry serde properly.
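As a reference point, here is a minimal standalone sketch (not the exact profiling setup used above, and it runs outside Spark, so it ignores the pickling and batching overhead of the Python worker) comparing the time spent in the actual libgeos computation against the WKB round trip that surrounds it in a Python UDF:
{code:python}
import time

from shapely.geometry import Polygon
from shapely.wkb import dumps, loads

# A small polygon standing in for a typical building footprint.
polygon = Polygon([(0.0, 0.0), (30.0, 0.0), (30.0, 20.0), (15.0, 25.0), (0.0, 20.0), (0.0, 0.0)])
payload = dumps(polygon)
iterations = 200_000

# Geometry computation only (the work that actually happens in libgeos).
start = time.perf_counter_ns()
for _ in range(iterations):
    polygon.centroid
centroid_time = time.perf_counter_ns() - start

# Full per-row path of the UDF: deserialize, compute the centroid, serialize back.
start = time.perf_counter_ns()
for _ in range(iterations):
    dumps(loads(payload).centroid)
round_trip_time = time.perf_counter_ns() - start

print(f"centroid share of WKB round trip: {centroid_time / round_trip_time:.0%}")
{code}
In the real pipeline the share is even smaller, because every batch of rows also has to cross the JVM/Python worker boundary.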

[1] PR734 improved Python serde performance by avoiding a [costly array to bytes conversion|https://github.com/apache/incubator-sedona/pull/734/files#diff-f13aa16da1bd085966b8d3c0af784a88083ef5cbe9e447c328c5d4eda7ae4afaL46].
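As a rough illustration of why that kind of conversion matters (a generic micro-benchmark, not the exact code path PR734 changed): turning a coordinate buffer into bytes with one bulk call is far cheaper than packing it element by element through the interpreter.
{code:python}
import struct
import time
from array import array

# Hypothetical coordinate buffer: 10,000 doubles.
coords = array('d', (float(n) for n in range(10_000)))
iterations = 200

# Bulk conversion: a single copy of the underlying buffer.
start = time.perf_counter_ns()
for _ in range(iterations):
    coords.tobytes()
bulk_time = time.perf_counter_ns() - start

# Element-wise conversion: every double is packed by a separate Python call.
start = time.perf_counter_ns()
for _ in range(iterations):
    b"".join(struct.pack("d", c) for c in coords)
elementwise_time = time.perf_counter_ns() - start

print(f"element-wise / bulk: {elementwise_time / bulk_time:.0f}x")
{code}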

> Python SerDe Performance Degradation
> ------------------------------------
>
>                 Key: SEDONA-227
>                 URL: https://issues.apache.org/jira/browse/SEDONA-227
>             Project: Apache Sedona
>          Issue Type: Bug
>            Reporter: Doug Dennis
>            Priority: Major
>
> With the new geometry serde in Sedona, there appears to be a fairly 
> significant performance regression on the Python side. The PR's author 
> acknowledged a regression in the PR, so this is expected; however, my trials 
> are showing a regression that is sometimes far higher than the 2x noted in 
> the PR.
> For serialization, I'm seeing points and short linestrings taking about twice 
> as long (as expected). Unfortunately, small polygons are taking about 7-8 
> times longer, while long linestrings and large polygons are taking 11-12 
> times longer.
> The news isn't all bad though. For me, short linestrings are consistently 
> deserializing faster (about 25-30% faster) and points are deserializing at 
> roughly the same rate as before. The other deserializations, though, show 
> regressions that are more or less in line with the serialization results.
> To test this, I'm strictly comparing the new Sedona serialize and deserialize 
> functions against Shapely's WKB loads and dumps functions. Below you will 
> find my most recent results (which have been fairly consistent) as well as 
> the Python code I used to generate them. I'm very open to critiques of my 
> approach to measuring performance, and hope that some of this performance 
> loss is due to my own error.
> Serialization results:
> {code}
> short line serialize trial:
>         Total Time (seconds):
>                 Shapely: 1.7364926
>                 Sedona: 5.4626863
>                 Factor: 2.145816054730092        
>         Average Time (nanoseconds):
>                 Shapely: 8682.463
>                 Sedona: 27313.4315
>                 Factor: 2.145816054730092
> long line serialize trial:
>         Total Time (seconds):
>                 Shapely: 4.0879395
>                 Sedona: 50.1508444
>                 Factor: 11.268000639441949
>         Average Time (nanoseconds):
>                 Shapely: 40879.395
>                 Sedona: 501508.444
>                 Factor: 11.268000639441949
> point serialize trial:
>         Total Time (seconds):
>                 Shapely: 4.7864782
>                 Sedona: 13.0319586
>                 Factor: 1.7226612251153677
>         Average Time (nanoseconds):
>                 Shapely: 9572.9564
>                 Sedona: 26063.9172
>                 Factor: 1.7226612251153677
> small polygon serialize trial:
>         Total Time (seconds):
>                 Shapely: 1.8339082
>                 Sedona: 14.9376628
>                 Factor: 7.145262014750793
>         Average Time (nanoseconds):
>                 Shapely: 9169.541
>                 Sedona: 74688.314
>                 Factor: 7.145262014750793
> large polygon serialize trial:
>         Total Time (seconds):
>                 Shapely: 2.3705298
>                 Sedona: 30.4154897
>                 Factor: 11.830671734225826
>         Average Time (nanoseconds):
>                 Shapely: 23705.298
>                 Sedona: 304154.897
>                 Factor: 11.830671734225826 {code}
> Deserialization results:
> {code}
> short line deserialize trial:
>         Total Time (seconds):
>                 Shapely: 2.5166469
>                 Sedona: 1.7909991
>                 Factor: -0.28833913887562057
>         Average Time (nanoseconds):
>                 Shapely: 12583.2345
>                 Sedona: 8954.9955
>                 Factor: -0.28833913887562057
> long line deserialize trial:
>         Total Time (seconds):
>                 Shapely: 3.1818201
>                 Sedona: 45.1792348
>                 Factor: 13.199179519923204
>         Average Time (nanoseconds):
>                 Shapely: 31818.201
>                 Sedona: 451792.348
>                 Factor: 13.199179519923204
> point deserialize trial:
>         Total Time (seconds):
>                 Shapely: 5.7874722
>                 Sedona: 5.3168965
>                 Factor: -0.08130936680784402
>         Average Time (nanoseconds):
>                 Shapely: 11574.9444
>                 Sedona: 10633.793
>                 Factor: -0.08130936680784402
> small polygon deserialize trial:
>         Total Time (seconds):
>                 Shapely: 2.5079775
>                 Sedona: 4.0216245
>                 Factor: 0.6035329264317563
>         Average Time (nanoseconds):
>                 Shapely: 12539.8875
>                 Sedona: 20108.1225
>                 Factor: 0.6035329264317563
> large polygon deserialize trial:
>         Total Time (seconds):
>                 Shapely: 1.9952702
>                 Sedona: 19.909025
>                 Factor: 8.978109731704508
>         Average Time (nanoseconds):
>                 Shapely: 19952.702
>                 Sedona: 199090.25
>                 Factor: 8.978109731704508 {code}
> Python code used to generate results:
> {code:python}
> from sedona.utils.geometry_serde import serialize, deserialize
> from shapely.geometry import LineString, Point, Polygon
> from shapely.wkb import dumps, loads
> import time
> def run_serialize_trial(geom, number_iterations, name):
>     print(f"{name} serialize trial:")
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         dumps(geom)
>     shapely_time = time.perf_counter_ns() - start_time
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         serialize(geom)
>     sedona_time = time.perf_counter_ns() - start_time
>     print(f"\tTotal Time (seconds):")
>     print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 
> 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
>     print(f"\tAverage Time (nanoseconds):")
>     print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: 
> {sedona_time / number_iterations}\n\t\tFactor: {(sedona_time - shapely_time) 
> / shapely_time}\n")
> def run_deserialize_trial(geom, number_iterations, name):
>     print(f"{name} deserialize trial:")
>     shapely_serialized_geom = dumps(geom)
>     sedona_serialized_geom = serialize(geom)
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         loads(shapely_serialized_geom)
>     shapely_time = time.perf_counter_ns() - start_time
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         deserialize(sedona_serialized_geom)
>     sedona_time = time.perf_counter_ns() - start_time
>     print(f"\tTotal Time (seconds):")
>     print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 
> 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
>     print(f"\tAverage Time (nanoseconds):")
>     print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: 
> {sedona_time / number_iterations}\n\t\tFactor: {(sedona_time - shapely_time) 
> / shapely_time}\n")
> short_line_iterations = 200_000
> short_line = LineString([(10.0, 10.0), (20.0, 20.0)])
> long_line_iterations = 100_000
> long_line = LineString([(float(n), float(n)) for n in range(1000)])
> point_iterations = 500_000
> point = Point(12.3, 45.6)
> small_polygon_iterations = 200_000
> small_polygon = Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])
> large_polygon_iterations = 100_000
> large_polygon = Polygon(
>     [(0.0, float(n * 10)) for n in range(100)]
>     + [(float(n * 10), 990.0) for n in range(100)]
>     + [(990.0, float(n * 10)) for n in reversed(range(100))]
>     + [(float(n * 10), 0.0) for n in reversed(range(100))]
> )
> run_serialize_trial(short_line, short_line_iterations, "short line")
> run_serialize_trial(long_line, long_line_iterations, "long line")
> run_serialize_trial(point, point_iterations, "point")
> run_serialize_trial(small_polygon, small_polygon_iterations, "small polygon")
> run_serialize_trial(large_polygon, large_polygon_iterations, "large polygon")
> run_deserialize_trial(short_line, short_line_iterations, "short line")
> run_deserialize_trial(long_line, long_line_iterations, "long line")
> run_deserialize_trial(point, point_iterations, "point")
> run_deserialize_trial(small_polygon, small_polygon_iterations, "small polygon")
> run_deserialize_trial(large_polygon, large_polygon_iterations, "large polygon"){code}


