[
https://issues.apache.org/jira/browse/SEDONA-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653509#comment-17653509
]
Jia Yu commented on SEDONA-227:
-------------------------------
My understanding is that this will only affect toGeoPandas() and collect(),
which gather a distributed dataset into a centralized data structure. Users
are expected to do this only on small datasets (cardinality below roughly
10K), so it should not have a noticeable impact on user experience.
[~dougdennis] [~Kontinuation] Please correct me if I am wrong.
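For context, a minimal sketch (mine, not from this ticket) of the boundary where the Python serde actually runs. It assumes the Sedona JARs are already on the Spark classpath and uses the 1.x-style SedonaRegistrator registration; adjust to your setup:
{code:python}
# Hedged sketch: the Python-side geometry serde only runs when geometry rows
# cross the JVM/Python boundary, e.g. via collect() on a (small) result set.
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator

spark = SparkSession.builder.appName("serde-boundary-demo").getOrCreate()
SedonaRegistrator.registerAll(spark)

df = spark.sql("SELECT ST_Point(1.0, 2.0) AS geom")
rows = df.collect()      # each geometry is deserialized into a Shapely object here
print(rows[0]["geom"])   # POINT (1 2)
{code}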
> Python SerDe Performance Degradation
> ------------------------------------
>
> Key: SEDONA-227
> URL: https://issues.apache.org/jira/browse/SEDONA-227
> Project: Apache Sedona
> Issue Type: Bug
> Reporter: Doug Dennis
> Priority: Major
>
> With the new geometry serde in Sedona, there appears to be a fairly
> significant performance regression on the Python side. The PR author
> acknowledged a regression in the PR, so some slowdown is expected; however,
> my trials are showing a regression that is sometimes far higher than the 2x
> noted in the PR.
> For serialization, I'm seeing points and short linestrings taking about twice
> as long (as expected). Unfortunately, small polygons are taking about 7-8
> times longer, while long linestrings and large polygons are taking 11-12
> times longer.
> The news isn't all bad, though. Short linestrings are consistently
> deserializing faster for me (about 25-30% faster), and points are
> deserializing at roughly the same rate as before. The other deserializations,
> however, show regressions that are more or less in line with the
> serialization results.
> To test this, I'm strictly comparing the new Sedona serialize and deserialize
> functions against Shapely's WKB dumps and loads functions. Below you will
> find my most recent results (which have been fairly consistent) as well as
> the Python code I used to generate them. I'm very open to critiques of my
> approach to measuring performance, and I hope that some of this performance
> loss is due to my own error.
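> (Note on reading the numbers: Factor is computed as (sedona_time - shapely_time) / shapely_time, i.e. the relative slowdown rather than the raw ratio. A quick check against the long line serialize trial below:)
> {code:python}
> shapely_seconds = 4.0879395
> sedona_seconds = 50.1508444
>
> factor = (sedona_seconds - shapely_seconds) / shapely_seconds
> print(factor)   # 11.268000639441949 -> Sedona takes ~12.3x Shapely's time
> {code}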
> Serialization results:
> {code}
> short line serialize trial:
>     Total Time (seconds):
>         Shapely: 1.7364926
>         Sedona: 5.4626863
>         Factor: 2.145816054730092
>     Average Time (nanoseconds):
>         Shapely: 8682.463
>         Sedona: 27313.4315
>         Factor: 2.145816054730092
>
> long line serialize trial:
>     Total Time (seconds):
>         Shapely: 4.0879395
>         Sedona: 50.1508444
>         Factor: 11.268000639441949
>     Average Time (nanoseconds):
>         Shapely: 40879.395
>         Sedona: 501508.444
>         Factor: 11.268000639441949
>
> point serialize trial:
>     Total Time (seconds):
>         Shapely: 4.7864782
>         Sedona: 13.0319586
>         Factor: 1.7226612251153677
>     Average Time (nanoseconds):
>         Shapely: 9572.9564
>         Sedona: 26063.9172
>         Factor: 1.7226612251153677
>
> small polygon serialize trial:
>     Total Time (seconds):
>         Shapely: 1.8339082
>         Sedona: 14.9376628
>         Factor: 7.145262014750793
>     Average Time (nanoseconds):
>         Shapely: 9169.541
>         Sedona: 74688.314
>         Factor: 7.145262014750793
>
> large polygon serialize trial:
>     Total Time (seconds):
>         Shapely: 2.3705298
>         Sedona: 30.4154897
>         Factor: 11.830671734225826
>     Average Time (nanoseconds):
>         Shapely: 23705.298
>         Sedona: 304154.897
>         Factor: 11.830671734225826
> {code}
> Deserialization results:
> {code}
> short line deserialize trial:
>     Total Time (seconds):
>         Shapely: 2.5166469
>         Sedona: 1.7909991
>         Factor: -0.28833913887562057
>     Average Time (nanoseconds):
>         Shapely: 12583.2345
>         Sedona: 8954.9955
>         Factor: -0.28833913887562057
>
> long line deserialize trial:
>     Total Time (seconds):
>         Shapely: 3.1818201
>         Sedona: 45.1792348
>         Factor: 13.199179519923204
>     Average Time (nanoseconds):
>         Shapely: 31818.201
>         Sedona: 451792.348
>         Factor: 13.199179519923204
>
> point deserialize trial:
>     Total Time (seconds):
>         Shapely: 5.7874722
>         Sedona: 5.3168965
>         Factor: -0.08130936680784402
>     Average Time (nanoseconds):
>         Shapely: 11574.9444
>         Sedona: 10633.793
>         Factor: -0.08130936680784402
>
> small polygon deserialize trial:
>     Total Time (seconds):
>         Shapely: 2.5079775
>         Sedona: 4.0216245
>         Factor: 0.6035329264317563
>     Average Time (nanoseconds):
>         Shapely: 12539.8875
>         Sedona: 20108.1225
>         Factor: 0.6035329264317563
>
> large polygon deserialize trial:
>     Total Time (seconds):
>         Shapely: 1.9952702
>         Sedona: 19.909025
>         Factor: 8.978109731704508
>     Average Time (nanoseconds):
>         Shapely: 19952.702
>         Sedona: 199090.25
>         Factor: 8.978109731704508
> {code}
> Python code used to generate results:
> {code:python}
> from sedona.utils.geometry_serde import serialize, deserialize
> from shapely.geometry import LineString, Point, Polygon
> from shapely.wkb import dumps, loads
> import time
>
>
> def run_serialize_trial(geom, number_iterations, name):
>     print(f"{name} serialize trial:")
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         dumps(geom)
>     shapely_time = time.perf_counter_ns() - start_time
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         serialize(geom)
>     sedona_time = time.perf_counter_ns() - start_time
>     print(f"\tTotal Time (seconds):")
>     print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
>     print(f"\tAverage Time (nanoseconds):")
>     print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: {sedona_time / number_iterations}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
>
>
> def run_deserialize_trial(geom, number_iterations, name):
>     print(f"{name} deserialize trial:")
>     shapely_serialized_geom = dumps(geom)
>     sedona_serialized_geom = serialize(geom)
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         loads(shapely_serialized_geom)
>     shapely_time = time.perf_counter_ns() - start_time
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         deserialize(sedona_serialized_geom)
>     sedona_time = time.perf_counter_ns() - start_time
>     print(f"\tTotal Time (seconds):")
>     print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
>     print(f"\tAverage Time (nanoseconds):")
>     print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: {sedona_time / number_iterations}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
>
>
> short_line_iterations = 200_000
> short_line = LineString([(10.0, 10.0), (20.0, 20.0)])
> long_line_iterations = 100_000
> long_line = LineString([(float(n), float(n)) for n in range(1000)])
> point_iterations = 500_000
> point = Point(12.3, 45.6)
> small_polygon_iterations = 200_000
> small_polygon = Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 20.0), (10.0, 10.0)])
> large_polygon_iterations = 100_000
> large_polygon = Polygon(
>     [(0.0, float(n * 10)) for n in range(100)]
>     + [(float(n * 10), 990.0) for n in range(100)]
>     + [(990.0, float(n * 10)) for n in reversed(range(100))]
>     + [(float(n * 10), 0.0) for n in reversed(range(100))]
> )
>
> run_serialize_trial(short_line, short_line_iterations, "short line")
> run_serialize_trial(long_line, long_line_iterations, "long line")
> run_serialize_trial(point, point_iterations, "point")
> run_serialize_trial(small_polygon, small_polygon_iterations, "small polygon")
> run_serialize_trial(large_polygon, large_polygon_iterations, "large polygon")
>
> run_deserialize_trial(short_line, short_line_iterations, "short line")
> run_deserialize_trial(long_line, long_line_iterations, "long line")
> run_deserialize_trial(point, point_iterations, "point")
> run_deserialize_trial(small_polygon, small_polygon_iterations, "small polygon")
> run_deserialize_trial(large_polygon, large_polygon_iterations, "large polygon")
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)