[ 
https://issues.apache.org/jira/browse/SEDONA-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653509#comment-17653509
 ] 

Jia Yu commented on SEDONA-227:
-------------------------------

My understanding is that this will only affect the following functions: 
toGeoPandas() and collect() which collect a distributed dataset to a 
centralized data structure. The users are supposed to do this on a small 
dataset (cardinality < around 10K) and hence this does not have noticeable 
impact on use experience. [~dougdennis] [~Kontinuation]  Please correct me if I 
am wrong.

> Python SerDe Performance Degradation
> ------------------------------------
>
>                 Key: SEDONA-227
>                 URL: https://issues.apache.org/jira/browse/SEDONA-227
>             Project: Apache Sedona
>          Issue Type: Bug
>            Reporter: Doug Dennis
>            Priority: Major
>
> With the new geometry serde in Sedona, there appears to be a fairly 
> significant performance regression on the python side. The PR's author 
> acknowledged a regression in the PR so this is expected, however my trials 
> are showing a regression that is sometimes far higher than the 2x noted in 
> the PR.
> For serialization, I'm seeing points and short linestrings taking about twice 
> as long (as expected). Unfortunately, small polygons are taking about 7-8 
> times longer while long linestrings and large polygons are taking between 
> 11-12 times longer.
> The news isn't all bad though. For me, short linestrings are consistently 
> deserializing faster (about 25-30% faster) and points are deserializing at 
> roughly the same rate as before. The other deserializations show regressions 
> that are more or less in line with the results for serialization though.
> To test this, I'm strictly comparing the new serialize and deserialize sedona 
> functions against shapely's wkb loads and dumps functions. Below you will 
> find my most recent results (which have been fairly consistent) as well as 
> the python code I used to generate it. I'm very open to critiques of my 
> approach to measuring performance, and hope that some of this performance 
> loss is due to my own error.
> Serialization results:
> {code:java}
> short line serialize trial:
>         Total Time (seconds):
>                 Shapely: 1.7364926
>                 Sedona: 5.4626863
>                 Factor: 2.145816054730092        
>         Average Time (nanoseconds):
>                 Shapely: 8682.463
>                 Sedona: 27313.4315
>                 Factor: 2.145816054730092
> long line serialize trial:
>         Total Time (seconds):
>                 Shapely: 4.0879395
>                 Sedona: 50.1508444
>                 Factor: 11.268000639441949
>         Average Time (nanoseconds):
>                 Shapely: 40879.395
>                 Sedona: 501508.444
>                 Factor: 11.268000639441949
> point serialize trial:
>         Total Time (seconds):
>                 Shapely: 4.7864782
>                 Sedona: 13.0319586
>                 Factor: 1.7226612251153677
>         Average Time (nanoseconds):
>                 Shapely: 9572.9564
>                 Sedona: 26063.9172
>                 Factor: 1.7226612251153677
> small polygon serialize trial:
>         Total Time (seconds):
>                 Shapely: 1.8339082
>                 Sedona: 14.9376628
>                 Factor: 7.145262014750793
>         Average Time (nanoseconds):
>                 Shapely: 9169.541
>                 Sedona: 74688.314
>                 Factor: 7.145262014750793
> large polygon serialize trial:
>         Total Time (seconds):
>                 Shapely: 2.3705298
>                 Sedona: 30.4154897
>                 Factor: 11.830671734225826
>         Average Time (nanoseconds):
>                 Shapely: 23705.298
>                 Sedona: 304154.897
>                 Factor: 11.830671734225826 {code}
> Deserialization results:
> {code:java}
> short line deserialize trial:
>         Total Time (seconds):
>                 Shapely: 2.5166469
>                 Sedona: 1.7909991
>                 Factor: -0.28833913887562057
>         Average Time (nanoseconds):
>                 Shapely: 12583.2345
>                 Sedona: 8954.9955
>                 Factor: -0.28833913887562057
> long line deserialize trial:
>         Total Time (seconds):
>                 Shapely: 3.1818201
>                 Sedona: 45.1792348
>                 Factor: 13.199179519923204
>         Average Time (nanoseconds):
>                 Shapely: 31818.201
>                 Sedona: 451792.348
>                 Factor: 13.199179519923204
> point deserialize trial:
>         Total Time (seconds):
>                 Shapely: 5.7874722
>                 Sedona: 5.3168965
>                 Factor: -0.08130936680784402
>         Average Time (nanoseconds):
>                 Shapely: 11574.9444
>                 Sedona: 10633.793
>                 Factor: -0.08130936680784402
> small polygon deserialize trial:
>         Total Time (seconds):
>                 Shapely: 2.5079775
>                 Sedona: 4.0216245
>                 Factor: 0.6035329264317563
>         Average Time (nanoseconds):
>                 Shapely: 12539.8875
>                 Sedona: 20108.1225
>                 Factor: 0.6035329264317563
> large polygon deserialize trial:
>         Total Time (seconds):
>                 Shapely: 1.9952702
>                 Sedona: 19.909025
>                 Factor: 8.978109731704508
>         Average Time (nanoseconds):
>                 Shapely: 19952.702
>                 Sedona: 199090.25
>                 Factor: 8.978109731704508 {code}
> Python code used to generate results:
> {code:java}
> from sedona.utils.geometry_serde import serialize, deserialize
> from shapely.geometry import LineString, Point, Polygon
> from shapely.wkb import dumps, loads
> import time
> def run_serialize_trial(geom, number_iterations, name):
>     print(f"{name} serialize trial:")
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         dumps(geom)
>     shapely_time = time.perf_counter_ns() - start_time
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         serialize(geom)
>     sedona_time = time.perf_counter_ns() - start_time
>     print(f"\tTotal Time (seconds):")
>     print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 
> 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
>     print(f"\tAverage Time (nanoseconds):")
>     print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: 
> {sedona_time / number_iterations}\n\t\tFactor: {(sedona_time - shapely_time) 
> / shapely_time}\n")
> def run_deserialize_trial(geom, number_iterations, name):
>     print(f"{name} deserialize trial:")
>     shapely_serialized_geom = dumps(geom)
>     sedona_serialized_geom = serialize(geom)
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         loads(shapely_serialized_geom)
>     shapely_time = time.perf_counter_ns() - start_time
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         deserialize(sedona_serialized_geom)
>     sedona_time = time.perf_counter_ns() - start_time
>     print(f"\tTotal Time (seconds):")
>     print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 
> 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
>     print(f"\tAverage Time (nanoseconds):")
>     print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: 
> {sedona_time / number_iterations}\n\t\tFactor: {(sedona_time - shapely_time) 
> / shapely_time}\n")
> short_line_iterations = 200_000
> short_line = LineString([(10.0, 10.0), (20.0, 20.0)])
> long_line_iterations = 100_000
> long_line = LineString([(float(n), float(n)) for n in range(1000)])
> point_iterations = 500_000
> point = Point(12.3, 45.6)
> small_polygon_iterations = 200_000
> small_polygon = Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 
> 20.0), (10.0, 10.0)])
> large_polygon_iterations = 100_000
> large_polygon = Polygon(
>     [(0.0, float(n * 10)) for n in range(100)]
>     + [(float(n * 10), 990.0) for n in range(100)]
>     + [(990.0, float(n * 10)) for n in reversed(range(100))]
>     + [(float(n * 10), 0.0) for n in reversed(range(100))]
> )
> run_serialize_trial(short_line, short_line_iterations, "short line")
> run_serialize_trial(long_line, long_line_iterations, "long line")
> run_serialize_trial(point, point_iterations, "point")
> run_serialize_trial(small_polygon, small_polygon_iterations, "small polygon")
> run_serialize_trial(large_polygon, large_polygon_iterations, "large polygon")
> run_deserialize_trial(short_line, short_line_iterations, "short line")
> run_deserialize_trial(long_line, long_line_iterations, "long line")
> run_deserialize_trial(point, point_iterations, "point")
> run_deserialize_trial(small_polygon, small_polygon_iterations, "small 
> polygon")
> run_deserialize_trial(large_polygon, large_polygon_iterations, "large 
> polygon"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to