[ 
https://issues.apache.org/jira/browse/SEDONA-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653510#comment-17653510
 ] 

Doug Dennis commented on SEDONA-227:
------------------------------------

[~jiayu] It does impact collect calls. Most importantly to me, it also impacts 
+python UDFs.+ If a geometry column is getting passed to the UDF then it will 
deserialize and reserialize an entire dataframe. Additionally, this impacts 
workflows that create initial datasets in python and push them to spark (i.e. 
through createDataFrame).

[~Kontinuation] I'm excited to see that extension :)

If Jia is good with it, then I'll go ahead and see if I can get some 
improvements out of the current python implementation.

> Python SerDe Performance Degradation
> ------------------------------------
>
>                 Key: SEDONA-227
>                 URL: https://issues.apache.org/jira/browse/SEDONA-227
>             Project: Apache Sedona
>          Issue Type: Bug
>            Reporter: Doug Dennis
>            Priority: Major
>
> With the new geometry serde in Sedona, there appears to be a fairly 
> significant performance regression on the python side. The PR's author 
> acknowledged a regression in the PR so this is expected, however my trials 
> are showing a regression that is sometimes far higher than the 2x noted in 
> the PR.
> For serialization, I'm seeing points and short linestrings taking about twice 
> as long (as expected). Unfortunately, small polygons are taking about 7-8 
> times longer while long linestrings and large polygons are taking between 
> 11-12 times longer.
> The news isn't all bad though. For me, short linestrings are consistently 
> deserializing faster (about 25-30% faster) and points are deserializing at 
> roughly the same rate as before. The other deserializations show regressions 
> that are more or less in line with the results for serialization though.
> To test this, I'm strictly comparing the new serialize and deserialize sedona 
> functions against shapely's wkb loads and dumps functions. Below you will 
> find my most recent results (which have been fairly consistent) as well as 
> the python code I used to generate it. I'm very open to critiques of my 
> approach to measuring performance, and hope that some of this performance 
> loss is due to my own error.
> Serialization results:
> {code:java}
> short line serialize trial:
>         Total Time (seconds):
>                 Shapely: 1.7364926
>                 Sedona: 5.4626863
>                 Factor: 2.145816054730092        
>         Average Time (nanoseconds):
>                 Shapely: 8682.463
>                 Sedona: 27313.4315
>                 Factor: 2.145816054730092
> long line serialize trial:
>         Total Time (seconds):
>                 Shapely: 4.0879395
>                 Sedona: 50.1508444
>                 Factor: 11.268000639441949
>         Average Time (nanoseconds):
>                 Shapely: 40879.395
>                 Sedona: 501508.444
>                 Factor: 11.268000639441949
> point serialize trial:
>         Total Time (seconds):
>                 Shapely: 4.7864782
>                 Sedona: 13.0319586
>                 Factor: 1.7226612251153677
>         Average Time (nanoseconds):
>                 Shapely: 9572.9564
>                 Sedona: 26063.9172
>                 Factor: 1.7226612251153677
> small polygon serialize trial:
>         Total Time (seconds):
>                 Shapely: 1.8339082
>                 Sedona: 14.9376628
>                 Factor: 7.145262014750793
>         Average Time (nanoseconds):
>                 Shapely: 9169.541
>                 Sedona: 74688.314
>                 Factor: 7.145262014750793
> large polygon serialize trial:
>         Total Time (seconds):
>                 Shapely: 2.3705298
>                 Sedona: 30.4154897
>                 Factor: 11.830671734225826
>         Average Time (nanoseconds):
>                 Shapely: 23705.298
>                 Sedona: 304154.897
>                 Factor: 11.830671734225826 {code}
> Deserialization results:
> {code:java}
> short line deserialize trial:
>         Total Time (seconds):
>                 Shapely: 2.5166469
>                 Sedona: 1.7909991
>                 Factor: -0.28833913887562057
>         Average Time (nanoseconds):
>                 Shapely: 12583.2345
>                 Sedona: 8954.9955
>                 Factor: -0.28833913887562057
> long line deserialize trial:
>         Total Time (seconds):
>                 Shapely: 3.1818201
>                 Sedona: 45.1792348
>                 Factor: 13.199179519923204
>         Average Time (nanoseconds):
>                 Shapely: 31818.201
>                 Sedona: 451792.348
>                 Factor: 13.199179519923204
> point deserialize trial:
>         Total Time (seconds):
>                 Shapely: 5.7874722
>                 Sedona: 5.3168965
>                 Factor: -0.08130936680784402
>         Average Time (nanoseconds):
>                 Shapely: 11574.9444
>                 Sedona: 10633.793
>                 Factor: -0.08130936680784402
> small polygon deserialize trial:
>         Total Time (seconds):
>                 Shapely: 2.5079775
>                 Sedona: 4.0216245
>                 Factor: 0.6035329264317563
>         Average Time (nanoseconds):
>                 Shapely: 12539.8875
>                 Sedona: 20108.1225
>                 Factor: 0.6035329264317563
> large polygon deserialize trial:
>         Total Time (seconds):
>                 Shapely: 1.9952702
>                 Sedona: 19.909025
>                 Factor: 8.978109731704508
>         Average Time (nanoseconds):
>                 Shapely: 19952.702
>                 Sedona: 199090.25
>                 Factor: 8.978109731704508 {code}
> Python code used to generate results:
> {code:java}
> from sedona.utils.geometry_serde import serialize, deserialize
> from shapely.geometry import LineString, Point, Polygon
> from shapely.wkb import dumps, loads
> import time
> def run_serialize_trial(geom, number_iterations, name):
>     print(f"{name} serialize trial:")
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         dumps(geom)
>     shapely_time = time.perf_counter_ns() - start_time
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         serialize(geom)
>     sedona_time = time.perf_counter_ns() - start_time
>     print(f"\tTotal Time (seconds):")
>     print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 
> 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
>     print(f"\tAverage Time (nanoseconds):")
>     print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: 
> {sedona_time / number_iterations}\n\t\tFactor: {(sedona_time - shapely_time) 
> / shapely_time}\n")
> def run_deserialize_trial(geom, number_iterations, name):
>     print(f"{name} deserialize trial:")
>     shapely_serialized_geom = dumps(geom)
>     sedona_serialized_geom = serialize(geom)
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         loads(shapely_serialized_geom)
>     shapely_time = time.perf_counter_ns() - start_time
>     start_time = time.perf_counter_ns()
>     for _ in range(number_iterations):
>         deserialize(sedona_serialized_geom)
>     sedona_time = time.perf_counter_ns() - start_time
>     print(f"\tTotal Time (seconds):")
>     print(f"\t\tShapely: {shapely_time / 1e9}\n\t\tSedona: {sedona_time / 
> 1e9}\n\t\tFactor: {(sedona_time - shapely_time) / shapely_time}\n")
>     print(f"\tAverage Time (nanoseconds):")
>     print(f"\t\tShapely: {shapely_time / number_iterations}\n\t\tSedona: 
> {sedona_time / number_iterations}\n\t\tFactor: {(sedona_time - shapely_time) 
> / shapely_time}\n")
> short_line_iterations = 200_000
> short_line = LineString([(10.0, 10.0), (20.0, 20.0)])
> long_line_iterations = 100_000
> long_line = LineString([(float(n), float(n)) for n in range(1000)])
> point_iterations = 500_000
> point = Point(12.3, 45.6)
> small_polygon_iterations = 200_000
> small_polygon = Polygon([(10.0, 10.0), (20.0, 10.0), (20.0, 20.0), (10.0, 
> 20.0), (10.0, 10.0)])
> large_polygon_iterations = 100_000
> large_polygon = Polygon(
>     [(0.0, float(n * 10)) for n in range(100)]
>     + [(float(n * 10), 990.0) for n in range(100)]
>     + [(990.0, float(n * 10)) for n in reversed(range(100))]
>     + [(float(n * 10), 0.0) for n in reversed(range(100))]
> )
> run_serialize_trial(short_line, short_line_iterations, "short line")
> run_serialize_trial(long_line, long_line_iterations, "long line")
> run_serialize_trial(point, point_iterations, "point")
> run_serialize_trial(small_polygon, small_polygon_iterations, "small polygon")
> run_serialize_trial(large_polygon, large_polygon_iterations, "large polygon")
> run_deserialize_trial(short_line, short_line_iterations, "short line")
> run_deserialize_trial(long_line, long_line_iterations, "long line")
> run_deserialize_trial(point, point_iterations, "point")
> run_deserialize_trial(small_polygon, small_polygon_iterations, "small 
> polygon")
> run_deserialize_trial(large_polygon, large_polygon_iterations, "large 
> polygon"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to