Re: [PR] script to support multi-file parquet generation [sedona-spatialbench]

via GitHub Wed, 24 Sep 2025 13:06:53 -0700


Copilot commented on code in PR #43:
URL: 
https://github.com/apache/sedona-spatialbench/pull/43#discussion_r2376936379



##########
tools/generate_data.py:
##########
@@ -0,0 +1,175 @@
+#!/usr/bin/env python3
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+
+import argparse
+import concurrent

Review Comment:
   The `concurrent` module import on line 20 is redundant since 
`concurrent.futures` is imported on line 28 and provides all the needed 
functionality. Remove the unused `concurrent` import.
   ```suggestion
   
   ```



##########
tools/generate_data.py:
##########
@@ -0,0 +1,175 @@
+#!/usr/bin/env python3
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+
+import argparse
+import concurrent
+import logging
+import os
+import shutil
+import subprocess
+from math import ceil, log
+from pathlib import Path
+from tempfile import mkdtemp
+import concurrent.futures

Review Comment:
   The `concurrent` module import on line 20 is redundant since 
`concurrent.futures` is imported on line 28 and provides all the needed 
functionality. Remove the unused `concurrent` import.



##########
tools/generate_data.py:
##########
@@ -0,0 +1,175 @@
+#!/usr/bin/env python3
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+
+import argparse
+import concurrent
+import logging
+import os
+import shutil
+import subprocess
+from math import ceil, log
+from pathlib import Path
+from tempfile import mkdtemp
+import concurrent.futures
+
+
+def main():
+    # take some args: output dir, scale factor, mb per file
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--output-dir", type=str, 
default=mkdtemp(prefix="spatialbench-data-"),
+                        help="Output directory for generated data")
+    parser.add_argument("--scale-factor", type=int, default=1, help="Scale 
factor for data generation")
+    parser.add_argument("--mb-per-file", type=int, default=256, help="Rough 
megabytes per output file")
+    args = parser.parse_args()
+
+    generate_data(args.scale_factor, args.mb_per_file, args.output_dir)
+    print(f"Data generated at {args.output_dir}")
+
+
+def generate_data(scale_factor: int, target_mb: int, output_dir: str) -> 
dict[str, str]:
+    """Generate SpatialBench data using spatialbench-cli and return 
table->filepath mapping"""
+    # Aim for ~256 MB per partition file (on-disk). Use rough SF=1 size 
estimates per format.
+    # These are estimates; actual sizes vary by codec/implementation.
+
+    tables = [
+        "building",
+        "customer",
+        "driver",
+        "trip",
+        "vehicle",
+        "zone",
+    ]
+
+    size_mb_sf1 = {
+        # Values from testing sf=1
+        "building": 1.5,
+        "customer": 1.7,
+        "driver": 30.0 / 1024,
+        "trip": 280.0,
+        "vehicle": 4.0 / 1024,
+        # step functioned table size
+        "zone": 141.7,
+    }
+
+    # Compute partitions per table by scaling linearly with SF and dividing by 
target size.
+    def parts_for(table: str) -> int:
+        size_mb = size_mb_sf1.get(table, 1.0) * float(scale_factor)
+        return max(1, int(ceil(size_mb / target_mb)))
+
+    num_partitions = {table: parts_for(table) for table in tables}
+
+    # Zone table doesn't scale linearly. It has a step function.
+    if scale_factor < 10:
+        zone_size_mb = 141.7
+    elif scale_factor < 100:
+        zone_size_mb = 2.09 * 1024
+    elif scale_factor < 1000:
+        zone_size_mb = 5.68 * 1024
+    else:
+        # TODO this number is wrong, but we don't have data for >1000
+        zone_size_mb = 8.0 * 1024
+    num_partitions["zone"] = max(1, int(ceil(zone_size_mb / target_mb)))
+
+    # buildings scale sublinearly with sf: 20,000 × (1 + log₂(10)) rows
+    buildings_rows_per_mb = 13367.47  # did some empirical testing
+    building_size_mb = 20_000.0 * (1.0 + log(10, 2)) / buildings_rows_per_mb

Review Comment:
   The building size calculation uses a hardcoded value `log(10, 2)` instead of 
scaling with the actual scale factor. This should be `log(scale_factor, 2)` to 
properly calculate building sizes for different scale factors.
   ```suggestion
       building_size_mb = 20_000.0 * (1.0 + log(scale_factor, 2)) / 
buildings_rows_per_mb
   ```



##########
README.md:
##########
@@ -139,6 +139,18 @@ for PART in $(seq 1 4); do
 done
 ```
 
+#### Generate Multiple Parquet Files of Similar Size
+
+The generator cli itself supports generating multiple files via the `--parts` 
and `--part` flags. However, if you want
+to generate multiple files per table of roughly a specific size, you can use 
the provided script
+`tools/generate_data.py`.
+
+This algorithm is how data was generated for the benchmark results cited in 
the SedonaDB launch blog post.
+
+```bash
+tools/generate_data.py --scale 10 --mb-per-file 256 --output-dir sf10-parquet

Review Comment:
   The command line argument name is inconsistent with the script's actual 
argument name. The script expects `--scale-factor` but the example shows 
`--scale`. This should be `--scale-factor 10`.
   ```suggestion
   tools/generate_data.py --scale-factor 10 --mb-per-file 256 --output-dir 
sf10-parquet
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] script to support multi-file parquet generation [sedona-spatialbench]

Reply via email to