Hi Oisin,

Spark / Sedona by design is not meant to generate a single file from an RDD, but you can do this in Sedona by repartitioning. Use repartition(1) or coalesce(1) so that the resulting RDD has only 1 partition, then call saveAsGeoJSON.
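
A rough sketch of those two steps in Python, in case it helps. The helper name save_as_single_geojson is made up for illustration, and it assumes the Python SpatialRDD lets you read and reassign rawSpatialRDD the same way the Scala/Java API does. Please check that against your Sedona version before relying on it.

```python
def save_as_single_geojson(spatial_rdd, output_path):
    """Collapse a SpatialRDD to one partition, then write it as GeoJSON.

    Hypothetical helper, not part of Sedona.
    Assumption: spatial_rdd.rawSpatialRDD can be read and reassigned,
    and spatial_rdd.saveAsGeoJSON(path) writes the current raw RDD.
    """
    # repartition(1) shuffles every record into a single partition,
    # so the whole dataset must fit on one executor.
    single_partition = spatial_rdd.rawSpatialRDD.repartition(1)

    # Put the one-partition RDD back so saveAsGeoJSON writes from it.
    spatial_rdd.rawSpatialRDD = single_partition

    # With one partition, the output folder holds a single part file.
    spatial_rdd.saveAsGeoJSON(output_path)
```

You would call it as save_as_single_geojson(my_rdd, "s3a://your-bucket/your-prefix"), where my_rdd and the path are placeholders for your own SpatialRDD and S3 location.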
The output will be a single folder with a single file inside.

Note that:
(1) If your RDD is huge, repartitioning it to 1 partition might crash the cluster, since it puts all the data on a single machine.
(2) Use repartition(1) if possible, because some users report that coalesce(1) leads to missing results.

Thanks,
Jia

On Fri, Aug 12, 2022 at 12:46 PM Bates, Oisin <[email protected]> wrote:

> Hi,
> I have been using Sedona lately and encountered a specific use case that I
> believe is not currently supported.
>
> Currently, we are using Python and writing our output to an Amazon
> S3 bucket via Sedona's saveAsGeoJSON()<
> https://sedona.apache.org/tutorial/core-python/#save-to-permanent-storage>
> function. The default here is to save a partitioned/distributed file.
>
> Is it realistic to consider the option to write the GeoJSON output as a
> single file, or am I overlooking something fundamental in Sedona Core? I
> was thinking that something similar to pyspark.sql.DataFrame.coalesce<
> https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.coalesce.html>
> might be the most logical implementation?
>
> If my thoughts here seem reasonable, I'm happy to create a Jira ticket
> also. Appreciate your time and help on this.
> Best,
> Oisín
