Hi Oisin,

Spark / Sedona by design is not supposed to generate a single file from an
RDD, but you can do this in Sedona by repartitioning. Use repartition(1) or
coalesce(1) so that the resulting RDD has only 1 partition, then call
saveAsGeoJSON.

The output will be a single folder with a single file inside.

Note that: (1) If your RDD is huge, repartitioning it to 1 partition
might crash the cluster, since it puts all the data on a single machine.
(2) Use repartition(1) if possible, because some users report that
coalesce(1) leads to missing results.
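
A minimal sketch of the steps above in Python. It assumes the Python
SpatialRDD lets you read and assign rawSpatialRDD (the underlying RDD); please
check that against your Sedona version. The helper name
save_as_single_geojson is just something I made up for illustration.

```python
def save_as_single_geojson(spatial_rdd, output_path):
    """Collapse a SpatialRDD to one partition, then write it as GeoJSON.

    Assumption: spatial_rdd.rawSpatialRDD can be read and assigned, and
    spatial_rdd.saveAsGeoJSON(path) writes one part file per partition.
    """
    # repartition(1) shuffles all records into a single partition, so
    # every record ends up on one executor. Only do this if the data
    # fits comfortably on a single machine.
    spatial_rdd.rawSpatialRDD = spatial_rdd.rawSpatialRDD.repartition(1)

    # With a single partition, the output folder should contain a
    # single part file.
    spatial_rdd.saveAsGeoJSON(output_path)
```

You would call it as save_as_single_geojson(my_rdd, "s3a://<your-bucket>/<folder>"),
where my_rdd and the path are placeholders for your own objects.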

Thanks,
Jia

On Fri, Aug 12, 2022 at 12:46 PM Bates, Oisin <[email protected]>
wrote:

> Hi,
> I have been using Sedona lately and encountered a specific use case that I
> believe is not currently supported.
>
> Currently, we are using Python and writing our output to an Amazon
> S3 bucket via Sedona's saveAsGeoJSON()
> <https://sedona.apache.org/tutorial/core-python/#save-to-permanent-storage>
> function. The default here is to save a partitioned/distributed file.
>
> Is it realistic to consider the option to write the GeoJSON output as a
> single file, or am I overlooking something fundamental in Sedona Core? I
> was thinking that something similar to pyspark.sql.DataFrame.coalesce
> <https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.coalesce.html>
> might be the most logical implementation?
>
> If my thoughts here seem reasonable, I'm happy to create a Jira ticket
> also. Appreciate your time and help on this.
> Best,
> Oisín
>
>
>
