RE: Sedona question/feature - writing SpatialRDDs as single GeoJSON file

Seo, Baewon Fri, 12 Aug 2022 21:27:08 -0700

Sorry to keep sending emails about Sedona.

I am having trouble passing a field list when converting a Spark DataFrame to a 
SpatialRDD.


When I checked the Python code, as you can see, it should take 3 params, but it 
only takes 2 params (dataframe, fieldNames); it doesn't have geometryFieldName.

Can you check this too?

Thanks,

[inline image]

From: Jia Yu <[email protected]>
Sent: Friday, August 12, 2022 4:49 PM
To: [email protected]
Cc: Seo, Baewon <[email protected]>
Subject: Re: Sedona question/feature - writing SpatialRDDs as single GeoJSON 
file


Hi Oisin,

By design, Spark / Sedona is not supposed to generate a single file from an RDD. 
But you can do this in Sedona by repartitioning. Use repartition(1) or 
coalesce(1) so that the resulting RDD has only 1 partition, then call 
saveAsGeoJSON.
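
As a rough sketch of the two steps above in Python: the helper name 
save_as_single_geojson is hypothetical, and the sketch assumes that the 
SpatialRDD exposes its underlying RDD as a readable and assignable 
rawSpatialRDD attribute. Please check this against the Sedona version in use.

```python
def save_as_single_geojson(spatial_rdd, output_path):
    """Collapse a SpatialRDD to one partition, then write it as GeoJSON.

    Hypothetical helper, not part of Sedona.
    """
    # Assumption: rawSpatialRDD can be read and reassigned on the SpatialRDD.
    # repartition(1) shuffles every record into a single partition.
    spatial_rdd.rawSpatialRDD = spatial_rdd.rawSpatialRDD.repartition(1)

    # With a single partition, the output folder should hold one part file.
    spatial_rdd.saveAsGeoJSON(output_path)
    return spatial_rdd
```

It could be called as save_as_single_geojson(spatial_rdd, "s3://some-bucket/output"), 
where the bucket path is only a placeholder.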

The output will be a single folder with a single file inside.
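
If a standalone file is needed afterwards, something like the following could 
pull that one file out of the folder. This is only a sketch: it assumes the 
output folder is on a local filesystem (not S3) and that the data file follows 
Spark's usual part-* naming. The function name extract_single_part_file is 
hypothetical.

```python
import glob
import os
import shutil


def extract_single_part_file(output_dir, target_path):
    """Copy the only part-* file in output_dir to target_path."""
    # Marker files such as _SUCCESS and hidden .crc files do not match part-*.
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly 1 part file in {output_dir}, found {len(part_files)}"
        )
    shutil.copyfile(part_files[0], target_path)
    return target_path
```

The check on the number of part files guards against the case where the RDD 
was not actually reduced to one partition before saving.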

Note that: (1) If your RDD is huge, repartitioning it to 1 partition might 
crash the cluster, since it puts all the data on a single machine. (2) Use 
repartition(1) if possible, because some users report that coalesce(1) leads to 
missing results.

Thanks,
Jia

On Fri, Aug 12, 2022 at 12:46 PM Bates, Oisin <[email protected]> wrote:
Hi,
I have been using Sedona lately and encountered a specific use case that I 
believe is not currently supported.

Currently, we are using Python and writing our output to an Amazon S3 
bucket via Sedona's 
saveAsGeoJSON()<https://sedona.apache.org/tutorial/core-python/#save-to-permanent-storage>
 function. The default here is to save a partitioned/distributed file.

Is it realistic to consider the option to write the GeoJSON output as a single 
file, or am I overlooking something fundamental in Sedona Core? I was thinking 
that something similar to 
pyspark.sql.DataFrame.coalesce<https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.coalesce.html>
 might be the most logical implementation?

If my thoughts here seem reasonable, I'm happy to create a Jira ticket also. 
Appreciate your time and help on this.
Best,
Oisín
