Hi Jia,

Thanks very much for your help!
The setup worked, I managed to run Apache Sedona in a Jupyter Notebook :)

However, I have run into another problem.
I have two cases:
1. small join: gdf1 contains POLYGONs, shape: (250 rows, 3 columns), gdf2
contains POINTs, shape: (2+ million rows, 5 columns)
2. big join: sdf1 contains POLYGONs, shape: (250 rows, 3 columns), sdf2
contains POINTs, shape: (56+ million rows, 5 columns)

The "small join" takes 3 seconds to run in Geopandas, but 5 minutes to run
in PySpark with Apache Sedona.
All the settings follow this notebook:
https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL.ipynb
with these three lines added:
spark.conf.set("sedona.global.index", "true")
spark.conf.set("sedona.global.indextype", "rtree")
spark.conf.set("sedona.join.gridtype", "kdbtree")
based on the settings in this file:
https://github.com/iag-geo/spark_testing/blob/master/apache_sedona/02_run_spatial_query.py
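Putting it together, my session setup looks roughly like this (a sketch, not my exact code; the session creation follows the Sedona binder notebook, and the app name is just a placeholder):

```python
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

# Session setup as in the Sedona binder notebook
spark = (
    SparkSession.builder
    .appName("sedona-join-test")  # placeholder name
    .config("spark.serializer", KryoSerializer.getName)
    .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
    .getOrCreate()
)
SedonaRegistrator.registerAll(spark)

# The three extra settings taken from the iag-geo script
spark.conf.set("sedona.global.index", "true")       # build an index during the join
spark.conf.set("sedona.global.indextype", "rtree")  # index type
spark.conf.set("sedona.join.gridtype", "kdbtree")   # spatial partitioning grid
```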

I also tried spatial partitioning: I created RDDs and ran JoinQuery, and
then JoinQueryRaw as well, but it again took around 5 minutes.

I tried out the "big join" with Apache Sedona. After an hour and a half, I
received the following warnings and errors:
WARN BlockManager: Block rdd_53_1 could not be removed as it was not found
on disk or in memory
ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 187)
java.lang.OutOfMemoryError: Java heap space
ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread
Thread[Executor task launch worker for task 187,5,main]
java.lang.OutOfMemoryError: Java heap space
WARN TaskSetManager: Lost task 1.0 in stage 11.0 (TID 187, roberts-mbp,
executor driver): java.lang.OutOfMemoryError: Java heap space
ERROR TaskSetManager: Task 1 in stage 11.0 failed 1 times; aborting job
ERROR Utils: Uncaught exception in thread Executor task launch worker for
task 190 java.lang.NullPointerException

There seems to be some problem with my settings, but I cannot get any
further without help from someone with more experience.

Do you have any recommendations on what to read, or on how to approach the
"big join"?

Have a nice evening,
Robert


On Sat, Mar 6, 2021 at 9:54 PM Jia Yu <[email protected]> wrote:

> Hi Robert,
>
> The tutorial you found on our website is a step-by-step tutorial for
> Python Jupyter. In that tutorial, pipenv will install all dependencies from
> binder/Pipfile:
> https://github.com/apache/incubator-sedona/tree/master/binder
>
> If you run into any specific issues, you can post here and we can help you.
>
> Thanks,
> Jia
>
> On Sat, Mar 6, 2021 at 9:57 AM Robert Bozsik <[email protected]>
> wrote:
>
>> Hi all,
>> I am new to PySpark and programming.
>> I would like to do a spatial join between two geographical datasets, one
>> consists of 50+ million rows.
>> Is there anyone here who could explain to me, step by step, how to
>> install Apache Sedona (GeoSpark) and its dependencies on a Mac?
>> After the installation I would like to run it locally in a virtual
>> environment, first using Jupyter Notebook then in a .py file.
>> On the official website I have found a quick start guide:
>> https://sedona.apache.org/download/overview/
>> and a Python Jupyter Notebook Examples guide:
>> https://sedona.apache.org/tutorial/jupyter-notebook/
>> However, it is still not clear how to install it and make it run.
>> Unfortunately, I didn't find any useful step-by-step guide with the help
>> of Google or YouTube, and I feel stuck in an infinite loop of reading
>> link after link, each explaining a different solution.
>> Thanks a lot in advance,
>> Robert
>> [email protected]
>>
>
