Hi Jia,

Thanks very much for your help! The setup worked: I managed to run Apache Sedona in a Jupyter Notebook :)
However, another problem has occurred. I have two cases:

1. small join: gdf1 contains POLYGONs, shape (250 rows, 3 columns); gdf2 contains POINTs, shape (2+ million rows, 5 columns)
2. big join: sdf1 contains POLYGONs, shape (250 rows, 3 columns); sdf2 contains POINTs, shape (56+ million rows, 5 columns)

The "small join" takes 3 seconds to run in GeoPandas, but 5 minutes to run in PySpark with Apache Sedona. All the settings are based on this notebook: https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL.ipynb , with these 3 lines in addition:

spark.conf.set("sedona.global.index", "true")
spark.conf.set("sedona.global.indextype", "rtree")
spark.conf.set("sedona.join.gridtype", "kdbtree")

based on the settings in this file: https://github.com/iag-geo/spark_testing/blob/master/apache_sedona/02_run_spatial_query.py

I also tried spatial partitioning: creating RDDs, then running JoinQuery, and after that JoinQueryRaw as well, but it again took around 5 minutes.

I also tried the "big join" with Apache Sedona. After an hour and a half, I received the following warnings and errors:

WARN BlockManager: Block rdd_53_1 could not be removed as it was not found on disk or in memory
ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 187)
java.lang.OutOfMemoryError: Java heap space
ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 187,5,main]
java.lang.OutOfMemoryError: Java heap space
WARN TaskSetManager: Lost task 1.0 in stage 11.0 (TID 187, roberts-mbp, executor driver): java.lang.OutOfMemoryError: Java heap space
ERROR TaskSetManager: Task 1 in stage 11.0 failed 1 times; aborting job
ERROR Utils: Uncaught exception in thread Executor task launch worker for task 190
java.lang.NullPointerException

There must be some problem with my settings, but I cannot move forward without help from someone with more experience.
Do you have any recommendations on what to read, or on how to get the "big join" to run?

Have a nice evening,
Robert

On Sat, Mar 6, 2021 at 9:54 PM Jia Yu <[email protected]> wrote:

> Hi Robert,
>
> The tutorial you found on our website is a step-by-step tutorial for
> Python Jupyter. In that tutorial, pipenv will install all dependencies from
> binder/Pipfile:
> https://github.com/apache/incubator-sedona/tree/master/binder
>
> If you run into any specific issues, you can post here and we can help you.
>
> Thanks,
> Jia
>
> On Sat, Mar 6, 2021 at 9:57 AM Robert Bozsik <[email protected]>
> wrote:
>
>> Hi all,
>> I am new to PySpark and programming.
>> I would like to do a spatial join between two geographical datasets, one
>> of which consists of 50+ million rows.
>> Is there anyone here who could explain to me step by step how to install
>> Apache Sedona (GeoSpark) and its dependencies on a Mac?
>> After the installation I would like to run it locally in a virtual
>> environment, first using Jupyter Notebook and then in a .py file.
>> On the official website I have found a quick start guide:
>> https://sedona.apache.org/download/overview/
>> and a Python Jupyter Notebook Examples guide:
>> https://sedona.apache.org/tutorial/jupyter-notebook/
>> However, it is still not clear how to install it and make it run.
>> Unfortunately, I didn't find any useful step-by-step guide with the help
>> of Google or YouTube, and I feel stuck in an infinite loop of reading link
>> after link, each explaining a different solution.
>> Thanks a lot in advance,
>> Robert
>> [email protected]
>>
