Status from website checks

2021-03-09 Thread Felix Cheung
https://whimsy.apache.org/pods/project/sedona


Re: Installing Apache Sedona and its dependencies

2021-03-09 Thread Adam Binford
Can't comment on the runtime, but there was a bug that prevented global
indexing from being used in a lot of cases (see
https://github.com/apache/incubator-sedona/pull/511), including any
attempts from SQL directly. The non-indexed join is very memory inefficient
right now (it loads all points/inner objects into memory at once), which is
likely what caused the OOM issue. The DynamicIndex join is the most memory
efficient, but you need to use the RDD API directly. I'm not sure when the next
release will be, but until then you can't really do big joins from SQL.
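
To illustrate the memory point above, here is a minimal pure-Python sketch of
the general idea: keep only the small side (the polygons) in memory and stream
the large side (the points) one at a time. This is not Sedona code and not how
Sedona's joins are implemented; the names `bbox`, `contains`, and
`streaming_join` are hypothetical, and a plain scan over bounding boxes stands
in for a real spatial index such as an R-tree.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Point = Tuple[float, float]
Polygon = Sequence[Point]  # ring of (x, y) vertices, first vertex not repeated


def bbox(poly: Polygon) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) of a polygon ring."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def contains(poly: Polygon, p: Point) -> bool:
    """Ray-casting point-in-polygon test (boundary points get no special handling)."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def streaming_join(
    polygons: List[Polygon], points: Iterable[Point]
) -> Iterator[Tuple[int, Point]]:
    """Yield (polygon_index, point) pairs without materializing the points."""
    # Built once, on the small side only (assumption: polygons fit in memory).
    boxes = [bbox(poly) for poly in polygons]
    for p in points:  # consumed lazily, one point at a time
        x, y = p
        for idx, (min_x, min_y, max_x, max_y) in enumerate(boxes):
            # Cheap bounding-box filter before the exact test.
            if min_x <= x <= max_x and min_y <= y <= max_y and contains(polygons[idx], p):
                yield idx, p
```

Because `streaming_join` is a generator, peak memory depends on the number of
polygons rather than the number of points, which is the property that matters
when one side has 250 rows and the other has 56+ million.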

Adam

On Mon, Mar 8, 2021 at 2:30 PM Robert Bozsik  wrote:

> Hi Jia,
>
> Thanks very much for your help!
> The setup worked, I managed to run Apache Sedona in a Jupyter Notebook :)
>
> However, another problem has occurred.
> I have two cases:
> 1. small join: gdf1 contains POLYGONs, shape: (250 rows, 3 columns), gdf2
> contains POINTs, shape: (2+ million rows, 5 columns)
> 2. big join: sdf1 contains POLYGONs, shape: (250 rows, 3 columns), sdf2
> contains POINTs, shape: (56+ million rows, 5 columns)
>
> The "small join" takes 3 seconds to run in Geopandas, but 5 minutes to run
> in PySpark with Apache Sedona.
> All the settings are done based on this notebook:
>
> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL.ipynb
> , with these 3 lines in addition:
> spark.conf.set("sedona.global.index", "true")
> spark.conf.set("sedona.global.indextype", "rtree")
> spark.conf.set("sedona.join.gridtype", "kdbtree")
> based on the settings in this file:
>
> https://github.com/iag-geo/spark_testing/blob/master/apache_sedona/02_run_spatial_query.py
>
> I also tried spatial partitioning: I created RDDs, then ran JoinQuery and,
> after that, JoinQueryRaw as well, but it again took around 5 minutes.
>
> I tried out the "big join" with Apache Sedona. After an hour and a half, I
> received the following warnings and errors:
> WARN BlockManager: Block rdd_53_1 could not be removed as it was not found
> on disk or in memory
> ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 187)
> java.lang.OutOfMemoryError: Java heap space
> ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread
> Thread[Executor task launch worker for task 187,5,main]
> java.lang.OutOfMemoryError: Java heap space
> WARN TaskSetManager: Lost task 1.0 in stage 11.0 (TID 187, roberts-mbp,
> executor driver): java.lang.OutOfMemoryError: Java heap space
> ERROR TaskSetManager: Task 1 in stage 11.0 failed 1 times; aborting job
> ERROR Utils: Uncaught exception in thread Executor task launch worker for
> task 190 java.lang.NullPointerException
>
> There must be some problem with my settings, but I cannot go forward
> without help from someone with more experience.
>
> Do you have any recommendations on what to read or how to approach the
> "big join"?
>
> Have a nice evening,
> Robert
>
>
> On Sat, Mar 6, 2021 at 9:54 PM Jia Yu  wrote:
>
> > Hi Robert,
> >
> > The tutorial you found on our website is a step-by-step tutorial for
> > Python Jupyter. In that tutorial, pipenv will install all dependencies from
> > binder/Pipfile:
> > https://github.com/apache/incubator-sedona/tree/master/binder
> >
> > If you run into any specific issues, you can post here and we can help you.
> >
> > Thanks,
> > Jia
> >
> > On Sat, Mar 6, 2021 at 9:57 AM Robert Bozsik 
> > wrote:
> >
> >> Hi all,
> >> I am new to PySpark and programming.
> >> I would like to do a spatial join between two geographical datasets, one
> >> of which consists of 50+ million rows.
> >> Is there anyone here who could explain to me step by step how to install
> >> Apache Sedona (GeoSpark) and its dependencies on a Mac?
> >> After the installation I would like to run it locally in a virtual
> >> environment, first using Jupyter Notebook then in a .py file.
> >> On the official website I have found a quick start guide:
> >> https://sedona.apache.org/download/overview/
> >> and a Python Jupyter Notebook Examples guide:
> >> https://sedona.apache.org/tutorial/jupyter-notebook/
> >> However, it is still not clear how to install it and make it run.
> >> Unfortunately, I didn't find any useful step-by-step guide with the help
> >> of Google or YouTube, and I feel stuck in an infinite loop of reading link
> >> after link, each explaining a different solution.
> >> Thanks a lot in advance,
> >> Robert
> >> robertboz...@gmail.com
> >>
> >
>


-- 
Adam Binford