Hi all,
I've been playing around with Sedona a little bit, and first want to say
thanks to all the work that has gone into this! I was able to get a spatial
join working on several hundred billion points in under 30 minutes (time is
relatively based on resources obviously). I did have to use the RDD API
directly to get around the bug I made a PR
<https://github.com/apache/incubator-sedona/pull/511> for. I wanted to
share some initial thoughts/questions/ideas after looking through the code
and playing around with it. While it would be cool to try to work on some
of these features, I probably won't have time to dedicate to it, but wanted
to bring them up at least in case I do or someone else wants to work them.
- The dynamic index join is the only join type that handles large
datasets remotely well by creating an iterator. You only have up to the
number of windows you have worth of rows in memory at one time. The
precomputed index joins will keep all results in memory until you are done
with the partition, while the nested loop join loads all points into memory
before starting to join (which is what currently prevents SQL joins from
working on large datasets at all). Possible options are:
- Use the dynamic index approach of defining an iterator to handle
things lazily. Might offer the best performance but adds a fair amount of
complexity to do properly.
- Use Java streams. It's a very simple flatMap/map operation, but I
don't know the performance impact.
- Use Scala iterators for the same flatMap/map operation which is
what's used in a lot of the Spark code, not sure if it's more performant
than Java streams?
- I saw there was already talk about a broadcast join, that would be
pretty cool.
- The only way to configure join parameters from SQL is through the
static Spark configuration. It would be cool to be able to change those
dynamically to test out different options for performance without having to
recreate your spark context. Options:
- Use the SQL RuntimeConfig to build the SedonaConf before a join.
- Use join hints to somehow override these configs. Absolutely no
idea if this is possible or how complicated it would be but it
sounds cool.
- Package the sedona python package in the sedona-python-adapter jar.
This is what Delta OSS does and makes it easy to use the Python modules by
just including the jar as a package. Obviously the attrs and shapely
dependencies make this trickier, so might not be as feasible.
Curious what others think.
--
Adam