SQL Ideas

Adam Binford Mon, 01 Mar 2021 14:26:10 -0800

Hi all,

I've been playing around with Sedona a little bit, and first want to say
thanks for all the work that has gone into this! I was able to get a spatial
join working on several hundred billion points in under 30 minutes (the time
is relative to the resources used, obviously). I did have to use the RDD API
directly to get around a bug that I opened a PR
<https://github.com/apache/incubator-sedona/pull/511> for. I wanted to
share some initial thoughts/questions/ideas after looking through the code
and playing around with it. While it would be cool to try to work on some
of these features, I probably won't have time to dedicate to them, but I
wanted to bring them up in case I do, or someone else wants to work on them.



   - The dynamic index join is the only join type that handles large
   datasets remotely well, because it creates an iterator. At any one time
   you hold at most as many rows in memory as you have windows. The
   precomputed index joins will keep all results in memory until you are done
   with the partition, while the nested loop join loads all points into memory
   before starting to join (which is what currently prevents SQL joins from
   working on large datasets at all). Possible options are:
      - Use the dynamic index approach of defining an iterator to handle
      things lazily. Might offer the best performance but adds a fair amount of
      complexity to do properly.
      - Use Java streams. It's a very simple flatMap/map operation, but I
      don't know the performance impact.
      - Use Scala iterators for the same flatMap/map operation, which is
      what a lot of the Spark code does. I'm not sure whether they are more
      performant than Java streams.
   - I saw there was already talk about a broadcast join; that would be
   pretty cool.
   - The only way to configure join parameters from SQL is through the
   static Spark configuration. It would be cool to be able to change those
   dynamically to test out different options for performance without having to
   recreate your Spark context. Options:
      - Use the SQL RuntimeConfig to build the SedonaConf before a join.
      - Use join hints to somehow override these configs. Absolutely no
      idea if this is possible or how complicated it would be, but it
      sounds cool.
   - Package the Sedona Python package in the sedona-python-adapter jar.
   This is what Delta OSS does, and it makes it easy to use the Python
   modules by just including the jar as a package. Obviously the attrs and
   shapely dependencies make this trickier, so it might not be as feasible.
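
To make the lazy iterator idea from the first bullet a bit more concrete,
here is a minimal sketch in plain Java. It is not Sedona's code: the class
name LazyJoinSketch, the method lazyJoin and the findMatches function are all
hypothetical, and findMatches just stands in for whatever index lookup or
nested loop produces the matches for one window. The point is only that
results are produced one pair at a time instead of being collected for the
whole partition first.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.Function;

public class LazyJoinSketch {

    /**
     * Lazily pairs each window with its matches (hypothetical helper).
     * Only the matches of the current window are held at any time;
     * nothing is buffered for the partition as a whole.
     */
    public static <W, P> Iterator<Map.Entry<W, P>> lazyJoin(
            Iterator<W> windows, Function<W, List<P>> findMatches) {
        return new Iterator<Map.Entry<W, P>>() {
            private W current;
            private Iterator<P> matches = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                // Advance to the next window that has at least one match.
                while (!matches.hasNext() && windows.hasNext()) {
                    current = windows.next();
                    matches = findMatches.apply(current).iterator();
                }
                return matches.hasNext();
            }

            @Override
            public Map.Entry<W, P> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return new SimpleEntry<>(current, matches.next());
            }
        };
    }

    public static void main(String[] args) {
        // Toy stand-in for an index lookup: "window" n matches n and n + 1.
        Function<Integer, List<Integer>> lookup = n -> Arrays.asList(n, n + 1);
        Iterator<Map.Entry<Integer, Integer>> it =
                lazyJoin(Arrays.asList(1, 2).iterator(), lookup);
        while (it.hasNext()) {
            Map.Entry<Integer, Integer> e = it.next();
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

A hand-rolled iterator like this is more code than a stream or Scala
flatMap, but it makes it explicit when each lookup runs, which is the
property that matters for large partitions.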

Curious what others think.

-- 
Adam
