Thanks for this very thorough write-up, and for continuing to update it as
you progress! As I said in the other thread, it would be great to do a
little profiling to see if we can get to the heart of the slowness with
nested case classes (very little optimization has been done in that code
path).
I just put up a repo with a write-up on how to import the GDELT public
dataset into Spark SQL and play around with it. It has a lot of notes on
different import methods and observations about Spark SQL. Feel free
to have a look and comment.
http://www.github.com/velvia/spark-sql-gdelt