First of all, how long do you want to keep doing this? The data is going to grow without bound, and eventually it will become too big for any single cluster to handle. If that stays within bounds for your use case, then try the following.
- Maintain a global variable holding the current RDD that stores all the log data so far. We are going to keep updating this variable.

- Every batch interval, take the new data and union it with the earlier unified RDD (in the global variable), then update the global variable. If you want SQL queries on this data, you will have to re-register this new RDD as the named table.

- With this approach the number of partitions is going to grow rapidly, so periodically repartition the unified RDD down to a smaller number of partitions. This scrambles the ordering of the data, but that may be fine if your queries are order-agnostic. Also, periodically checkpoint this RDD; otherwise the lineage is going to grow indefinitely and everything will gradually get slower.

Hope this helps.

TD

On Mon, Dec 8, 2014 at 6:29 PM, Xuelin Cao <xuelin...@yahoo.com.invalid> wrote:

>
> Hi,
>
> I'm wondering whether there is an efficient way to continuously
> append new data to a registered Spark SQL table.
>
> This is what I want:
> I want to build an ad-hoc query service over a JSON-formatted system log.
> Naturally, the system log is continuously generated. I will use Spark
> Streaming to consume the system log as my input, and I want to find a way to
> efficiently append the new data to an existing Spark SQL table. Furthermore,
> I want the whole table cached in memory/Tachyon.
>
> It looks like Spark SQL supports the "INSERT" method, but only for
> Parquet files. In addition, it is inefficient to insert a single row at a
> time.
>
> I do know that others have built a system similar to what I want (an
> ad-hoc query service over a growing system log). So there must be an
> efficient way. Does anyone know?