Re: Spark_Usecase

2016-06-07 Thread vaquar khan
Deepak, Spark does provide support for incremental load if users want to schedule their batch jobs frequently and have incremental loads of their data from databases. You will not get good performance updating Spark SQL tables backed by files. Instead, you can use message queues and
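
A rough sketch of the message-queue pattern hinted at above, assuming Kafka as the queue and the spark-streaming-kafka-0-10 integration; the broker, topic, group id, and output path are all placeholders, not anything from the thread:

// Sketch: consume change records from Kafka with the Spark Streaming
// direct approach and append each micro-batch to HDFS.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object IncrementalFromKafka {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("IncrementalFromKafka")
    val ssc  = new StreamingContext(conf, Seconds(60))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",            // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "incremental-loader",
      "auto.offset.reset"  -> "latest"
    )

    // Each record is assumed to be one change event published by the upstream database.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("db-changes"), kafkaParams))

    stream.map(_.value).foreachRDD { rdd =>
      // Skip empty batches; write each non-empty batch to its own HDFS directory.
      if (!rdd.isEmpty) rdd.saveAsTextFile(s"/data/incoming/changes-${System.currentTimeMillis}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}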

Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Deepak, thanks for the info. I was thinking of reading both the source and destination tables into separate RDDs/DataFrames, then applying specific transformations to find the updated rows, removing the updated keys from the destination, and appending the updated rows to it. Any pointers on this
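
A minimal sketch of the merge Ajay describes, assuming both tables share a key column "id" and the same schema; the connection details, table, and column names are placeholders:

// Sketch: drop the destination rows whose keys appear in the fresh source
// extract, then append the fresh versions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("IncrementalMerge").enableHiveSupport().getOrCreate()

val source = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")   // placeholder URL
  .option("dbtable", "orders")
  .option("user", "etl").option("password", "secret")
  .load()

val destination = spark.table("warehouse.orders")

val merged = destination
  .join(source.select("id"), Seq("id"), "left_anti") // keep only rows NOT being updated
  .union(source)                                     // union is positional: schemas must match

// File-backed tables cannot be updated in place, so write a fresh table
// rather than overwriting the one being read in the same job.
merged.write.mode("overwrite").saveAsTable("warehouse.orders_merged")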

Re: Spark_Usecase

2016-06-07 Thread Deepak Sharma
I am not sure if Spark provides any support for incremental extracts inherently. But you can maintain a file, e.g. extractRange.conf, in HDFS: read the end of the last range from it, and have the Spark job update it with the new end range before it finishes, so the relevant range is available for the next run. On
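
A rough sketch of that bookkeeping, assuming the file holds a single numeric high-water mark and the source table has a monotonically increasing "id" column; paths, table names, and connection details are placeholders:

// Sketch: read the high-water mark from HDFS, extract only newer rows,
// then persist the new mark before the job ends.
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder.appName("RangeTrackedExtract").enableHiveSupport().getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val rangePath = new Path("/etl/extractRange.conf")

// Read the last extracted id (start from 0 on the very first run).
val lastId =
  if (fs.exists(rangePath)) {
    val in = fs.open(rangePath)
    try scala.io.Source.fromInputStream(in).mkString.trim.toLong finally in.close()
  } else 0L

// Pull only the rows beyond the stored high-water mark.
val delta = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")
  .option("dbtable", s"(SELECT * FROM orders WHERE id > $lastId) t")
  .option("user", "etl").option("password", "secret")
  .load()
  .cache() // avoid re-querying the database for the max() below

delta.write.mode("append").saveAsTable("warehouse.orders")

// Persist the new high-water mark, but only if this run actually saw rows.
val maxRow = delta.agg(max("id")).head()
if (!maxRow.isNullAt(0)) {
  val out = fs.create(rangePath, true) // overwrite the old range file
  try out.write(maxRow.get(0).toString.getBytes(StandardCharsets.UTF_8)) finally out.close()
}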

Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Mich, thanks for your inputs. I used Sqoop to get the data from MySQL. Now I am using Spark to do the same. Right now, I am trying to implement incremental updates while loading from MySQL through Spark. Can you suggest any best practices for this? Thank you. On Tuesday, June 7, 2016, Mich

Re: Spark_Usecase

2016-06-07 Thread Mich Talebzadeh
I use Spark rather than Sqoop to import data from an Oracle table into a Hive ORC table. It uses JDBC for this purpose, all inclusive in Scala itself. Also, Hive runs on the Spark engine, an order of magnitude faster than Hive on map-reduce. Pretty simple. HTH Dr Mich Talebzadeh LinkedIn *
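
A hedged sketch of this Oracle-to-Hive-ORC load as described, assuming a plain JDBC read and an ORC-backed Hive table; the connection details and table names are placeholders:

// Sketch: read an Oracle table over JDBC and save it as an ORC Hive table.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("OracleToHiveOrc").enableHiveSupport().getOrCreate()

val oracleDf = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder URL
  .option("dbtable", "SCOTT.SALES")
  .option("user", "scott").option("password", "secret")
  .option("fetchsize", "10000") // larger fetch size helps bulk reads
  .load()

oracleDf.write
  .format("orc")
  .mode("overwrite")
  .saveAsTable("warehouse.sales_orc") // ORC-backed Hive table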

Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Marco, Ted, thanks for your time. I am sorry if I wasn't clear enough. We have two sources: 1) SQL Server, 2) files pushed onto the edge node by upstreams on a daily basis. Point 1 has been achieved by using the JDBC format in Spark SQL. Point 2 has been achieved by using a shell script. My only

Re: Spark_Usecase

2016-06-07 Thread Ted Yu
bq. load the data from edge node to hdfs Does the loading involve accessing SQL Server? Please take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni wrote: > Hi > how about > > 1. have a process that

Re: Spark_Usecase

2016-06-07 Thread Marco Mistroni
Hi, how about: 1. have a process that reads the data from your SQL Server and dumps it as a file into a directory on your HD; 2. use Spark Streaming to read data from that directory and store it into HDFS. Perhaps there is some sort of Spark 'connector' that allows you to read data from a db
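
A minimal sketch of step 2 above, assuming newline-delimited dumps land in a staging directory visible to Spark; the paths are placeholders:

// Sketch: watch a directory for new dump files and copy each micro-batch
// into a timestamped HDFS directory.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DirectoryToHdfs")
val ssc  = new StreamingContext(conf, Seconds(60))

// textFileStream picks up files newly created in the monitored directory.
val dumps = ssc.textFileStream("hdfs:///staging/sqlserver-dumps")

dumps.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty) rdd.saveAsTextFile(s"hdfs:///data/sqlserver/${time.milliseconds}")
}

ssc.start()
ssc.awaitTermination()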

Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Spark users, Right now we are using Spark for everything (loading the data from SQL Server, applying transformations, saving it as permanent tables in Hive) in our environment. Everything is being done in one Spark application. The only thing we do before we launch our spark application through