Hi All,

Will there be any performance difference between running the Hive native Java UDFs in spark-shell using HiveContext and recoding the entire logic in Spark SQL?
Or does Spark anyway convert the Hive Java UDFs into Spark SQL code, so we don't need to rewrite the entire logic in Spark SQL?
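For concreteness, here is a minimal spark-shell sketch of the two options being compared. The Hive UDF class name, function names and table are hypothetical, the UDF jar is assumed to be on the classpath (passed via --jars), and this uses the Spark 1.x HiveContext style the thread below refers to:

    // spark-shell (Spark 1.6.x) provides `sc`; build a HiveContext on top of it.
    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)

    // Option A: keep the existing Hive Java UDF (hypothetical class name),
    // registered exactly as it would be in Hive.
    hiveContext.sql("CREATE TEMPORARY FUNCTION clean_str AS 'com.mycompany.hive.udf.CleanString'")
    val viaHiveUdf = hiveContext.sql("SELECT clean_str(name) FROM staging_table")

    // Option B: the same logic rewritten as a native Scala UDF on the same context.
    hiveContext.udf.register("clean_str_scala",
      (s: String) => if (s == null) null else s.trim.toUpperCase)
    val viaScalaUdf = hiveContext.sql("SELECT clean_str_scala(name) FROM staging_table")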
On Mon, Feb 6, 2017 at 2:40 AM, kuassi mensah <kuassi.men...@oracle.com> wrote:

> Apology in advance for injecting an Oracle product into this discussion, but I
> thought it might help address the requirements (as far as I understood these).
> We are looking into furnishing for Spark a new connector similar to the
> Oracle Datasource for Hadoop
> <http://www.oracle.com/technetwork/database/database-technologies/bdc/oracle-datasource-4-hadoop-3158076.pdf>,
> which will implement the Spark DataSource interfaces for the Oracle Database.
>
> In summary, it will:
>
>    - allow parallel and direct access to the Oracle database (with an option
>    to control the number of concurrent connections)
>    - introspect the Oracle table, then dynamically generate partitions of
>    Spark JDBCRDDs based on the split pattern, and rewrite Spark SQL queries
>    into Oracle SQL queries for each partition. The typical use case consists
>    of joining fact data (or Big Data) with master data in Oracle.
>    - provide hooks in the Oracle JDBC driver for faster type conversions
>    - implement predicate pushdown, partition pruning and column projection
>    to the Oracle database, thereby reducing the amount of data to be processed
>    on Spark
>    - write back to an Oracle table (through parallel insert) the result of
>    Spark SQL processing, for further mining by traditional BI tools.
>
> You may reach out to me offline for more details if interested,
>
> Kuassi
>
> On 1/29/2017 3:39 AM, Mich Talebzadeh wrote:
>
> This is classic, nothing special about it.
>
>    1. Your source is Oracle schema tables.
>    2. You can use an Oracle JDBC connection with DIRECT CONNECT and parallel
>    processing to read your data from the Oracle table into Spark FP using JDBC.
>    Ensure that you are getting data from the Oracle DB at a time when the DB is
>    not busy and the network between your Spark and Oracle is reasonable. You will
>    be creating multiple connections to your Oracle database from Spark.
>    3. Create a DF from the RDD and ingest your data into Hive staging tables.
>    This should be pretty fast. If you are using a recent version of Spark
>    (> 1.5) you can see this in the Spark GUI.
>    4. Once data is ingested into the Hive table (frequency Discrete,
>    Recurring or Cumulative), then you have your source data in Hive.
>    5. Do your work in the Hive staging tables, and then your enriched data
>    will go into Hive enriched tables (different from your staging tables). You
>    can use Spark to enrich (transform) your data in the Hive staging tables.
>    6. Then use Spark to send that data into the Oracle table. Again, bear in
>    mind that the application has to handle consistency from Big Data into the
>    RDBMS, for example what you are going to do with failed transactions in
>    Oracle.
>    7. From my experience you also need some staging tables in Oracle to
>    handle inserts from Hive via Spark into the Oracle table.
>    8. Finally, run a job in PL/SQL to load the Oracle target tables from the
>    Oracle staging tables.
>
> Notes:
>
> Oracle column types are not 100% compatible with Spark. For example, Spark
> does not recognize the CHAR column, and that has to be converted into VARCHAR
> or STRING.
> Hive does not have the concept of the Oracle "WITH CLAUSE" inline table, so
> a script that works in Oracle may not work in Hive. Windowing functions
> should be fine.
>
> I tend to do all this via shell scripts that give control at each layer
> and create alarms.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
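As a rough illustration of steps 2, 3 and 6 above, here is a spark-shell sketch of a parallel JDBC read from Oracle into a DataFrame, a write into a Hive staging table, and a write back to an Oracle staging table. The connection URL, credentials, table and column names are placeholders, and the Oracle JDBC driver is assumed to be on the classpath (e.g. spark-shell started with --jars ojdbc7.jar):

    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)

    // Step 2: parallel JDBC read; numPartitions controls the number of
    // concurrent connections opened against Oracle.
    val oracleDF = hiveContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
      "dbtable"         -> "SCOTT.SALES",
      "user"            -> "scott",
      "password"        -> "tiger",
      "partitionColumn" -> "SALE_ID",    // numeric column used to split the read
      "lowerBound"      -> "1",
      "upperBound"      -> "10000000",
      "numPartitions"   -> "8"
    )).load()

    // Step 3: land the data in a Hive staging table.
    oracleDF.write.mode("overwrite").saveAsTable("sales_stg")

    // Steps 6/7: after enrichment, push the result to an Oracle staging table.
    import java.util.Properties
    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")
    hiveContext.sql("SELECT * FROM sales_enriched")
      .write.mode("append")
      .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "SCOTT.SALES_STG", props)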
> On 29 January 2017 at 10:18, Alex <siri8...@gmail.com> wrote:
>
>> Hi All,
>>
>> Thanks for your response. Please find the flow diagram below.
>>
>> Please help me simplify this architecture using Spark.
>>
>> 1) Can I skip steps 1 to 4 and directly store the data in Spark?
>> If I am storing it in Spark, where is it actually getting stored?
>> Do I need to retain Hadoop to store the data, or can I store it directly
>> in Spark and remove Hadoop as well?
>>
>> I want to remove Informatica for preprocessing and directly load the
>> file data coming from the server into Hadoop/Spark.
>>
>> So my question is: can I directly load file data into Spark? Then where
>> exactly will the data get stored? Do I need to have Spark installed on top
>> of HDFS?
>>
>> 2) If I am retaining the architecture below, can I store the output from
>> Spark directly back to Oracle (from step 5 to step 7), and will Spark's way
>> of storing it back to Oracle be better than using Sqoop, performance wise?
>>
>> 3) Can I use Spark Scala UDFs to process data from Hive and retain the
>> entire architecture?
>>
>> Which among the above would be optimal?
>>
>> [image: Inline image 1]
>>
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.n...@gmail.com> wrote:
>>
>>> I strongly agree with Jorn and Russell. There are different solutions
>>> for data movement depending upon your needs: frequency, bi-directional
>>> drivers, workflow, handling duplicate records. This space is known as
>>> "Change Data Capture", CDC for short. If you need more information, I
>>> would be happy to chat with you. I built some products in this space that
>>> extensively used connection pooling over ODBC/JDBC.
>>>
>>> Happy to chat if you need more information.
>>>
>>> -Sachin Naik
>>>
>>> >> Hard to tell. Can you give more insights on what you try to achieve
>>> >> and what the data is about?
>>> >> For example, depending on your use case sqoop can make sense or not.
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>> You can treat Oracle as a JDBC source
>>> (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
>>> and skip Sqoop and Hive tables and go straight to queries. Then you can skip
>>> Hive on the way back out (see the same link) and write directly to Oracle.
>>> I'll leave the performance questions for someone else.
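A minimal sketch of that suggestion, assuming spark-shell with the Oracle JDBC driver on the classpath: Oracle straight into Spark via JDBC and straight back out, with no Sqoop or Hive in the middle. The URL, credentials, tables and columns are invented for illustration:

    import java.util.Properties

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")

    // Partitioned read: the column, bounds and partition count split the table
    // scan across 8 concurrent Oracle connections.
    val sales = sqlContext.read.jdbc(url, "SCOTT.SALES", "SALE_ID", 1L, 10000000L, 8, props)
    sales.registerTempTable("sales")

    val summary = sqlContext.sql(
      "SELECT REGION, SUM(AMOUNT) AS TOTAL_AMOUNT FROM sales GROUP BY REGION")

    // Write the result directly back to an Oracle table for downstream BI tools.
    summary.write.mode("append").jdbc(url, "SCOTT.SALES_SUMMARY", props)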
>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8...@gmail.com> wrote:
>>>
>>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8...@gmail.com> wrote:
>>>>
>>>> Hi Team,
>>>>
>>>> Right now our existing flow is:
>>>>
>>>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) -->
>>>> destination Hive table --> Sqoop export to Oracle
>>>>
>>>> Half of the Hive UDFs required are developed as Java UDFs.
>>>>
>>>> So now I want to know: if I run native Scala UDFs rather than running the
>>>> Hive Java UDFs in Spark SQL, will there be any performance difference?
>>>>
>>>> Can we skip the Sqoop import and export part and instead directly load
>>>> data from Oracle to Spark, code Scala UDFs for the transformations, and
>>>> export the output data back to Oracle?
>>>>
>>>> Right now the architecture we are using is:
>>>>
>>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL -->
>>>> Hive --> Oracle
>>>>
>>>> What would be the optimal architecture to process data from Oracle using
>>>> Spark? Can I improve this process in any way?
>>>>
>>>> Regards,
>>>> Sirisha
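Tying these questions together, a hedged sketch of the proposed simplification: load from Oracle over JDBC, apply the transformation as a native Scala UDF in place of the Hive Java UDF, and write the result back to an Oracle staging table. All table names, columns and credentials below are illustrative:

    import java.util.Properties
    import org.apache.spark.sql.functions.udf

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")

    // Oracle -> Spark, no Sqoop import.
    val customers = sqlContext.read.jdbc(url, "SCOTT.CUSTOMERS", props)

    // Logic previously held in a Hive Java UDF, expressed as a Scala UDF (illustrative).
    val normalizeName = udf((s: String) => if (s == null) null else s.trim.toUpperCase)
    val transformed = customers.withColumn("CUST_NAME", normalizeName(customers("CUST_NAME")))

    // Spark -> Oracle staging table, no Sqoop export; a PL/SQL job can then
    // load the Oracle target tables from this staging table.
    transformed.write.mode("append").jdbc(url, "SCOTT.CUSTOMERS_STG", props)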