This is classic, nothing special about it.

   1. Your source is Oracle schema tables.
   2. You can use an Oracle JDBC connection with DIRECT CONNECT and parallel
   processing to read your data from the Oracle table into Spark using JDBC
   (see the sketch after this list). Ensure that you pull data from the
   Oracle DB at a time when the DB is not busy and the network between your
   Spark cluster and Oracle is reasonable. You will be creating multiple
   connections to your Oracle database from Spark.
   3. Create a DF from the RDD and ingest your data into Hive staging tables.
   This should be pretty fast. If you are using a recent version of Spark
   (> 1.5) you can watch this in the Spark GUI.
   4. Once data is ingested into the Hive table (frequency: discrete,
   recurring or cumulative), you have your source data in Hive.
   5. Do your work on the Hive staging tables; your enriched data then goes
   into Hive enriched tables (different from your staging tables). You can
   use Spark to enrich (transform) the data in the Hive staging tables.
   6. Then use Spark to send that data into the Oracle table. Again, bear in
   mind that the application has to handle consistency from Big Data into
   the RDBMS, for example what you are going to do with failed transactions
   in Oracle.
   7. From my experience you also need some staging tables in Oracle to
   handle inserts from Hive via Spark into the Oracle table.
   8. Finally, run a PL/SQL job to load the Oracle target tables from the
   Oracle staging tables.
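
As a rough illustration of steps 2, 3, 5 and 6 above, something along these
lines should work (a minimal sketch in Scala, assuming Spark 2.x built with
Hive support; the JDBC URL, credentials, schema, table and column names are
all placeholders, not taken from your setup):

import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("OracleToHiveAndBack")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder connection details -- replace with your own.
val jdbcUrl = "jdbc:oracle:thin:@//oracle-host:1521/ORCL"
val connProps = new Properties()
connProps.setProperty("user", "scratchpad")
connProps.setProperty("password", "changeme")
connProps.setProperty("driver", "oracle.jdbc.OracleDriver")

// Step 2: parallel JDBC read. Each of the 10 partitions opens its own
// connection to Oracle, split on an assumed numeric column PROD_ID.
val sourceDF = spark.read.jdbc(
  jdbcUrl, "scratchpad.sales",
  "PROD_ID", 1L, 1000000L, 10,
  connProps)

// Step 3: land the data in a Hive staging table.
sourceDF.write.mode(SaveMode.Overwrite).saveAsTable("staging.sales_stg")

// Step 5: enrich the staging data into a Hive enriched table.
spark.sql("""
  SELECT prod_id, SUM(amount_sold) AS total_sold
  FROM staging.sales_stg
  GROUP BY prod_id
""").write.mode(SaveMode.Overwrite).saveAsTable("enriched.sales_enr")

// Steps 6 and 7: push the enriched data into an Oracle staging table via
// JDBC; a PL/SQL job (step 8) then loads the Oracle target table from it.
spark.table("enriched.sales_enr")
  .write.mode(SaveMode.Append)
  .jdbc(jdbcUrl, "scratchpad.sales_enr_stg", connProps)

The consistency point in step 6 still applies: if the JDBC write fails part
way through, the Oracle staging table plus the PL/SQL load give you a clean
point to restart from.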

Notes:

Oracle column types are not 100% compatible with Spark. For example, Spark
does not recognise Oracle's CHAR column type, and that has to be converted
into VARCHAR or STRING (see the sketch below).
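
Continuing the sketch above, one way to handle a CHAR column is either to
cast it after the read or to push the conversion down into the Oracle query
(the STATUS column here is made up for illustration):

import org.apache.spark.sql.functions.col

// Option 1: cast the CHAR column to a Spark string after the JDBC read.
val cleanedDF = sourceDF.withColumn("STATUS", col("STATUS").cast("string"))

// Option 2: do the conversion on the Oracle side by reading from a subquery.
val castedDF = spark.read.jdbc(
  jdbcUrl,
  "(SELECT prod_id, amount_sold, CAST(status AS VARCHAR2(1)) AS status FROM scratchpad.sales) t",
  connProps)
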
Hive does not have the concept of Oracle's "WITH clause" inline table, so a
script that works in Oracle may not work in Hive. Windowing functions should
be fine.
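
If a WITH clause does trip you up, the usual workaround is to fold the
inline table back into a subquery. A rough example, continuing the
placeholder tables above:

// Oracle version:
//   WITH big_sellers AS (
//     SELECT prod_id FROM sales GROUP BY prod_id HAVING SUM(amount_sold) > 1000)
//   SELECT s.* FROM sales s JOIN big_sellers b ON s.prod_id = b.prod_id
//
// Rewritten without the WITH clause so it runs the same way in Hive/Spark SQL:
val bigSellerRows = spark.sql("""
  SELECT s.*
  FROM staging.sales_stg s
  JOIN (SELECT prod_id
        FROM staging.sales_stg
        GROUP BY prod_id
        HAVING SUM(amount_sold) > 1000) b
    ON s.prod_id = b.prod_id
""")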

I tend to do all this via a shell script that gives control at each layer
and creates alarms.

HTH




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 January 2017 at 10:18, Alex <siri8...@gmail.com> wrote:

> Hi All,
>
> Thanks for your response. Please find the flow diagram below.
>
> Please help me out simplifying this architecture using Spark
>
> 1) Can I skip step 1 to step 4 and directly store it in Spark? If I am
> storing it in Spark, where exactly does it get stored? Do I need to retain
> Hadoop to store the data, or can I directly store it in Spark and remove
> Hadoop also?
>
> I want to remove Informatica for preprocessing and directly load the file
> data coming from the server into Hadoop/Spark.
>
> So my question is: can I directly load file data into Spark? Then where
> exactly will the data get stored? Do I need to have Spark installed on top
> of HDFS?
>
> 2) If I am retaining the architecture below, can I store the output from
> Spark directly back to Oracle, from step 5 to step 7?
>
> And will Spark's way of storing it back to Oracle be better than using
> Sqoop, performance-wise?
> 3) Can I use Spark Scala UDFs to process data from Hive and retain the
> entire architecture?
>
> Which among the above would be optimal?
>
> [image: Inline image 1]
>
> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.n...@gmail.com>
> wrote:
>
>> I strongly agree with Jorn and Russell. There are different solutions for
>> data movement depending upon your needs: frequency, bi-directional drivers,
>> workflow, handling duplicate records. This space is known as "Change Data
>> Capture" (CDC) for short. If you need more information, I would be happy
>> to chat with you. I built some products in this space that extensively
>> used connection pooling over ODBC/JDBC.
>>
>> Happy to chat if you need more information.
>>
>> -Sachin Naik
>>
>> >> Hard to tell. Can you give more insights on what you try to achieve
>> >> and what the data is about?
>> >> For example, depending on your use case, Sqoop can make sense or not.
>> Sent from my iPhone
>>
>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <russell.spit...@gmail.com>
>> wrote:
>>
>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>> Sqoop and Hive tables and go straight to queries. Then you can skip Hive
>> on the way back out (see the same link) and write directly to Oracle.
>> I'll leave the performance questions for someone else.
>>
>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8...@gmail.com>
>> wrote:
>>
>>>
>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8...@gmail.com>
>>> wrote:
>>>
>>> Hi Team,
>>>
>>> Right now our existing flow is:
>>>
>>> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive
>>> Context)-->Destination Hive table -->sqoop export to Oracle
>>>
>>> Half of the required Hive UDFs are developed as Java UDFs.
>>>
>>> So now I want to know: if I run native Scala UDFs rather than running
>>> Hive Java UDFs in Spark SQL, will there be any performance difference?
>>>
>>>
>>> Can we skip the Sqoop import and export part and instead directly load
>>> data from Oracle into Spark, code Scala UDFs for the transformations,
>>> and export the output data back to Oracle?
>>>
>>> Right now the architecture we are using is:
>>>
>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL
>>> --> Hive --> Oracle
>>>
>>> What would be the optimal architecture to process data from Oracle using
>>> Spark? Can I improve this process in any way?
>>>
>>>
>>>
>>>
>>> Regards,
>>> Sirisha
>>>
>>>
>>>
>
