Hi All,

Will there be any performance difference between running the Hive native Java UDFs in spark-shell using HiveContext and recoding the entire logic in Spark SQL?
Or does Spark anyway convert the Hive Java UDFs into Spark SQL code, so we don't need to rewrite the entire logic in Spark SQL?
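For concreteness, here is a minimal spark-shell sketch of the two options being compared. The Hive UDF class name, function names and table are hypothetical, the UDF jar is assumed to be on the classpath (passed via --jars), and this uses the Spark 1.x HiveContext style the thread below refers to:

    // spark-shell (Spark 1.6.x) provides `sc`; build a HiveContext on top of it.
    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)

    // Option A: keep the existing Hive Java UDF (hypothetical class name),
    // registered exactly as it would be in Hive.
    hiveContext.sql("CREATE TEMPORARY FUNCTION clean_str AS 'com.mycompany.hive.udf.CleanString'")
    val viaHiveUdf = hiveContext.sql("SELECT clean_str(name) FROM staging_table")

    // Option B: the same logic rewritten as a native Scala UDF on the same context.
    hiveContext.udf.register("clean_str_scala",
      (s: String) => if (s == null) null else s.trim.toUpperCase)
    val viaScalaUdf = hiveContext.sql("SELECT clean_str_scala(name) FROM staging_table")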
On Mon, Feb 6, 2017 at 2:40 AM, kuassi mensah <kuassi.men...@oracle.com> wrote:

> Apology in advance for injecting an Oracle product into this discussion, but I
> thought it might help address the requirements (as far as I understood these).
> We are looking into furnishing for Spark a new connector similar to the
> Oracle Datasource for Hadoop
> <http://www.oracle.com/technetwork/database/database-technologies/bdc/oracle-datasource-4-hadoop-3158076.pdf>,
> which will implement the Spark DataSource interfaces for the Oracle Database.
>
> In summary, it will:
>
>    - allow parallel and direct access to the Oracle database (with an option
>    to control the number of concurrent connections)
>    - introspect the Oracle table, then dynamically generate partitions of
>    Spark JDBCRDDs based on the split pattern, and rewrite Spark SQL queries
>    into Oracle SQL queries for each partition. The typical use case consists
>    of joining fact data (or Big Data) with master data in Oracle.
>    - provide hooks in the Oracle JDBC driver for faster type conversions
>    - implement predicate pushdown, partition pruning and column projection
>    to the Oracle database, thereby reducing the amount of data to be processed
>    on Spark
>    - write back to an Oracle table (through parallel insert) the result of
>    Spark SQL processing, for further mining by traditional BI tools.
>
> You may reach out to me offline for more details if interested,
>
> Kuassi
>
> On 1/29/2017 3:39 AM, Mich Talebzadeh wrote:
>
> This is classic, nothing special about it.
>
>    1. Your source is Oracle schema tables.
>    2. You can use an Oracle JDBC connection with DIRECT CONNECT and parallel
>    processing to read your data from the Oracle table into Spark FP using JDBC.
>    Ensure that you are getting data from the Oracle DB at a time when the DB is
>    not busy and the network between your Spark and Oracle is reasonable. You will
>    be creating multiple connections to your Oracle database from Spark.
>    3. Create a DF from the RDD and ingest your data into Hive staging tables.
>    This should be pretty fast. If you are using a recent version of Spark
>    (> 1.5) you can see this in the Spark GUI.
>    4. Once data is ingested into the Hive table (frequency Discrete,
>    Recurring or Cumulative), then you have your source data in Hive.
>    5. Do your work in the Hive staging tables, and then your enriched data
>    will go into Hive enriched tables (different from your staging tables). You
>    can use Spark to enrich (transform) your data in the Hive staging tables.
>    6. Then use Spark to send that data into the Oracle table. Again, bear in
>    mind that the application has to handle consistency from Big Data into the
>    RDBMS, for example what you are going to do with failed transactions in
>    Oracle.
>    7. From my experience you also need some staging tables in Oracle to
>    handle inserts from Hive via Spark into the Oracle table.
>    8. Finally, run a job in PL/SQL to load the Oracle target tables from the
>    Oracle staging tables.
>
> Notes:
>
> Oracle column types are not 100% compatible with Spark. For example, Spark
> does not recognize the CHAR column, and that has to be converted into VARCHAR
> or STRING.
> Hive does not have the concept of the Oracle "WITH CLAUSE" inline table, so
> a script that works in Oracle may not work in Hive. Windowing functions
> should be fine.
>
> I tend to do all this via shell scripts that give control at each layer
> and create alarms.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
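As a rough illustration of steps 2, 3 and 6 above, here is a spark-shell sketch of a parallel JDBC read from Oracle into a DataFrame, a write into a Hive staging table, and a write back to an Oracle staging table. The connection URL, credentials, table and column names are placeholders, and the Oracle JDBC driver is assumed to be on the classpath (e.g. spark-shell started with --jars ojdbc7.jar):

    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)

    // Step 2: parallel JDBC read; numPartitions controls the number of
    // concurrent connections opened against Oracle.
    val oracleDF = hiveContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
      "dbtable"         -> "SCOTT.SALES",
      "user"            -> "scott",
      "password"        -> "tiger",
      "partitionColumn" -> "SALE_ID",    // numeric column used to split the read
      "lowerBound"      -> "1",
      "upperBound"      -> "10000000",
      "numPartitions"   -> "8"
    )).load()

    // Step 3: land the data in a Hive staging table.
    oracleDF.write.mode("overwrite").saveAsTable("sales_stg")

    // Steps 6/7: after enrichment, push the result to an Oracle staging table.
    import java.util.Properties
    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")
    hiveContext.sql("SELECT * FROM sales_enriched")
      .write.mode("append")
      .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "SCOTT.SALES_STG", props)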
> On 29 January 2017 at 10:18, Alex <siri8...@gmail.com> wrote:
>
>> Hi All,
>>
>> Thanks for your response. Please find the flow diagram below.
>>
>> Please help me simplify this architecture using Spark.
>>
>> 1) Can I skip steps 1 to 4 and directly store the data in Spark?
>> If I am storing it in Spark, where is it actually getting stored?
>> Do I need to retain Hadoop to store the data, or can I store it directly
>> in Spark and remove Hadoop as well?
>>
>> I want to remove Informatica for preprocessing and directly load the
>> file data coming from the server into Hadoop/Spark.
>>
>> So my question is: can I directly load file data into Spark? Then where
>> exactly will the data get stored? Do I need to have Spark installed on top
>> of HDFS?
>>
>> 2) If I am retaining the architecture below, can I store the output from
>> Spark directly back to Oracle (from step 5 to step 7), and will Spark's way
>> of storing it back to Oracle be better than using Sqoop, performance wise?
>>
>> 3) Can I use Spark Scala UDFs to process data from Hive and retain the
>> entire architecture?
>>
>> Which among the above would be optimal?
>>
>> [image: Inline image 1]
>>
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.n...@gmail.com> wrote:
>>
>>> I strongly agree with Jorn and Russell. There are different solutions
>>> for data movement depending upon your needs: frequency, bi-directional
>>> drivers, workflow, handling duplicate records. This space is known as
>>> "Change Data Capture", CDC for short. If you need more information, I
>>> would be happy to chat with you. I built some products in this space that
>>> extensively used connection pooling over ODBC/JDBC.
>>>
>>> Happy to chat if you need more information.
>>>
>>> -Sachin Naik
>>>
>>> >> Hard to tell. Can you give more insights on what you try to achieve
>>> >> and what the data is about?
>>> >> For example, depending on your use case sqoop can make sense or not.
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>> You can treat Oracle as a JDBC source
>>> (http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
>>> and skip Sqoop and Hive tables and go straight to queries. Then you can skip
>>> Hive on the way back out (see the same link) and write directly to Oracle.
>>> I'll leave the performance questions for someone else.
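A minimal sketch of that suggestion, assuming spark-shell with the Oracle JDBC driver on the classpath: Oracle straight into Spark via JDBC and straight back out, with no Sqoop or Hive in the middle. The URL, credentials, tables and columns are invented for illustration:

    import java.util.Properties

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")

    // Partitioned read: the column, bounds and partition count split the table
    // scan across 8 concurrent Oracle connections.
    val sales = sqlContext.read.jdbc(url, "SCOTT.SALES", "SALE_ID", 1L, 10000000L, 8, props)
    sales.registerTempTable("sales")

    val summary = sqlContext.sql(
      "SELECT REGION, SUM(AMOUNT) AS TOTAL_AMOUNT FROM sales GROUP BY REGION")

    // Write the result directly back to an Oracle table for downstream BI tools.
    summary.write.mode("append").jdbc(url, "SCOTT.SALES_SUMMARY", props)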
>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8...@gmail.com> wrote:
>>>
>>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8...@gmail.com> wrote:
>>>>
>>>> Hi Team,
>>>>
>>>> Right now our existing flow is:
>>>>
>>>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext) -->
>>>> destination Hive table --> Sqoop export to Oracle
>>>>
>>>> Half of the Hive UDFs required are developed as Java UDFs.
>>>>
>>>> So now I want to know: if I run native Scala UDFs rather than running the
>>>> Hive Java UDFs in Spark SQL, will there be any performance difference?
>>>>
>>>> Can we skip the Sqoop import and export part and instead directly load
>>>> data from Oracle to Spark, code Scala UDFs for the transformations, and
>>>> export the output data back to Oracle?
>>>>
>>>> Right now the architecture we are using is:
>>>>
>>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL -->
>>>> Hive --> Oracle
>>>>
>>>> What would be the optimal architecture to process data from Oracle using
>>>> Spark? Can I improve this process in any way?
>>>>
>>>> Regards,
>>>> Sirisha
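Tying these questions together, a hedged sketch of the proposed simplification: load from Oracle over JDBC, apply the transformation as a native Scala UDF in place of the Hive Java UDF, and write the result back to an Oracle staging table. All table names, columns and credentials below are illustrative:

    import java.util.Properties
    import org.apache.spark.sql.functions.udf

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")

    // Oracle -> Spark, no Sqoop import.
    val customers = sqlContext.read.jdbc(url, "SCOTT.CUSTOMERS", props)

    // Logic previously held in a Hive Java UDF, expressed as a Scala UDF (illustrative).
    val normalizeName = udf((s: String) => if (s == null) null else s.trim.toUpperCase)
    val transformed = customers.withColumn("CUST_NAME", normalizeName(customers("CUST_NAME")))

    // Spark -> Oracle staging table, no Sqoop export; a PL/SQL job can then
    // load the Oracle target tables from this staging table.
    transformed.write.mode("append").jdbc(url, "SCOTT.CUSTOMERS_STG", props)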