You can use Spark directly on the CSV files.

   1. Put the CSV files into HDFS under /apps/<PROJECT>/data/staging/<TABLE_NAME>
   2. Multiple CSV files for the same table can co-exist in that directory
   3. Read them with something like df1 = spark.read.option("header", "false").csv(location), as sketched below
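A minimal sketch of the above in Scala, assuming a Spark 2.x spark-shell (so
the spark session already exists); the placeholder path simply mirrors the
layout from step 1:

    // Placeholder path following the layout suggested in step 1.
    val location = "hdfs:///apps/<PROJECT>/data/staging/<TABLE_NAME>"

    // Spark reads every CSV file under the directory into a single DataFrame,
    // so multiple files for the same table can sit side by side.
    val df1 = spark.read
      .option("header", "false")      // "true" if the files carry a header row
      .option("inferSchema", "true")  // optional: let Spark guess column types
      .csv(location)

    df1.printSchema()
    df1.show(5)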


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 January 2017 at 14:37, Alex <siri8...@gmail.com> wrote:

> But for persistence after intermediate processing, can I use the Spark
> cluster itself, or do I have to use a Hadoop cluster?
>
> On Jan 29, 2017 7:36 PM, "Deepak Sharma" <deepakmc...@gmail.com> wrote:
>
> The better way is to read the data directly into Spark using the Spark SQL
> JDBC reader, apply the UDFs there, and then save the DataFrame back to
> Oracle using the DataFrame JDBC writer.
>
> Thanks
> Deepak
>
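A rough sketch of the read-jdbc / apply-UDFs / write-jdbc flow Deepak
describes, assuming Spark 2.x with the Oracle JDBC driver on the classpath;
the URL, credentials, table and column names below are invented:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    val spark = SparkSession.builder().appName("oracle-jdbc-sketch").getOrCreate()

    // Hypothetical connection details.
    val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/SERVICE"
    val props = new java.util.Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // 1. Read the source table straight into a DataFrame over JDBC.
    val src = spark.read.jdbc(jdbcUrl, "SOURCE_TABLE", props)

    // 2. Apply the transformation as a native Spark UDF.
    val cleanup = udf((s: String) => if (s == null) "" else s.trim.toUpperCase)
    val out = src.withColumn("NAME", cleanup(col("NAME")))

    // 3. Write the result back to Oracle over JDBC.
    out.write.mode("append").jdbc(jdbcUrl, "TARGET_TABLE", props)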
> On Jan 29, 2017 7:15 PM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>
>> One alternative could be the Oracle Hadoop loader and other Oracle
>> products, but you have to invest some money and probably buy their Hadoop
>> appliance, so you have to evaluate whether that makes sense (it can get
>> expensive with large clusters etc.).
>>
>> Another alternative would be to get rid of Oracle altogether and use
>> other databases.
>>
>> However, can you elaborate a little on your use case and the business
>> logic, as well as the SLA requirements? Otherwise every recommendation is
>> equally valid, because the requirements you presented are very generic.
>>
>> About getting rid of Hadoop - that depends! You will need some resource
>> manager (YARN, Mesos, Kubernetes etc.) and most likely also a distributed
>> file system. Through the Hadoop APIs Spark supports a wide range of file
>> systems, but it does not need HDFS for persistence. You can use a local
>> filesystem (i.e. any file system mounted on a node, including distributed
>> ones such as ZFS) or cloud file systems (S3, Azure Blob etc.).
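As a small illustration of that point, persistence is largely a matter of the
path scheme you hand to Spark; the bucket and path names below are
placeholders, and the S3/Azure schemes additionally need the matching Hadoop
connector jars and credentials configured:

    val df = spark.read.parquet("hdfs:///data/events")    // HDFS, if you have it

    df.write.parquet("file:///mnt/shared/events")          // any filesystem mounted on the nodes
    df.write.parquet("s3a://my-bucket/events")             // Amazon S3
    df.write.parquet("wasb://container@myaccount.blob.core.windows.net/events") // Azure Blob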
>>
>>
>>
>> On 29 Jan 2017, at 11:18, Alex <siri8...@gmail.com> wrote:
>>
>> Hi All,
>>
>> Thanks for your response .. Please find below flow diagram
>>
>> Please help me out simplifying this architecture using Spark
>>
>> 1) Can I skip steps 1 to 4 and store the data directly in Spark? If I am
>> storing it in Spark, where does it actually get stored? Do I need to
>> retain Hadoop to store the data, or can I store it directly in Spark and
>> remove Hadoop as well?
>>
>> I want to remove Informatica for preprocessing and directly load the
>> file data coming from the server into Hadoop/Spark.
>>
>> So my question is: can I load file data directly into Spark? If so, where
>> exactly would the data get stored? Do I need to have Spark installed on
>> top of HDFS?
>>
>> 2) If I keep the architecture below, can I store the output from Spark
>> directly back into Oracle, going from step 5 to step 7?
>>
>> And will Spark's way of writing it back to Oracle perform better than
>> Sqoop?
>> 3) Can I use Spark Scala UDFs to process the data from Hive and keep the
>> entire architecture?
>>
>> Which of the above would be optimal?
>>
>> [image: Inline image 1]
>>
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik <sachin.u.n...@gmail.com>
>> wrote:
>>
>>> I strongly agree with Jörn and Russell. There are different solutions
>>> for data movement depending on your needs: frequency, bi-directional
>>> drivers, workflow, handling of duplicate records. This space is known as
>>> "Change Data Capture", or CDC for short. I built some products in this
>>> space that extensively used connection pooling over ODBC/JDBC.
>>>
>>> Happy to chat if you need more information.
>>>
>>> -Sachin Naik
>>>
>>> >> Hard to tell. Can you give more insights on what you are trying to
>>> >> achieve and what the data is about?
>>> >> For example, depending on your use case Sqoop can make sense or not.
>>> Sent from my iPhone
>>>
>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer <russell.spit...@gmail.com>
>>> wrote:
>>>
>>> You can treat Oracle as a JDBC source (
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
>>> ) and skip Sqoop and Hive tables and go straight to queries. Then you can
>>> skip Hive on the way back out (see the same link) and write directly to
>>> Oracle. I'll leave the performance questions for someone else.
>>>
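A sketch of "going straight to queries", i.e. pushing the SQL down to Oracle
through the JDBC source; the URL, credentials and query below are made up:

    val pushed = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
      .option("dbtable", "(SELECT id, name, amount FROM orders WHERE load_dt = CURRENT_DATE) q")
      .option("user", "scott")
      .option("password", "tiger")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load()

    // Writing back skips Hive the same way, e.g.
    //   pushed.write.mode("append").jdbc(url, "TARGET_TABLE", props)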
>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu <siri8...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu <siri8...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Team,
>>>>
>>>> Right now our existing flow is:
>>>>
>>>> Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (HiveContext)
>>>> --> destination Hive table --> Sqoop export to Oracle
>>>>
>>>> Half of the required Hive UDFs are developed as Java UDFs.
>>>>
>>>> So now I want to know: if I run native Scala UDFs instead of the Hive
>>>> Java UDFs in spark-sql, will there be any performance difference?
>>>>
>>>>
>>>> Can we skip the Sqoop import and export parts and instead load the data
>>>> directly from Oracle into Spark, code Scala UDFs for the
>>>> transformations, and export the output data back to Oracle?
>>>>
>>>> Right now the architecture we are using is:
>>>>
>>>> Oracle --> Sqoop (import) --> Hive tables --> Hive queries --> Spark SQL
>>>> --> Hive --> Oracle
>>>>
>>>> What would be the optimal architecture for processing data from Oracle
>>>> using Spark? Can I improve this process in any way?
>>>>
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Sirisha
>>>>
>>>>
>>>>
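On the Scala UDF versus Hive Java UDF question above, a hedged sketch of the
two registration styles being compared (function, jar and table names are
invented; the Hive route needs Hive support enabled and the UDF jar on the
classpath):

    // Native Scala UDF registered directly with the SparkSession.
    spark.udf.register("clean_name", (s: String) => if (s == null) "" else s.trim.toUpperCase)
    spark.sql("SELECT clean_name(name) FROM staging_table").show()

    // Existing Hive Java UDF reused through Hive DDL.
    spark.sql("ADD JAR /path/to/hive-udfs.jar")
    spark.sql("CREATE TEMPORARY FUNCTION clean_name_hive AS 'com.example.udf.CleanName'")
    spark.sql("SELECT clean_name_hive(name) FROM staging_table").show()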
>>
>
