Why not directly access the S3 file from Spark?

You need to configure IAM (roles or access keys) so that the machine running the
Spark job is allowed to access the bucket.
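
For example, a minimal sketch of reading one of the TSV files straight from S3 on
a local cluster, assuming the hadoop-aws (s3a) connector and its AWS SDK
dependency are on the Spark classpath and that credentials come from the default
AWS provider chain:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3-direct-access")
      .master("local[*]")  // or supply the master via spark-submit
      // Credentials are picked up from the default provider chain
      // (environment variables, ~/.aws/credentials, or an instance profile).
      .config("spark.hadoop.fs.s3a.aws.credentials.provider",
              "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
      .getOrCreate()

    // Read the history file directly from the bucket named in the scenario below.
    val customers = spark.read
      .option("sep", "\t")
      .option("header", "true")
      .csv("s3a://customer.history.data/customer.tsv")

    customers.createOrReplaceTempView("customer_history")
    spark.sql("SELECT COUNT(*) FROM customer_history").show()

Note that with plain TSV there is no column pruning or predicate pushdown, so a
query reads the whole files from S3; keeping the data on S3 as Parquet instead
reduces the amount of data transferred to the local cluster.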

> On 24.10.2018, at 06:40, Divya Gehlot <divya.htco...@gmail.com> wrote:
> 
> Hi Omer,
> Here are a couple of solutions you can implement for your use case:
> 
> Option 1:
> You can mount the S3 bucket as a local file system.
> Details here:
> https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
> 
> Option 2:
> You can use AWS Glue for your use case.
> Details here:
> https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/
> 
> Option 3:
> Store the file on the local file system and later push it to the S3 bucket.
> Details here:
> https://stackoverflow.com/questions/48067979/simplest-way-to-fetch-the-file-from-ftp-server-on-prem-put-into-s3-bucket
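> 
> For example, a minimal sketch of the upload step with the AWS SDK for Java v1
> from Scala (bucket name, key, and local path are placeholders):
> 
>     import com.amazonaws.services.s3.AmazonS3ClientBuilder
>     import java.io.File
> 
>     // The client picks up credentials and region from the default configuration.
>     val s3 = AmazonS3ClientBuilder.defaultClient()
>     // Upload the locally staged file to the target bucket.
>     s3.putObject("customer.daily.data", "daily_customer.tsv",
>       new File("/data/crm/daily_customer.tsv"))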
> 
> Thanks,
> Divya 
> 
>> On Tue, 23 Oct 2018 at 15:53, <omer.ozsaka...@sony.com> wrote:
>> Hi guys,
>> 
>> We are using Apache Spark on a local machine.
>> 
>> I need to implement the scenario below.
>> 
>> In the initial load:
>> 
>> - The CRM application will send a file to a folder on the local server. This
>>   file contains the information of all customers and is named customer.tsv.
>> - customer.tsv contains customerid, country, birth_month, activation_date, etc.
>> - I need to read the contents of customer.tsv.
>> - I will add the current timestamp to the file.
>> - I will transfer customer.tsv to the S3 bucket: customer.history.data
>>   (sketched below).
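>> 
>> A minimal sketch of this initial load, assuming an existing SparkSession named
>> spark with the s3a connector configured; the local path and the load_timestamp
>> column name are illustrative:
>> 
>>     import org.apache.spark.sql.functions.current_timestamp
>> 
>>     // Read the full customer extract from the local folder (path is an assumption).
>>     val initialLoad = spark.read
>>       .option("sep", "\t")
>>       .option("header", "true")
>>       .csv("file:///data/crm/customer.tsv")
>>       // Stamp every row with the load time.
>>       .withColumn("load_timestamp", current_timestamp())
>> 
>>     // Write the stamped data to the history bucket.
>>     initialLoad.write
>>       .mode("append")
>>       .option("sep", "\t")
>>       .option("header", "true")
>>       .csv("s3a://customer.history.data/")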
>> 
>> In the daily loads:
>> 
>> - The CRM application will send a new file which contains the
>>   updated/deleted/inserted customer information. The file is named
>>   daily_customer.tsv.
>> 
>> - daily_customer.tsv contains customerid, cdc_field, country, birth_month,
>>   activation_date, etc. The cdc_field can be New-Customer,
>>   Customer-is-Updated, or Customer-is-Deleted.
>> 
>> - I need to read the contents of daily_customer.tsv.
>> - I will add the current timestamp to the file.
>> - I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
>> - I need to merge the two buckets customer.history.data and customer.daily.data.
>>   Both buckets have timestamp fields, so I need to query all records with the
>>   latest timestamp. I can use row_number() over (partition by customer_id
>>   order by timestamp_field desc) as version_number.
>> - Then I can put the records whose version_number is 1 into the final bucket:
>>   customer.dimension.data (see the sketch below).
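>> 
>> A rough sketch of this merge, assuming both buckets were written as TSV with a
>> load_timestamp column added at load time (column names as above):
>> 
>>     import org.apache.spark.sql.expressions.Window
>>     import org.apache.spark.sql.functions.{col, row_number}
>> 
>>     val history = spark.read.option("sep", "\t").option("header", "true")
>>       .csv("s3a://customer.history.data/")
>>     val daily = spark.read.option("sep", "\t").option("header", "true")
>>       .csv("s3a://customer.daily.data/")
>> 
>>     // The history data has no cdc_field, so drop it before the union;
>>     // rows flagged Customer-is-Deleted could also be filtered out here.
>>     val combined = history.unionByName(daily.drop("cdc_field"))
>> 
>>     // Keep only the latest version of each customer.
>>     val byLatest = Window.partitionBy("customerid")
>>       .orderBy(col("load_timestamp").desc)
>> 
>>     val dimension = combined
>>       .withColumn("version_number", row_number().over(byLatest))
>>       .filter(col("version_number") === 1)
>>       .drop("version_number")
>> 
>>     dimension.write.mode("overwrite")
>>       .option("sep", "\t").option("header", "true")
>>       .csv("s3a://customer.dimension.data/")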
>> 
>> I am running Spark on-premises.
>> 
>> - Can I query AWS S3 buckets by using Spark SQL / DataFrames or RDDs on a
>>   local Spark cluster?
>> - Is this approach efficient? Will the queries transfer all of the historical
>>   data from AWS S3 to the local cluster?
>> - How can I implement this scenario more effectively, e.g. by transferring only
>>   the daily data to AWS S3 and then running the queries on AWS? For instance,
>>   Athena can run queries on AWS, but it is just a query engine. As far as I
>>   know, I cannot call it via an SDK and cannot write the results to a
>>   bucket/folder.
>> 
>> Thanks in advance,
>> 
>> Ömer
>> 
