Why not directly access the S3 file from Spark?
You need to configure the IAM roles so that the machine running the S3 code is allowed to access the bucket.

> On 24.10.2018 at 06:40, Divya Gehlot <divya.htco...@gmail.com> wrote:
>
> Hi Omer,
> Here are a couple of solutions you can implement for your use case:
> Option 1:
> You can mount the S3 bucket as a local file system.
> Here are the details:
> https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
> Option 2:
> You can use AWS Glue for your use case.
> Here are the details:
> https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/
>
> Option 3:
> Store the file in the local file system and later push it to the S3 bucket.
> Here are the details:
> https://stackoverflow.com/questions/48067979/simplest-way-to-fetch-the-file-from-ftp-server-on-prem-put-into-s3-bucket
>
> Thanks,
> Divya
>
>> On Tue, 23 Oct 2018 at 15:53, <omer.ozsaka...@sony.com> wrote:
>> Hi guys,
>>
>> We are using Apache Spark on a local machine.
>>
>> I need to implement the scenario below.
>>
>> In the initial load:
>>
>> The CRM application will send a file to a folder. This file contains customer information for all customers. The file is in a folder on the local server.
>> The file name is: customer.tsv
>> customer.tsv contains customerid, country, birth_month, activation_date, etc.
>> I need to read the contents of customer.tsv.
>> I will add the current timestamp to the file.
>> I will transfer customer.tsv to the S3 bucket: customer.history.data
>>
>> In the daily loads:
>>
>> The CRM application will send a new file which contains the updated/deleted/inserted customer information.
>> The file name is: daily_customer.tsv
>> daily_customer.tsv contains customerid, cdc_field, country, birth_month, activation_date, etc.
>> The CDC field can be New-Customer, Customer-is-Updated, or Customer-is-Deleted.
>>
>> I need to read the contents of daily_customer.tsv.
>> I will add the current timestamp to the file.
>> I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
>> I need to merge the two buckets customer.history.data and customer.daily.data.
>> Both buckets have timestamp fields, so I need to query all records whose timestamp is the latest one.
>> I can use row_number() over (partition by customer_id order by timestamp_field desc) as version_number.
>> Then I can put the records whose version is one into the final bucket: customer.dimension.data
>>
>> I am running Spark on premises.
>>
>> Can I query AWS S3 buckets by using Spark SQL / DataFrames or RDDs on a local Spark cluster?
>> Is this approach efficient? Will the queries transfer all historical data from AWS S3 to the local cluster?
>> How can I implement this scenario in a more effective way? For example, by just transferring the daily data to AWS S3 and then running the queries on AWS.
>> For instance, Athena can query data on AWS, but it is just a query engine. As far as I know, I cannot call it by using an SDK and I cannot write the results to a bucket/folder.
>>
>> Thanks in advance,
>>
>> Ömer
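For what it's worth, here is a rough Scala sketch of how the whole flow could look with the DataFrame API and the s3a connector from an on-prem cluster. The bucket names and the row_number() dedup come from the thread; everything else (the local input paths, the credential setup via environment variables, the added load_ts column, and the "Existing-Customer" placeholder for history rows) is an illustrative assumption, and it presumes the hadoop-aws and AWS SDK jars are on the Spark classpath.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, current_timestamp, lit, row_number}

object CustomerMerge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("customer-scd-merge")
      .getOrCreate()

    // Let the on-prem cluster reach S3 through the s3a connector. Assumes the
    // AWS key pair is exported in the environment and grants access to the buckets.
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Initial load: read the local customer.tsv, stamp it with the load time,
    // and push it to the history bucket. "load_ts" is an illustrative column name.
    val initial = spark.read
      .option("sep", "\t").option("header", "true")
      .csv("file:///data/crm/customer.tsv")
      .withColumn("load_ts", current_timestamp())
    initial.write.mode("append")
      .option("sep", "\t").option("header", "true")
      .csv("s3a://customer.history.data/")

    // Daily load: same treatment for daily_customer.tsv into the daily bucket.
    val daily = spark.read
      .option("sep", "\t").option("header", "true")
      .csv("file:///data/crm/daily_customer.tsv")
      .withColumn("load_ts", current_timestamp())
    daily.write.mode("append")
      .option("sep", "\t").option("header", "true")
      .csv("s3a://customer.daily.data/")

    // Merge: union history and daily, then keep only the latest record per
    // customer with row_number() over (partition by customerid order by load_ts desc).
    val history = spark.read.option("sep", "\t").option("header", "true")
      .csv("s3a://customer.history.data/")
      .withColumn("cdc_field", lit("Existing-Customer")) // history rows carry no CDC flag
    val dailyAll = spark.read.option("sep", "\t").option("header", "true")
      .csv("s3a://customer.daily.data/")

    val w = Window.partitionBy("customerid")
      .orderBy(col("load_ts").cast("timestamp").desc)

    val latest = history.unionByName(dailyAll)
      .withColumn("version_number", row_number().over(w))
      .filter(col("version_number") === 1)
      .drop("version_number")

    latest.write.mode("overwrite")
      .option("sep", "\t").option("header", "true")
      .csv("s3a://customer.dimension.data/")

    spark.stop()
  }
}

One caveat on the efficiency question: the window function still runs on the local executors, so both buckets are read back over the network before the dedup happens. Storing the history/dimension data as partitioned Parquet rather than TSV would at least cut the bytes transferred, but it doesn't change where the computation runs.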