Re: Triggering SQL on AWS S3 via Apache Spark

2018-10-23 Thread Jörn Franke
Why not directly access the S3 file from Spark?


You need to configure the IAM roles so that the machine running the Spark code 
is allowed to access the bucket.
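
For what it's worth, a minimal sketch of that direct access from an on-prem job, 
assuming the hadoop-aws (s3a) module is on the classpath; the bucket/key, header 
option and credential settings below are placeholder assumptions, not the 
poster's actual setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("query-customer-data-on-s3")
  // If the machine cannot assume an IAM role, static s3a keys can be supplied instead:
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// Read the TSV in place on S3 -- no copy to the local file system needed.
val customers = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("s3a://customer.history.data/customer.tsv")   // hypothetical key in the bucket

customers.createOrReplaceTempView("customer_history")
spark.sql("SELECT country, count(*) FROM customer_history GROUP BY country").show()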

> On 24 Oct 2018, at 06:40, Divya Gehlot wrote:
> 
> Hi Omer,
> Here are a couple of solutions you can implement for your use case:
> Option 1:
> You can mount the S3 bucket as a local file system.
> Here are the details:
> https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
> Option 2:
> You can use AWS Glue for your use case.
> Here are the details:
> https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/
> 
> Option 3:
> Store the file in the local file system and later push it to the S3 bucket.
> Here are the details:
> https://stackoverflow.com/questions/48067979/simplest-way-to-fetch-the-file-from-ftp-server-on-prem-put-into-s3-bucket
> 
> Thanks,
> Divya 
> 
>> On Tue, 23 Oct 2018 at 15:53,  wrote:
>> Hi guys,
>> 
>>  
>> 
>> We are using Apache Spark on a local machine.
>> 
>>  
>> 
>> I need to implement the scenario below.
>> 
>>  
>> 
>> In the initial load:
>> 
>> The CRM application will send a file to a folder on the local server. This 
>> file contains information for all customers. The file name is: customer.tsv
>> customer.tsv contains customerid, country, birth_month, activation_date, etc.
>> I need to read the contents of customer.tsv.
>> I will add current timestamp info to the file.
>> I will transfer customer.tsv to the S3 bucket: customer.history.data
>>  
>> 
>> In the daily loads:
>> 
>> The CRM application will send a new file which contains the 
>> updated/deleted/inserted customer information. The file name is 
>> daily_customer.tsv.
>> 
>> daily_customer.tsv contains customerid, cdc_field, country, 
>> birth_month, activation_date, etc.
>> The cdc_field can be New-Customer, Customer-is-Updated, or Customer-is-Deleted.
>> 
>> I need to read the contents of daily_customer.tsv.
>> I will add current timestamp info to the file.
>> I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
>> I need to merge the two buckets customer.history.data and customer.daily.data.
>> Both buckets have timestamp fields, so I need to query all records whose 
>> timestamp is the latest timestamp.
>> I can use row_number() over (partition by customer_id order by 
>> timestamp_field desc) as version_number.
>> Then I can put the records whose version_number is one into the final bucket: 
>> customer.dimension.data
>>  
>> 
>> I am running Spark on premise.
>> 
>> Can I query AWS S3 buckets using Spark SQL / DataFrames or RDDs on a 
>> local Spark cluster?
>> Is this approach efficient? Will the queries transfer all historical data 
>> from AWS S3 to the local cluster?
>> How can I implement this scenario in a more effective way? For example, just 
>> transferring the daily data to AWS S3 and then running the queries on AWS.
>> For instance, Athena can query data on AWS, but it is just a query engine. As 
>> far as I know, I cannot call it using an SDK and I cannot write the results to 
>> a bucket/folder.
>>  
>> 
>> Thanks in advance,
>> 
>> Ömer


Re: Triggering SQL on AWS S3 via Apache Spark

2018-10-23 Thread Divya Gehlot
Hi Omer,
Here are a couple of solutions you can implement for your use case:
Option 1:
You can mount the S3 bucket as a local file system.
Here are the details:
https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
Option 2:
You can use AWS Glue for your use case.
Here are the details:
https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/

Option 3:
Store the file in the local file system and later push it to the S3 bucket.
Here are the details:
https://stackoverflow.com/questions/48067979/simplest-way-to-fetch-the-file-from-ftp-server-on-prem-put-into-s3-bucket

Thanks,
Divya

On Tue, 23 Oct 2018 at 15:53,  wrote:

> Hi guys,
>
>
>
> We are using Apache Spark on a local machine.
>
>
>
> I need to implement the scenario below.
>
>
>
> In the initial load:
>
>    1. The CRM application will send a file to a folder on the local
>       server. This file contains information for all customers. The
>       file name is: customer.tsv
>       *  customer.tsv contains customerid, country, birth_month,
>          activation_date, etc.
>    2. I need to read the contents of customer.tsv.
>    3. I will add current timestamp info to the file.
>    4. I will transfer customer.tsv to the S3 bucket: customer.history.data
>
>
>
> In the daily loads:
>
>    1. The CRM application will send a new file which contains the
>       updated/deleted/inserted customer information. The file name is
>       daily_customer.tsv.
>       *  daily_customer.tsv contains customerid, cdc_field, country,
>          birth_month, activation_date, etc. The cdc_field can be
>          New-Customer, Customer-is-Updated, or Customer-is-Deleted.
>    2. I need to read the contents of daily_customer.tsv.
>    3. I will add current timestamp info to the file.
>    4. I will transfer daily_customer.tsv to the S3 bucket:
>       customer.daily.data
>    5. I need to merge the two buckets customer.history.data and
>       customer.daily.data.
>       *  Both buckets have timestamp fields, so I need to query all
>          records whose timestamp is the latest timestamp.
>       *  I can use row_number() over (partition by customer_id order by
>          timestamp_field desc) as version_number.
>       *  Then I can put the records whose version_number is one into the
>          final bucket: customer.dimension.data
>
>
>
> I am running Spark on premise.
>
>    - Can I query AWS S3 buckets using Spark SQL / DataFrames or RDDs
>      on a local Spark cluster?
>    - Is this approach efficient? Will the queries transfer all historical
>      data from AWS S3 to the local cluster?
>    - How can I implement this scenario in a more effective way? For example,
>      just transferring the daily data to AWS S3 and then running the queries
>      on AWS.
>      - For instance, Athena can query data on AWS, but it is just a query
>        engine. As far as I know, I cannot call it using an SDK and I cannot
>        write the results to a bucket/folder.
>
>
>
> Thanks in advance,
>
> Ömer


Re: ALS block settings

2018-10-23 Thread evanzamir
I have the same question. Trying to figure out how to get ALS to complete
with larger dataset. It seems to get stuck on "Count" from what I can tell.
I'm running 8 r4.4xlarge instances on Amazon EMR. The dataset is 80 GB (just
to give some idea of size). I assumed Spark could handle this, but maybe I
need to try some different settings like userBlock or itemBlock. Any help
appreciated!
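
In case it helps, a sketch of where those block settings live on the ML ALS
estimator in Scala (the column names, block counts and checkpoint directory
below are illustrative assumptions, not recommended values; `spark` is the
usual SparkSession available in spark-shell / on EMR):

import org.apache.spark.ml.recommendation.ALS

// A shorter checkpoint interval truncates the long iteration lineage; the directory is a placeholder.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/als-checkpoint")

val als = new ALS()
  .setUserCol("userId")            // assumed column names in the ratings DataFrame
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setRank(50)
  .setMaxIter(10)
  .setRegParam(0.05)
  .setNumUserBlocks(64)            // more, smaller in/out-link blocks per task
  .setNumItemBlocks(64)
  .setCheckpointInterval(2)
  .setIntermediateStorageLevel("MEMORY_AND_DISK")  // the default, shown because it matters at this scale

// val model = als.fit(ratings)    // ratings: DataFrame with the three columns above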



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Re: [Spark UI] Spark 2.3.1 UI no longer respects spark.ui.retainedJobs

2018-10-23 Thread Patrick Brown
I believe I may be able to reproduce this now; it seems like it may be
something to do with many jobs at once:

Spark 2.3.1

> spark-shell --conf spark.ui.retainedJobs=1

scala> import scala.concurrent._
scala> import scala.concurrent.ExecutionContext.Implicits.global
scala> for (i <- 0 until 5) { Future { println(sc.parallelize(0 until i).collect.length) } }

On Mon, Oct 22, 2018 at 11:25 AM Marcelo Vanzin  wrote:

> Just tried on 2.3.2 and worked fine for me. UI had a single job and a
> single stage (+ the tasks related to that single stage), same thing in
> memory (checked with jvisualvm).
>
> On Sat, Oct 20, 2018 at 6:45 PM Marcelo Vanzin 
> wrote:
> >
> > On Tue, Oct 16, 2018 at 9:34 AM Patrick Brown
> >  wrote:
> > > I recently upgraded to Spark 2.3.1. I have had these same settings in
> > > my spark-submit script, which worked on 2.0.2 and, according to the
> > > documentation, appear not to have changed:
> > >
> > > spark.ui.retainedTasks=1
> > > spark.ui.retainedStages=1
> > > spark.ui.retainedJobs=1
> >
> > I tried that locally on the current master and it seems to be working.
> > I don't have 2.3 easily in front of me right now, but will take a look
> > Monday.
> >
> > --
> > Marcelo
>
>
>
> --
> Marcelo
>


Triggering SQL on AWS S3 via Apache Spark

2018-10-23 Thread Omer.Ozsakarya
Hi guys,

We are using Apache Spark on a local machine.

I need to implement the scenario below.

In the initial load:

  1.  The CRM application will send a file to a folder on the local server. 
This file contains information for all customers. The file name is: 
customer.tsv
     *   customer.tsv contains customerid, country, birth_month, 
activation_date, etc.
  2.  I need to read the contents of customer.tsv.
  3.  I will add current timestamp info to the file.
  4.  I will transfer customer.tsv to the S3 bucket: customer.history.data 
(a sketch of steps 2-4 follows this list).
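
A sketch of steps 2-4 above, assuming an s3a-enabled SparkSession such as the
one sketched in the reply at the top of this thread; the local path, header
option, timestamp column name and write layout are placeholder assumptions:

import org.apache.spark.sql.functions.current_timestamp

// Read the initial extract from the local server.
val customers = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("file:///data/crm/customer.tsv")               // hypothetical local path

// Add the current load timestamp as a column.
val stamped = customers.withColumn("load_timestamp", current_timestamp())

// Push the stamped data to the history bucket.
stamped.write
  .mode("append")
  .parquet("s3a://customer.history.data/customers/")  // hypothetical layout inside the bucket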

In the daily loads:

  1.  The CRM application will send a new file which contains the 
updated/deleted/inserted customer information. The file name is 
daily_customer.tsv.
     *   daily_customer.tsv contains customerid, cdc_field, country, 
birth_month, activation_date, etc.
     *   The cdc_field can be New-Customer, Customer-is-Updated, or 
Customer-is-Deleted.

  2.  I need to read the contents of daily_customer.tsv.
  3.  I will add current timestamp info to the file.
  4.  I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
  5.  I need to merge the two buckets customer.history.data and 
customer.daily.data.
     *   Both buckets have timestamp fields, so I need to query all records 
whose timestamp is the latest timestamp.
     *   I can use row_number() over (partition by customer_id order by 
timestamp_field desc) as version_number (see the sketch after this list).
     *   Then I can put the records whose version_number is one into the final 
bucket: customer.dimension.data
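
A sketch of that merge in the DataFrame API, assuming both buckets were written 
with the layout and load_timestamp column from the earlier sketch, Spark 2.3+ 
for unionByName, and column names matching the schema described above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

val history = spark.read.parquet("s3a://customer.history.data/customers/")
val daily   = spark.read.parquet("s3a://customer.daily.data/customers/")

// The history data has no cdc_field, so add a null one before the union
// so that the two schemas line up.
val historyAligned = history.withColumn("cdc_field", lit(null).cast("string"))
val merged = historyAligned.unionByName(daily)

// Keep only the latest record per customer, as described above.
val w = Window.partitionBy("customerid").orderBy(col("load_timestamp").desc)

val latest = merged
  .withColumn("version_number", row_number().over(w))
  .filter(col("version_number") === 1)
  .drop("version_number")

// Rows whose latest version has cdc_field = "Customer-is-Deleted" could be
// filtered out here, depending on the desired dimension semantics.

latest.write
  .mode("overwrite")
  .parquet("s3a://customer.dimension.data/customers/")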

I am running Spark on premise.

  *   Can I query AWS S3 buckets using Spark SQL / DataFrames or RDDs on a 
local Spark cluster?
  *   Is this approach efficient? Will the queries transfer all historical data 
from AWS S3 to the local cluster?
  *   How can I implement this scenario in a more effective way? For example, 
just transferring the daily data to AWS S3 and then running the queries on AWS.
     *   For instance, Athena can query data on AWS, but it is just a query 
engine. As far as I know, I cannot call it using an SDK and I cannot write the 
results to a bucket/folder.

Thanks in advance,
Ömer