Re: help/suggestions to setup spark cluster

2017-04-26 Thread anna stax
Thanks Cody. As I already mentioned, I am running Spark Streaming on an EC2 cluster in standalone mode. Now, in addition to streaming, I want to be able to run hourly Spark batch jobs and ad hoc queries using Zeppelin. Can you please confirm that a standalone cluster is OK for this? Please provide me

Re: help/suggestions to setup spark cluster

2017-04-26 Thread Cody Koeninger
The standalone cluster manager is fine for production. Don't use YARN or Mesos unless you already have another need for it. On Wed, Apr 26, 2017 at 4:53 PM, anna stax wrote: > Hi Sam, > > Thank you for the reply. > > What do you mean by > I doubt people run spark in a.

Re: How to create SparkSession using SparkConf?

2017-04-26 Thread kant kodali
I am using Spark 2.1 BTW. On Wed, Apr 26, 2017 at 3:22 PM, kant kodali wrote: > Hi All, > > I am wondering how to create SparkSession using SparkConf object? Although > I can see that most of the key value pairs we set in SparkConf we can also > set in SparkSession or

10th Spark Summit 2017 at Moscone Center

2017-04-26 Thread Jules Damji
Fellow Spark users, The Spark Summit Program Committee requested that I share with this Spark user group a few sessions and events they have added this year: a hackathon, 1-day and 2-day training courses, and 3 new tracks: Technical Deep Dive, Streaming, and Machine Learning, and more… If you are planning to

How to create SparkSession using SparkConf?

2017-04-26 Thread kant kodali
Hi All, I am wondering how to create a SparkSession using a SparkConf object? Although I can see that most of the key-value pairs we set in SparkConf we can also set in SparkSession or SparkSession.Builder, I don't see sparkConf.setJars, which is required, right? Because we want the driver jar

Re: help/suggestions to setup spark cluster

2017-04-26 Thread anna stax
Hi Sam, Thank you for the reply. What do you mean by "I doubt people run spark in a. Single EC2 instance, certainly not in production I don't think"? What is wrong with having a data pipeline on EC2 that reads data from Kafka, processes it using Spark, and outputs to Cassandra? Please explain. Thanks

Re: help/suggestions to setup spark cluster

2017-04-26 Thread Sam Elamin
Hi Anna, There are a variety of options for launching Spark clusters. I doubt people run Spark on a single EC2 instance, certainly not in production. I don't have enough information on what you are trying to do, but if you are just trying to set things up from scratch then I think

help/suggestions to setup spark cluster

2017-04-26 Thread anna stax
I need to set up a Spark cluster for Spark Streaming, scheduled batch jobs, and ad hoc queries. Please give me some suggestions. Can this be done in standalone mode? Right now we have a Spark cluster in standalone mode on AWS EC2 running a Spark Streaming application. Can we run spark batch jobs

Re: weird error message

2017-04-26 Thread Jacek Laskowski
Hi, Good progress! Can you remove the metastore_db directory and start ./bin/pyspark over? I don't think starting from ~ is necessary. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
Have you read http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself ? On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric wrote: > The reason why I want to obtain this information, i.e. timestamp> tuples is to relate

Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
Sorry about that, hangouts on air broke in the first one :( On Wed, Apr 26, 2017 at 8:41 AM, Marco Mistroni wrote: > Uh i stayed online in the other link but nobody joinedWill follow > transcript > Kr > > On 26 Apr 2017 9:35 am, "Holden Karau"

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Dominik Safaric
The reason why I want to obtain this information, i.e. tuples, is to relate the consumption with the production rates using the __consumer_offsets Kafka internal topic. Interestingly, Spark’s KafkaConsumer implementation does not auto-commit the offsets upon

Re: weird error message

2017-04-26 Thread Afshin, Bardia
Kicking off the process from the ~ directory makes the message go away. I guess the metastore_db created is relative to the path where it’s executed. FIX: kick off from the ~ directory: ./spark-2.1.0-bin-hadoop2.7/bin/pyspark From: "Afshin, Bardia" Date: Wednesday, April 26,

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
What is it you're actually trying to accomplish? You can get topic, partition, and offset bounds from an offset range like http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets Timestamp isn't really a meaningful idea for a range of offsets. On Tue, Apr
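The bounds the linked docs describe are per-partition offset ranges: topic, partition, fromOffset, untilOffset. A plain-Python illustration of that shape (not the actual API, which the 0-10 integration exposes in Scala/Java; all names and numbers here are invented) also makes Cody's point concrete: a range spans many records, so a single timestamp for it isn't meaningful.

```python
# Illustrative stand-in for the per-partition offset bounds described in
# the streaming-kafka-0-10 integration guide. Each range carries topic,
# partition, and half-open [from_offset, until_offset) bounds -- note
# there is no timestamp field, since one range covers many records.
from dataclasses import dataclass


@dataclass
class OffsetRange:
    topic: str
    partition: int
    from_offset: int
    until_offset: int

    def count(self) -> int:
        # number of records covered by the half-open range
        return self.until_offset - self.from_offset


ranges = [
    OffsetRange("events", 0, 100, 150),
    OffsetRange("events", 1, 90, 120),
]
batch_size = sum(r.count() for r in ranges)  # records in this batch across partitions
```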

Calculate mode separately for multiple columns in row

2017-04-26 Thread Everett Anderson
Hi, One common situation I run across is that I want to compact my data and select the mode (most frequent value) in several columns for each group. Even calculating mode for one column in SQL is a bit tricky. The ways I've seen usually involve a nested sub-select with a group by + count and
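The "nested sub-select with a group by + count" pattern mentioned above can be sketched in SQL. This uses SQLite (3.25+ for window functions) purely to stand in for Spark SQL, and the table and column names are invented for illustration: count each (group, value) pair, rank the counts within each group, and keep the top row.

```python
# Sketch of per-group mode via nested sub-selects: innermost query counts
# each (grp, val) pair, the middle query ranks counts within each group,
# and the outer query keeps only the most frequent value per group.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("a", "x"), ("a", "x"), ("a", "y"),
                  ("b", "y"), ("b", "z"), ("b", "z")])

mode_sql = """
SELECT grp, val FROM (
    SELECT grp, val,
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY cnt DESC) AS rn
    FROM (SELECT grp, val, COUNT(*) AS cnt FROM t GROUP BY grp, val)
) WHERE rn = 1
"""
modes = dict(conn.execute(mode_sql).fetchall())
```

Doing this for several columns at once means repeating the ranking once per column, which is exactly why the pattern gets unwieldy for compaction jobs.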

Re: weird error message

2017-04-26 Thread Afshin, Bardia
Thanks for the hint, but I don’t think that’s it. I thought it was a permission issue, that it cannot read or write to ~/metastore_db, but the directory is definitely there: drwxrwx--- 5 ubuntu ubuntu 4.0K Apr 25 23:27 metastore_db Just re-ran the command from within the root Spark folder (./bin/pyspark) and the

Last chance: ApacheCon is just three weeks away

2017-04-26 Thread Rich Bowen
ApacheCon is just three weeks away, in Miami, Florida, May 15th - 18th. http://apachecon.com/ There's still time to register and attend. ApacheCon is the best place to find out about tomorrow's software, today. ApacheCon is the official convention of The Apache Software Foundation, and includes

Re: Spark Testing Library Discussion

2017-04-26 Thread Marco Mistroni
Uh, I stayed online in the other link but nobody joined. Will follow the transcript. Kr On 26 Apr 2017 9:35 am, "Holden Karau" wrote: > And the recording of our discussion is at https://www.youtube.com/ > watch?v=2q0uAldCQ8M > A few of us have follow up things and we will try

Re: Create dataframe from RDBMS table using JDBC

2017-04-26 Thread Subhash Sriram
Hi Devender, I have always gone with the 2nd approach, only so I don't have to chain a bunch of "option()." calls together. You should be able to use either. Thanks, Subhash Sent from my iPhone > On Apr 26, 2017, at 3:26 AM, Devender Yadav > wrote: > > Hi All,

Re: Spark declines Mesos offers

2017-04-26 Thread Pavel Plotnikov
Michael Gummelt, Thanks!!! I forgot about debug logging! On Mon, Apr 24, 2017 at 9:30 PM Michael Gummelt wrote: > Have you run with debug logging? There are some hints in the debug logs: >

Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
And the recording of our discussion is at https://www.youtube.com/watch?v=2q0uAldCQ8M A few of us have follow up things and we will try and do another meeting in about a month or two :) On Tue, Apr 25, 2017 at 1:04 PM, Holden Karau wrote: > Urgh hangouts did something

Re: Spark-SQL Query Optimization: overlapping ranges

2017-04-26 Thread Jacek Laskowski
Run explain and you'll know what happens under the covers, i.e. use explain on the Dataset. Jacek On 25 Apr 2017 12:46 a.m., "Lavelle, Shawn" wrote: > Hello Spark Users! > >Does the Spark Optimization engine reduce overlapping column ranges? > If so, should it push

Re: weird error message

2017-04-26 Thread Jacek Laskowski
Hi, You've got two Spark sessions up and running (and given Spark SQL uses a Derby-managed Hive metastore, hence the issue). Please don't start spark-submit from inside bin. Rather, bin/spark-submit... Jacek On 26 Apr 2017 1:57 a.m., "Afshin, Bardia" wrote: I’m

Create dataframe from RDBMS table using JDBC

2017-04-26 Thread Devender Yadav
Hi All, I am using Spark 1.6.2. Which is the suitable way to create a dataframe from an RDBMS table? DataFrame df = sqlContext.read().format("jdbc").options(options).load(); or DataFrame df = sqlContext.read().jdbc(url, table, properties); Regards, Devender

WrappedArray to row of relational Db

2017-04-26 Thread vaibhavrtk
I have a nested structure which I read from an XML using spark-xml. I want to use Spark SQL to convert this nested structure into different relational tables (WrappedArray([WrappedArray([[null,592006340,null],null,BA,M,1724]),N,2017-04-05T16:31:03,586257528),659925562) which has a schema:
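The reshaping being asked for, one flat row per nested array element with the parent key carried along, is what explode() expresses in Spark SQL. A generic plain-Python sketch of that transformation (the field names and values below are simplified stand-ins, not the actual schema from the post):

```python
# Sketch: flatten one WrappedArray-style nested record (a parent key plus
# a list of child tuples) into flat relational rows, one row per child,
# mirroring what explode() does in Spark SQL.
def flatten(parent_id, children):
    # each nested child becomes one output row carrying the parent key
    return [(parent_id,) + tuple(child) for child in children]


rows = flatten(659925562,
               [(592006340, "BA", "M", 1724),
                (586257528, "N", None, 1703)])
```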