Re: help/suggestions to setup spark cluster

2017-04-26 Thread anna stax
Thanks Cody,

As I already mentioned, I am running Spark Streaming on an EC2 cluster in
standalone mode. Now, in addition to streaming, I want to be able to run
Spark batch jobs hourly and ad hoc queries using Zeppelin.

Can you please confirm that a standalone cluster is OK for this? Please
provide me some links to help me get started.

Thanks
-Anna

On Wed, Apr 26, 2017 at 7:46 PM, Cody Koeninger  wrote:

> The standalone cluster manager is fine for production.  Don't use Yarn
> or Mesos unless you already have another need for it.
>
> On Wed, Apr 26, 2017 at 4:53 PM, anna stax  wrote:
> > Hi Sam,
> >
> > Thank you for the reply.
> >
> > What do you mean by
> > I doubt people run spark in a. Single EC2 instance, certainly not in
> > production I don't think
> >
> > What is wrong in having a data pipeline on EC2 that reads data from
> kafka,
> > processes using spark and outputs to cassandra? Please explain.
> >
> > Thanks
> > -Anna
> >
> > On Wed, Apr 26, 2017 at 2:22 PM, Sam Elamin 
> wrote:
> >>
> >> Hi Anna
> >>
> >> There are a variety of options for launching spark clusters. I doubt
> >> people run spark in a. Single EC2 instance, certainly not in production
> I
> >> don't think
> >>
> >> I don't have enough information of what you are trying to do but if you
> >> are just trying to set things up from scratch then I think you can just
> use
> >> EMR which will create a cluster for you and attach a zeppelin instance
> as
> >> well
> >>
> >>
> >> You can also use databricks for ease of use and very little management
> but
> >> you will pay a premium for that abstraction
> >>
> >>
> >> Regards
> >> Sam
> >> On Wed, 26 Apr 2017 at 22:02, anna stax  wrote:
> >>>
> >>> I need to setup a spark cluster for Spark streaming and scheduled batch
> >>> jobs and adhoc queries.
> >>> Please give me some suggestions. Can this be done in standalone mode.
> >>>
> >>> Right now we have a spark cluster in standalone mode on AWS EC2 running
> >>> spark streaming application. Can we run spark batch jobs and zeppelin
> on the
> >>> same. Do we need a better resource manager like Mesos?
> >>>
> >>> Are there any companies or individuals that can help in setting this
> up?
> >>>
> >>> Thank you.
> >>>
> >>> -Anna
> >
> >
>
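
A minimal sketch of the kind of configuration that lets several applications share one standalone cluster (the master URL and numbers below are placeholders, not values from this thread): a long-running streaming app, hourly batch jobs and a Zeppelin interpreter can coexist if each application caps its core usage, since a standalone application otherwise grabs all available cores by default.

import org.apache.spark.sql.SparkSession

// Hypothetical hourly batch job submitted to the same standalone master as the
// streaming application; spark.cores.max keeps it from starving the other apps.
val spark = SparkSession.builder()
  .appName("hourly-batch")
  .master("spark://master-host:7077")    // placeholder standalone master URL
  .config("spark.cores.max", "4")        // cap cores so streaming/Zeppelin keep theirs
  .config("spark.executor.memory", "2g")
  .getOrCreate()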


Re: help/suggestions to setup spark cluster

2017-04-26 Thread Cody Koeninger
The standalone cluster manager is fine for production.  Don't use Yarn
or Mesos unless you already have another need for it.

On Wed, Apr 26, 2017 at 4:53 PM, anna stax  wrote:
> Hi Sam,
>
> Thank you for the reply.
>
> What do you mean by
> I doubt people run spark in a. Single EC2 instance, certainly not in
> production I don't think
>
> What is wrong in having a data pipeline on EC2 that reads data from kafka,
> processes using spark and outputs to cassandra? Please explain.
>
> Thanks
> -Anna
>
> On Wed, Apr 26, 2017 at 2:22 PM, Sam Elamin  wrote:
>>
>> Hi Anna
>>
>> There are a variety of options for launching spark clusters. I doubt
>> people run spark in a. Single EC2 instance, certainly not in production I
>> don't think
>>
>> I don't have enough information of what you are trying to do but if you
>> are just trying to set things up from scratch then I think you can just use
>> EMR which will create a cluster for you and attach a zeppelin instance as
>> well
>>
>>
>> You can also use databricks for ease of use and very little management but
>> you will pay a premium for that abstraction
>>
>>
>> Regards
>> Sam
>> On Wed, 26 Apr 2017 at 22:02, anna stax  wrote:
>>>
>>> I need to setup a spark cluster for Spark streaming and scheduled batch
>>> jobs and adhoc queries.
>>> Please give me some suggestions. Can this be done in standalone mode.
>>>
>>> Right now we have a spark cluster in standalone mode on AWS EC2 running
>>> spark streaming application. Can we run spark batch jobs and zeppelin on the
>>> same. Do we need a better resource manager like Mesos?
>>>
>>> Are there any companies or individuals that can help in setting this up?
>>>
>>> Thank you.
>>>
>>> -Anna
>
>




Re: How to create SparkSession using SparkConf?

2017-04-26 Thread kant kodali
I am using Spark 2.1 BTW.

On Wed, Apr 26, 2017 at 3:22 PM, kant kodali  wrote:

> Hi All,
>
> I am wondering how to create a SparkSession using a SparkConf object. Although
> I can see that most of the key-value pairs we set in SparkConf can also be
> set in SparkSession or SparkSession.Builder, I don't see an equivalent of
> sparkConf.setJars, which is required, right? Because we want the driver jar
> to be distributed across the cluster whether we run in client mode or
> cluster mode. So I am wondering how this is possible?
>
> Thanks!
>
>
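
For what it's worth, a minimal sketch of one way this appears to work on Spark 2.1: SparkSession.Builder accepts a whole SparkConf via config(conf), and setJars surfaces as the spark.jars property, so nothing set on the conf is lost (the jar path below is a placeholder).

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Build the usual SparkConf, including setJars for the application jars.
val conf = new SparkConf()
  .setAppName("session-from-conf")
  .setJars(Seq("/path/to/app.jar"))      // hypothetical jar path
  .set("spark.executor.memory", "2g")

// Pass the whole conf to the builder; setJars ends up as "spark.jars",
// which could also be set directly with .config("spark.jars", "...").
val spark = SparkSession.builder()
  .config(conf)
  .getOrCreate()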


10th Spark Summit 2017 at Moscone Center

2017-04-26 Thread Jules Damji
Fellow Spark users,

The Spark Summit Program Committee requested that I share with this Spark user
group a few sessions and events they have added this year:
Hackathon
1-day and 2-day training courses
3 new tracks: Technical Deep Dive, Streaming and Machine Learning
and more… 

If you are planning to attend, use the special discount code (15%) for this user
mailing list: ASG2017 (valid through May 4).

Register at https://spark-summit.org/2017/

Hope to see and meet some members of this group presenting and attending there!

Cheers,
Jules

--
The Best Ideas Are Simple
Jules S. Damji
e-mail:dmat...@comcast.net
e-mail:jules.da...@gmail.com
twitter:@2twitme



How to create SparkSession using SparkConf?

2017-04-26 Thread kant kodali
Hi All,

I am wondering how to create a SparkSession using a SparkConf object. Although
I can see that most of the key-value pairs we set in SparkConf can also be
set in SparkSession or SparkSession.Builder, I don't see an equivalent of
sparkConf.setJars, which is required, right? Because we want the driver jar
to be distributed across the cluster whether we run in client mode or
cluster mode. So I am wondering how this is possible?

Thanks!


Re: help/suggestions to setup spark cluster

2017-04-26 Thread anna stax
Hi Sam,

Thank you for the reply.

What do you mean by
I doubt people run spark in a. Single EC2 instance, certainly not in
production I don't think

What is wrong in having a data pipeline on EC2 that reads data from kafka,
processes using spark and outputs to cassandra? Please explain.

Thanks
-Anna

On Wed, Apr 26, 2017 at 2:22 PM, Sam Elamin  wrote:

> Hi Anna
>
> There are a variety of options for launching spark clusters. I doubt
> people run spark in a. Single EC2 instance, certainly not in production I
> don't think
>
> I don't have enough information of what you are trying to do but if you
> are just trying to set things up from scratch then I think you can just use
> EMR which will create a cluster for you and attach a zeppelin instance as
> well
>
>
> You can also use databricks for ease of use and very little management but
> you will pay a premium for that abstraction
>
>
> Regards
> Sam
> On Wed, 26 Apr 2017 at 22:02, anna stax  wrote:
>
>> I need to setup a spark cluster for Spark streaming and scheduled batch
>> jobs and adhoc queries.
>> Please give me some suggestions. Can this be done in standalone mode.
>>
>> Right now we have a spark cluster in standalone mode on AWS EC2 running
>> spark streaming application. Can we run spark batch jobs and zeppelin on
>> the same. Do we need a better resource manager like Mesos?
>>
>> Are there any companies or individuals that can help in setting this up?
>>
>> Thank you.
>>
>> -Anna
>>
>


Re: help/suggestions to setup spark cluster

2017-04-26 Thread Sam Elamin
Hi Anna

There are a variety of options for launching spark clusters. I doubt people
run spark in a. Single EC2 instance, certainly not in production I don't
think

I don't have enough information of what you are trying to do but if you are
just trying to set things up from scratch then I think you can just use EMR
which will create a cluster for you and attach a zeppelin instance as well


You can also use databricks for ease of use and very little management but
you will pay a premium for that abstraction


Regards
Sam
On Wed, 26 Apr 2017 at 22:02, anna stax  wrote:

> I need to setup a spark cluster for Spark streaming and scheduled batch
> jobs and adhoc queries.
> Please give me some suggestions. Can this be done in standalone mode.
>
> Right now we have a spark cluster in standalone mode on AWS EC2 running
> spark streaming application. Can we run spark batch jobs and zeppelin on
> the same. Do we need a better resource manager like Mesos?
>
> Are there any companies or individuals that can help in setting this up?
>
> Thank you.
>
> -Anna
>


help/suggestions to setup spark cluster

2017-04-26 Thread anna stax
I need to setup a spark cluster for Spark streaming and scheduled batch
jobs and adhoc queries.
Please give me some suggestions. Can this be done in standalone mode.

Right now we have a spark cluster in standalone mode on AWS EC2 running
spark streaming application. Can we run spark batch jobs and zeppelin on
the same. Do we need a better resource manager like Mesos?

Are there any companies or individuals that can help in setting this up?

Thank you.
-Anna


Re: weird error message

2017-04-26 Thread Jacek Laskowski
Hi,

Good progress!

Can you remove the metastore_db directory and start ./bin/pyspark over? I
don't think starting from ~ is necessary.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Apr 26, 2017 at 8:10 PM, Afshin, Bardia
 wrote:
> Kicking off the process from the ~ directory makes the message go away. I guess
> the metastore_db is created relative to the path from where it's executed.
>
> FIX: kick off from ~ directory
>
> ./spark-2.1.0-bin-hadoop2.7/bin/pyspark
>
>
>
> From: "Afshin, Bardia" 
> Date: Wednesday, April 26, 2017 at 9:47 AM
> To: Jacek Laskowski 
> Cc: "user@spark.apache.org" 
>
>
> Subject: Re: weird error message
>
>
>
> Thanks for the hint, I don’t think. I thought it’s a permission issue that
> it cannot read or write to ~/metastore_db but the directory is definitely
> there
>
>
>
> drwxrwx---  5 ubuntu ubuntu 4.0K Apr 25 23:27 metastore_db
>
>
>
>
>
> Just re ran the command from within root spark folder ./bin/pyspark and the
> same issue.
>
>
>
> Caused by: ERROR XBM0H: Directory
> /home/ubuntu/spark-2.1.0-bin-hadoop2.7/metastore_db cannot be created.
>
> at
> org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
>
> at
> org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
>
> at
> org.apache.derby.impl.services.monitor.StorageFactoryService$10.run(Unknown
> Source)
>
> at java.security.AccessController.doPrivileged(Native
> Method)
>
> at
> org.apache.derby.impl.services.monitor.StorageFactoryService.createServiceRoot(Unknown
> Source)
>
> at
> org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown
> Source)
>
> at
> org.apache.derby.impl.services.monitor.BaseMonitor.createPersistentService(Unknown
> Source)
>
> at
> org.apache.derby.impl.services.monitor.FileMonitor.createPersistentService(Unknown
> Source)
>
> at
> org.apache.derby.iapi.services.monitor.Monitor.createPersistentService(Unknown
> Source)
>
> at org.apache.derby.impl.jdbc.EmbedConnection$5.run(Unknown
> Source)
>
> at java.security.AccessController.doPrivileged(Native
> Method)
>
> at
> org.apache.derby.impl.jdbc.EmbedConnection.createPersistentService(Unknown
> Source)
>
> ... 105 more
>
> Traceback (most recent call last):
>
>   File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py",
> line 43, in <module>
>
> spark = SparkSession.builder\
>
>   File
> "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/session.py", line
> 179, in getOrCreate
>
> session._jsparkSession.sessionState().conf().setConfString(key, value)
>
>   File
> "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
> line 1133, in __call__
>
>   File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py",
> line 79, in deco
>
> raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
>
> pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating
> 'org.apache.spark.sql.hive.HiveSessionState':"
>

>
> ubuntu@:~/spark-2.1.0-bin-hadoop2.7$ ps aux | grep spark
>
> ubuntu 2796  0.0  0.0  10460   932 pts/0S+   16:44   0:00 grep
> --color=auto spark
>
>
>
> From: Jacek Laskowski 
> Date: Wednesday, April 26, 2017 at 12:51 AM
> To: "Afshin, Bardia" 
> Cc: user 
> Subject: Re: weird error message
>
>
>
> Hi,
>
>
>
> You've got two spark sessions up and running (and given Spark SQL uses
> Derby-managed Hive metastore, hence the issue)
>
>
>
> Please don't start spark-submit from inside bin. Rather bin/spark-submit...
>
>
>
> Jacek
>
>
>
>
>
> On 26 Apr 2017 1:57 a.m., "Afshin, Bardia" 
> wrote:
>
> I’m having issues when I fire up pyspark on a fresh install.
>
> When I submit the same process via spark-submit it works.
>
>
>
> Here’s a dump of the trace:
>
> at
> org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:497)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>
> at py4j.Gateway.invoke(Gateway.java:280)
>
> at
> 

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
have you read

http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself

On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric
 wrote:
> The reason why I want to obtain this information, i.e.  timestamp> tuples is to relate the consumption with the production rates 
> using the __consumer_offsets Kafka internal topic. Interestingly, Spark's 
> KafkaConsumer implementation does not auto commit the offsets upon offset 
> commit expiration, because as seen in the logs, Spark overrides the 
> enable.auto.commit property to false.
>
> Any idea onto how to use the KafkaConsumer’s auto offset commits? Keep in 
> mind that I do not care about exactly-once, hence having messages replayed is 
> perfectly fine.
>
>> On 26 Apr 2017, at 19:26, Cody Koeninger  wrote:
>>
>> What is it you're actually trying to accomplish?
>>
>> You can get topic, partition, and offset bounds from an offset range like
>>
>> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets
>>
>> Timestamp isn't really a meaningful idea for a range of offsets.
>>
>>
>> On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric
>>  wrote:
>>> Hi all,
>>>
>>> Because the Spark Streaming direct Kafka consumer maps offsets for a given
>>> Kafka topic and a partition internally while having enable.auto.commit set
>>> to false, how can I retrieve the offset of each made consumer’s poll call
>>> using the offset ranges of an RDD? More precisely, the information I seek to
>>> get after each poll call is the following: .
>>>
>>> Thanks in advance,
>>> Dominik
>>>
>
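
A minimal sketch of the pattern those two documentation sections describe for the 0-10 direct stream (assuming `stream` is the DStream returned by KafkaUtils.createDirectStream): the driver can read the per-partition offset ranges of each batch and, if desired, commit them back to Kafka asynchronously.

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, OffsetRange}

def trackOffsets(stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
  stream.foreachRDD { rdd =>
    // One OffsetRange per Kafka topic-partition consumed in this batch.
    val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    ranges.foreach { r =>
      println(s"topic=${r.topic} partition=${r.partition} " +
        s"from=${r.fromOffset} until=${r.untilOffset}")
    }
    // ... process rdd here ...
    // Optionally commit the consumed offsets back to Kafka (__consumer_offsets).
    stream.asInstanceOf[CanCommitOffsets].commitAsync(ranges)
  }
}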




Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
Sorry about that, hangouts on air broke in the first one :(

On Wed, Apr 26, 2017 at 8:41 AM, Marco Mistroni  wrote:

> Uh, I stayed online in the other link but nobody joined. Will follow the
> transcript
> Kr
>
> On 26 Apr 2017 9:35 am, "Holden Karau"  wrote:
>
>> And the recording of our discussion is at https://www.youtube.com/wat
>> ch?v=2q0uAldCQ8M
>> A few of us have follow up things and we will try and do another meeting
>> in about a month or two :)
>>
>> On Tue, Apr 25, 2017 at 1:04 PM, Holden Karau 
>> wrote:
>>
>>> Urgh hangouts did something frustrating, updated link
>>> https://hangouts.google.com/hangouts/_/ha6kusycp5fvzei2trhay4uhhqe
>>>
>>> On Mon, Apr 24, 2017 at 12:13 AM, Holden Karau 
>>> wrote:
>>>
 The (tentative) link for those interested is
 https://hangouts.google.com/hangouts/_/oyjvcnffejcjhi6qazf3lysypue .

 On Mon, Apr 24, 2017 at 12:02 AM, Holden Karau 
 wrote:

> So 14 people have said they are available on Tuesday the 25th at 1PM
> pacific so we will do this meeting then ( https://doodle.com/poll/69y6
> yab4pyf7u8bn ).
>
> Since hangouts tends to work ok on the Linux distro I'm running my
> default is to host this as a "hangouts-on-air" unless there are 
> alternative
> ideas.
>
> I'll record the hangout and if it isn't terrible I'll post it for
> those who weren't able to make it (and for next time I'll include more
> European friendly time options - Doodle wouldn't let me update it once
> posted).
>
> On Fri, Apr 14, 2017 at 11:17 AM, Holden Karau 
> wrote:
>
>> Hi Spark Users (+ Some Spark Testing Devs on BCC),
>>
>> Awhile back on one of the many threads about testing in Spark there
>> was some interest in having a chat about the state of Spark testing and
>> what people want/need.
>>
>> So if you are interested in joining an online (with maybe an IRL
>> component if enough people are SF based) chat about Spark testing please
>> fill out this doodle - https://doodle.com/poll/69y6yab4pyf7u8bn
>>
>> I think reasonable topics of discussion could be:
>>
>> 1) What is the state of the different Spark testing libraries in the
>> different core (Scala, Python, R, Java) and extended languages (C#,
>> Javascript, etc.)?
>> 2) How do we make these more easily discovered by users?
>> 3) What are people looking for in their testing libraries that we are
>> missing? (can be functionality, documentation, etc.)
>> 4) Are there any examples of well tested open source Spark projects
>> and where are they?
>>
>> If you have other topics that's awesome.
>>
>> To clarify this about libraries and best practices for people testing
>> their Spark applications, and less about testing Spark's internals
>> (although as illustrated by some of the libraries there is some strong
>> overlap in what is required to make that work).
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>



 --
 Cell : 425-233-8271 <(425)%20233-8271>
 Twitter: https://twitter.com/holdenkarau

>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271 <(425)%20233-8271>
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Dominik Safaric
The reason why I want to obtain this information, i.e.  tuples is to relate the consumption with the production rates using 
the __consumer_offsets Kafka internal topic. Interestingly, Spark's 
KafkaConsumer implementation does not auto commit the offsets upon offset 
commit expiration, because as seen in the logs, Spark overrides the 
enable.auto.commit property to false. 

Any idea onto how to use the KafkaConsumer’s auto offset commits? Keep in mind 
that I do not care about exactly-once, hence having messages replayed is 
perfectly fine.   

> On 26 Apr 2017, at 19:26, Cody Koeninger  wrote:
> 
> What is it you're actually trying to accomplish?
> 
> You can get topic, partition, and offset bounds from an offset range like
> 
> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets
> 
> Timestamp isn't really a meaningful idea for a range of offsets.
> 
> 
> On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric
>  wrote:
>> Hi all,
>> 
>> Because the Spark Streaming direct Kafka consumer maps offsets for a given
>> Kafka topic and a partition internally while having enable.auto.commit set
>> to false, how can I retrieve the offset of each made consumer’s poll call
>> using the offset ranges of an RDD? More precisely, the information I seek to
>> get after each poll call is the following: .
>> 
>> Thanks in advance,
>> Dominik
>> 





Re: weird error message

2017-04-26 Thread Afshin, Bardia
Kicking off the process from the ~ directory makes the message go away. I guess
the metastore_db is created relative to the path from where it's executed.
FIX: kick off from ~ directory
./spark-2.1.0-bin-hadoop2.7/bin/pyspark

From: "Afshin, Bardia" 
Date: Wednesday, April 26, 2017 at 9:47 AM
To: Jacek Laskowski 
Cc: "user@spark.apache.org" 
Subject: Re: weird error message

Thanks for the hint, I don’t think. I thought it’s a permission issue that it 
cannot read or write to ~/metastore_db but the directory is definitely there

drwxrwx---  5 ubuntu ubuntu 4.0K Apr 25 23:27 metastore_db


Just re ran the command from within root spark folder ./bin/pyspark and the 
same issue.

Caused by: ERROR XBM0H: Directory 
/home/ubuntu/spark-2.1.0-bin-hadoop2.7/metastore_db cannot be created.
at 
org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at 
org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at 
org.apache.derby.impl.services.monitor.StorageFactoryService$10.run(Unknown 
Source)
at java.security.AccessController.doPrivileged(Native Method)
at 
org.apache.derby.impl.services.monitor.StorageFactoryService.createServiceRoot(Unknown
 Source)
at 
org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
at 
org.apache.derby.impl.services.monitor.BaseMonitor.createPersistentService(Unknown
 Source)
at 
org.apache.derby.impl.services.monitor.FileMonitor.createPersistentService(Unknown
 Source)
at 
org.apache.derby.iapi.services.monitor.Monitor.createPersistentService(Unknown 
Source)
at org.apache.derby.impl.jdbc.EmbedConnection$5.run(Unknown 
Source)
at java.security.AccessController.doPrivileged(Native Method)
at 
org.apache.derby.impl.jdbc.EmbedConnection.createPersistentService(Unknown 
Source)
... 105 more
Traceback (most recent call last):
  File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py", line 
43, in <module>
spark = SparkSession.builder\
  File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/session.py", 
line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
  File 
"/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1133, in __call__
  File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':"
>>>
ubuntu@:~/spark-2.1.0-bin-hadoop2.7$ ps aux | grep spark
ubuntu 2796  0.0  0.0  10460   932 pts/0S+   16:44   0:00 grep 
--color=auto spark

From: Jacek Laskowski 
Date: Wednesday, April 26, 2017 at 12:51 AM
To: "Afshin, Bardia" 
Cc: user 
Subject: Re: weird error message

Hi,

You've got two spark sessions up and running (and given Spark SQL uses 
Derby-managed Hive metastore, hence the issue)

Please don't start spark-submit from inside bin. Rather bin/spark-submit...

Jacek


On 26 Apr 2017 1:57 a.m., "Afshin, Bardia" 
> wrote:
I’m having issues when I fire up pyspark on a fresh install.
When I submit the same process via spark-submit it works.

Here’s a dump of the trace:
at 
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: Failed to create database 'metastore_db', see 
the next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
at 

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
What is it you're actually trying to accomplish?

You can get topic, partition, and offset bounds from an offset range like

http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets

Timestamp isn't really a meaningful idea for a range of offsets.


On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric
 wrote:
> Hi all,
>
> Because the Spark Streaming direct Kafka consumer maps offsets for a given
> Kafka topic and a partition internally while having enable.auto.commit set
> to false, how can I retrieve the offset of each made consumer’s poll call
> using the offset ranges of an RDD? More precisely, the information I seek to
> get after each poll call is the following: .
>
> Thanks in advance,
> Dominik
>




Calculate mode separately for multiple columns in row

2017-04-26 Thread Everett Anderson
Hi,

One common situation I run across is that I want to compact my data and
select the mode (most frequent value) in several columns for each group.

Even calculating mode for one column in SQL is a bit tricky. The ways I've
seen usually involve a nested sub-select with a group by + count and then a
window function using rank().

However, what if you want to calculate the mode for several columns,
producing a new row with the results? And let's say the set of columns is
only known at runtime.

In Spark SQL, I start going down a road of many self-joins. The more
efficient way leads me to either RDD[Row] or Dataset[Row] where I could do
a groupByKey + flatMapGroups, keeping state as I iterate over the Rows in
each group.

What's the best way?

Here's a contrived example:

val input = spark.sparkContext.parallelize(Seq(
("catosaur", "black", "claws"),
("catosaur", "orange", "scales"),
("catosaur", "black", "scales"),
("catosaur", "orange", "scales"),
("catosaur", "black", "spikes"),
("bearcopter", "gray", "claws"),
("bearcopter", "black", "fur"),
("bearcopter", "gray", "flight"),
("bearcopter", "gray", "flight")))
.toDF("creature", "color", "feature")

+----------+------+-------+
|creature  |color |feature|
+----------+------+-------+
|catosaur  |black |claws  |
|catosaur  |orange|scales |
|catosaur  |black |scales |
|catosaur  |orange|scales |
|catosaur  |black |spikes |
|bearcopter|gray  |claws  |
|bearcopter|black |fur    |
|bearcopter|gray  |flight |
|bearcopter|gray  |flight |
+----------+------+-------+

val expectedOutput = spark.sparkContext.parallelize(Seq(
("catosaur", "black", "scales"),
("bearcopter", "gray", "flight")))
.toDF("creature", "color", "feature")

+----------+-----+-------+
|creature  |color|feature|
+----------+-----+-------+
|catosaur  |black|scales |
|bearcopter|gray |flight |
+----------+-----+-------+
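
One shape this can take without dropping to RDDs, sketched here under the assumption of a single grouping column and value columns only known at runtime: compute the per-group mode of each value column with a count plus a row_number window, then join the per-column results back together (which is exactly the many-joins cost mentioned above, one join per value column).

import org.apache.spark.sql.{DataFrame, functions => F}
import org.apache.spark.sql.expressions.Window

def modePerColumn(df: DataFrame, keyCol: String, valueCols: Seq[String]): DataFrame =
  valueCols.map { c =>
    // Most frequent value of column c per key; ties broken by the value itself.
    val w = Window.partitionBy(F.col(keyCol)).orderBy(F.col("cnt").desc, F.col(c))
    df.groupBy(keyCol, c)
      .agg(F.count(F.lit(1)).as("cnt"))
      .withColumn("rn", F.row_number().over(w))
      .where(F.col("rn") === 1)
      .select(keyCol, c)
  }.reduce(_.join(_, keyCol))

// e.g. modePerColumn(input, "creature", Seq("color", "feature")) gives the expected output above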


Re: weird error message

2017-04-26 Thread Afshin, Bardia
Thanks for the hint, I don’t think. I thought it’s a permission issue that it 
cannot read or write to ~/metastore_db but the directory is definitely there

drwxrwx---  5 ubuntu ubuntu 4.0K Apr 25 23:27 metastore_db


Just re ran the command from within root spark folder ./bin/pyspark and the 
same issue.

Caused by: ERROR XBM0H: Directory 
/home/ubuntu/spark-2.1.0-bin-hadoop2.7/metastore_db cannot be created.
at 
org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at 
org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at 
org.apache.derby.impl.services.monitor.StorageFactoryService$10.run(Unknown 
Source)
at java.security.AccessController.doPrivileged(Native Method)
at 
org.apache.derby.impl.services.monitor.StorageFactoryService.createServiceRoot(Unknown
 Source)
at 
org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
at 
org.apache.derby.impl.services.monitor.BaseMonitor.createPersistentService(Unknown
 Source)
at 
org.apache.derby.impl.services.monitor.FileMonitor.createPersistentService(Unknown
 Source)
at 
org.apache.derby.iapi.services.monitor.Monitor.createPersistentService(Unknown 
Source)
at org.apache.derby.impl.jdbc.EmbedConnection$5.run(Unknown 
Source)
at java.security.AccessController.doPrivileged(Native Method)
at 
org.apache.derby.impl.jdbc.EmbedConnection.createPersistentService(Unknown 
Source)
... 105 more
Traceback (most recent call last):
  File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py", line 
43, in <module>
spark = SparkSession.builder\
  File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/session.py", 
line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
  File 
"/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1133, in __call__
  File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':"
>>>
ubuntu@:~/spark-2.1.0-bin-hadoop2.7$ ps aux | grep spark
ubuntu 2796  0.0  0.0  10460   932 pts/0S+   16:44   0:00 grep 
--color=auto spark

From: Jacek Laskowski 
Date: Wednesday, April 26, 2017 at 12:51 AM
To: "Afshin, Bardia" 
Cc: user 
Subject: Re: weird error message

Hi,

You've got two spark sessions up and running (and given Spark SQL uses 
Derby-managed Hive metastore, hence the issue)

Please don't start spark-submit from inside bin. Rather bin/spark-submit...

Jacek


On 26 Apr 2017 1:57 a.m., "Afshin, Bardia" 
> wrote:
I’m having issues when I fire up pyspark on a fresh install.
When I submit the same process via spark-submit it works.

Here’s a dump of the trace:
at 
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: Failed to create database 'metastore_db', see 
the next exception for details.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
at 
org.apache.derby.impl.jdbc.EmbedConnection.createDatabase(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at 
org.apache.derby.jdbc.InternalDriver.getNewEmbedConnection(Unknown Source)
at 

Last chance: ApacheCon is just three weeks away

2017-04-26 Thread Rich Bowen
ApacheCon is just three weeks away, in Miami, Florida, May 15th - 18th.
http://apachecon.com/

There's still time to register and attend. ApacheCon is the best place
to find out about tomorrow's software, today.

ApacheCon is the official convention of The Apache Software Foundation,
and includes the co-located events:
  * Apache: Big Data
  * Apache: IoT
  * TomcatCon
  * FlexJS Summit
  * Cloudstack Collaboration Conference
  * BarCampApache
  * ApacheCon Lightning Talks

And there are dozens of opportunities to meet your fellow Apache
enthusiasts, both from your project, and from the other 200+ projects at
the Apache Software Foundation.

Register here:
http://events.linuxfoundation.org/events/apachecon-north-america/attend/register-

More information here: http://apachecon.com/

Follow us and learn more about ApacheCon:
  * Twitter: @ApacheCon
  * Discussion mailing list:
https://lists.apache.org/list.html?apachecon-disc...@apache.org
  * Podcasts and speaker interviews: http://feathercast.apache.org/
  * IRC: #apachecon on the https://freenode.net/

We look forward to seeing you in Miami!

-- 
Rich Bowen - VP Conferences, The Apache Software Foundation
http://apachecon.com/
@apachecon





Re: Spark Testing Library Discussion

2017-04-26 Thread Marco Mistroni
Uh, I stayed online in the other link but nobody joined. Will follow the
transcript
Kr

On 26 Apr 2017 9:35 am, "Holden Karau"  wrote:

> And the recording of our discussion is at https://www.youtube.com/
> watch?v=2q0uAldCQ8M
> A few of us have follow up things and we will try and do another meeting
> in about a month or two :)
>
> On Tue, Apr 25, 2017 at 1:04 PM, Holden Karau 
> wrote:
>
>> Urgh hangouts did something frustrating, updated link
>> https://hangouts.google.com/hangouts/_/ha6kusycp5fvzei2trhay4uhhqe
>>
>> On Mon, Apr 24, 2017 at 12:13 AM, Holden Karau 
>> wrote:
>>
>>> The (tentative) link for those interested is https://hangouts.google.com
>>> /hangouts/_/oyjvcnffejcjhi6qazf3lysypue .
>>>
>>> On Mon, Apr 24, 2017 at 12:02 AM, Holden Karau 
>>> wrote:
>>>
 So 14 people have said they are available on Tuesday the 25th at 1PM
 pacific so we will do this meeting then ( https://doodle.com/poll/69y6
 yab4pyf7u8bn ).

 Since hangouts tends to work ok on the Linux distro I'm running my
 default is to host this as a "hangouts-on-air" unless there are alternative
 ideas.

 I'll record the hangout and if it isn't terrible I'll post it for those
 who weren't able to make it (and for next time I'll include more European
 friendly time options - Doodle wouldn't let me update it once posted).

 On Fri, Apr 14, 2017 at 11:17 AM, Holden Karau 
 wrote:

> Hi Spark Users (+ Some Spark Testing Devs on BCC),
>
> Awhile back on one of the many threads about testing in Spark there
> was some interest in having a chat about the state of Spark testing and
> what people want/need.
>
> So if you are interested in joining an online (with maybe an IRL
> component if enough people are SF based) chat about Spark testing please
> fill out this doodle - https://doodle.com/poll/69y6yab4pyf7u8bn
>
> I think reasonable topics of discussion could be:
>
> 1) What is the state of the different Spark testing libraries in the
> different core (Scala, Python, R, Java) and extended languages (C#,
> Javascript, etc.)?
> 2) How do we make these more easily discovered by users?
> 3) What are people looking for in their testing libraries that we are
> missing? (can be functionality, documentation, etc.)
> 4) Are there any examples of well tested open source Spark projects
> and where are they?
>
> If you have other topics that's awesome.
>
> To clarify this about libraries and best practices for people testing
> their Spark applications, and less about testing Spark's internals
> (although as illustrated by some of the libraries there is some strong
> overlap in what is required to make that work).
>
> Cheers,
>
> Holden :)
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>



 --
 Cell : 425-233-8271 <(425)%20233-8271>
 Twitter: https://twitter.com/holdenkarau

>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271 <(425)%20233-8271>
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>


Re: Create dataframe from RDBMS table using JDBC

2017-04-26 Thread Subhash Sriram
Hi Devender,

I have always gone with the 2nd approach, only so I don't have to chain a bunch 
of "option()." calls together. You should be able to use either.

Thanks,
Subhash

Sent from my iPhone

> On Apr 26, 2017, at 3:26 AM, Devender Yadav  
> wrote:
> 
> Hi All,
> 
> 
> I am using Spark 1.6.2
> 
> 
> Which is the suitable way to create a dataframe from an RDBMS table?
> 
> DataFrame df = 
> sqlContext.read().format("jdbc").options(options).load();
> 
> or 
> 
> DataFrame df = sqlContext.read().jdbc(url, table, properties);
> 
> 
> Regards,
> Devender
> 
> 
> 
> 
> 
> 
> 


Re: Spark declines mesos offers

2017-04-26 Thread Pavel Plotnikov
Michael Gummelt, Thanks!!! I forgot about debug logging!

On Mon, Apr 24, 2017 at 9:30 PM Michael Gummelt 
wrote:

> Have you run with debug logging?  There are some hints in the debug logs:
> https://github.com/apache/spark/blob/branch-2.1/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L316
>
> On Mon, Apr 24, 2017 at 4:53 AM, Pavel Plotnikov <
> pavel.plotni...@team.wrike.com> wrote:
>
>> Hi, everyone! I run Spark 2.1.0 jobs on top of a Mesos cluster in
>> coarse-grained mode with dynamic resource allocation. Sometimes the Spark
>> Mesos scheduler declines Mesos offers even though not all available
>> resources are used (I have fewer workers than the possible maximum), the
>> maximum threshold in the Spark configuration is not reached, and the queue
>> has a lot of pending tasks.
>>
>> Maybe I have a wrong Spark or Mesos configuration? Does anyone have the
>> same problem?
>>
>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>
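
For anyone else looking: a quick way to get those hints without editing log4j.properties is to raise the log level for the Mesos scheduler backend package at the start of the driver. A small sketch using the log4j 1.x API that ships with Spark 2.1; adjust the package name if it differs in your build.

import org.apache.log4j.{Level, Logger}

// Enable the DEBUG-level "declined offer" messages from MesosCoarseGrainedSchedulerBackend.
Logger.getLogger("org.apache.spark.scheduler.cluster.mesos").setLevel(Level.DEBUG)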


Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
And the recording of our discussion is at
https://www.youtube.com/watch?v=2q0uAldCQ8M
A few of us have follow up things and we will try and do another meeting in
about a month or two :)

On Tue, Apr 25, 2017 at 1:04 PM, Holden Karau  wrote:

> Urgh hangouts did something frustrating, updated link
> https://hangouts.google.com/hangouts/_/ha6kusycp5fvzei2trhay4uhhqe
>
> On Mon, Apr 24, 2017 at 12:13 AM, Holden Karau 
> wrote:
>
>> The (tentative) link for those interested is https://hangouts.google.com
>> /hangouts/_/oyjvcnffejcjhi6qazf3lysypue .
>>
>> On Mon, Apr 24, 2017 at 12:02 AM, Holden Karau 
>> wrote:
>>
>>> So 14 people have said they are available on Tuesday the 25th at 1PM
>>> pacific so we will do this meeting then ( https://doodle.com/poll/69y6
>>> yab4pyf7u8bn ).
>>>
>>> Since hangouts tends to work ok on the Linux distro I'm running my
>>> default is to host this as a "hangouts-on-air" unless there are alternative
>>> ideas.
>>>
>>> I'll record the hangout and if it isn't terrible I'll post it for those
>>> who weren't able to make it (and for next time I'll include more European
>>> friendly time options - Doodle wouldn't let me update it once posted).
>>>
>>> On Fri, Apr 14, 2017 at 11:17 AM, Holden Karau 
>>> wrote:
>>>
 Hi Spark Users (+ Some Spark Testing Devs on BCC),

 Awhile back on one of the many threads about testing in Spark there was
 some interest in having a chat about the state of Spark testing and what
 people want/need.

 So if you are interested in joining an online (with maybe an IRL
 component if enough people are SF based) chat about Spark testing please
 fill out this doodle - https://doodle.com/poll/69y6yab4pyf7u8bn

 I think reasonable topics of discussion could be:

 1) What is the state of the different Spark testing libraries in the
 different core (Scala, Python, R, Java) and extended languages (C#,
 Javascript, etc.)?
 2) How do we make these more easily discovered by users?
 3) What are people looking for in their testing libraries that we are
 missing? (can be functionality, documentation, etc.)
 4) Are there any examples of well tested open source Spark projects and
 where are they?

 If you have other topics that's awesome.

 To clarify this about libraries and best practices for people testing
 their Spark applications, and less about testing Spark's internals
 (although as illustrated by some of the libraries there is some strong
 overlap in what is required to make that work).

 Cheers,

 Holden :)

 --
 Cell : 425-233-8271 <(425)%20233-8271>
 Twitter: https://twitter.com/holdenkarau

>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271 <(425)%20233-8271>
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Spark-SQL Query Optimization: overlapping ranges

2017-04-26 Thread Jacek Laskowski
explain it and you'll know what happens under the covers.

i.e. Use explain on the Dataset.

Jacek

On 25 Apr 2017 12:46 a.m., "Lavelle, Shawn"  wrote:

> Hello Spark Users!
>
>Does the Spark Optimization engine reduce overlapping column ranges?
> If so, should it push this down to a Data Source?
>
>   Example,
>
> This:  Select * from table where col between 3 and 7 OR col between 5
> and 9
>
> Reduces to:  Select * from table where col between 3 and 9
>
>
>
>
>
>   Thanks for your insight!
>
>
> ~ Shawn M Lavelle
>
>
>
>
>
>
> Shawn Lavelle
> Software Development
>
> 4101 Arrowhead Drive
> Medina, Minnesota 55340-9457
> Phone: 763 551 0559 <(763)%20551-0559>
> Fax: 763 551 0750 <(763)%20551-0750>
> *Email:* shawn.lave...@osii.com
> *Website: **www.osii.com* 
>
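
As a concrete illustration of the suggestion above, a small sketch run in spark-shell (where `spark` is the SparkSession and `my_table` is a hypothetical registered view with a numeric column `col`):

val q = spark.sql(
  "SELECT * FROM my_table WHERE col BETWEEN 3 AND 7 OR col BETWEEN 5 AND 9")

// Prints the parsed, analyzed and optimized logical plans plus the physical plan,
// including whatever filter ends up being pushed down to the data source.
q.explain(true)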


Re: weird error message

2017-04-26 Thread Jacek Laskowski
Hi,

You've got two spark sessions up and running (and given Spark SQL uses
Derby-managed Hive metastore, hence the issue)

Please don't start spark-submit from inside bin. Rather bin/spark-submit...

Jacek


On 26 Apr 2017 1:57 a.m., "Afshin, Bardia" 
wrote:

I’m having issues when I fire up pyspark on a fresh install.

When I submit the same process via *spark-submit* it works.



Here’s a dump of the trace:

at org.apache.spark.sql.SparkSession.sessionState(
SparkSession.scala:109)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:62)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:497)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)

at py4j.reflection.ReflectionEngine.invoke(
ReflectionEngine.java:357)

at py4j.Gateway.invoke(Gateway.java:280)

at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.
java:132)

at py4j.commands.CallCommand.execute(CallCommand.java:79)

at py4j.GatewayConnection.run(GatewayConnection.java:214)

at java.lang.Thread.run(Thread.java:745)

Caused by: java.sql.SQLException: Failed to create database 'metastore_db',
see the next exception for details.

at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
Source)

at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
Source)

at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown
Source)

at org.apache.derby.impl.jdbc.EmbedConnection.createDatabase(Unknown
Source)

at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown
Source)

at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)

at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)

at java.security.AccessController.doPrivileged(Native Method)

at 
org.apache.derby.jdbc.InternalDriver.getNewEmbedConnection(Unknown
Source)

at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)

at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)

at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown
Source)

at java.sql.DriverManager.getConnection(DriverManager.java:664)

at java.sql.DriverManager.getConnection(DriverManager.java:208)

at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(
BoneCP.java:361)

at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:416)

... 92 more

Caused by: ERROR XJ041: Failed to create database 'metastore_db', see the
next exception for details.

at 
org.apache.derby.iapi.error.StandardException.newException(Unknown
Source)

at org.apache.derby.impl.jdbc.SQLExceptionFactory.
wrapArgsForTransportAcrossDRDA(Unknown Source)

... 108 more

Caused by: ERROR XBM0H: Directory
/home/ubuntu/spark-2.1.0-bin-hadoop2.7/bin/metastore_db
cannot be created.

at 
org.apache.derby.iapi.error.StandardException.newException(Unknown
Source)

at 
org.apache.derby.iapi.error.StandardException.newException(Unknown
Source)

at org.apache.derby.impl.services.monitor.
StorageFactoryService$10.run(Unknown Source)

at java.security.AccessController.doPrivileged(Native Method)

at 
org.apache.derby.impl.services.monitor.StorageFactoryService.createServiceRoot(Unknown
Source)

at 
org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown
Source)

at org.apache.derby.impl.services.monitor.BaseMonitor.
createPersistentService(Unknown Source)

at org.apache.derby.impl.services.monitor.FileMonitor.
createPersistentService(Unknown Source)

at org.apache.derby.iapi.services.monitor.Monitor.
createPersistentService(Unknown Source)

at org.apache.derby.impl.jdbc.EmbedConnection$5.run(Unknown
Source)

at java.security.AccessController.doPrivileged(Native Method)

at org.apache.derby.impl.jdbc.EmbedConnection.
createPersistentService(Unknown Source)

... 105 more

Traceback (most recent call last):

  File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py",
line 43, in <module>

spark = SparkSession.builder\

  File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/session.py",
line 179, in getOrCreate

session._jsparkSession.sessionState().conf().setConfString(key, value)

  File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.
10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__

  File "/home/ubuntu/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py",
line 79, in deco

raise 

Create dataframe from RDBMS table using JDBC

2017-04-26 Thread Devender Yadav
Hi All,


I am using Spark 1.6.2


Which is the suitable way to create a dataframe from an RDBMS table?


DataFrame df = 
sqlContext.read().format("jdbc").options(options).load();

or

DataFrame df = sqlContext.read().jdbc(url, table, properties);



Regards,
Devender
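
A sketch of both variants side by side (Scala syntax of the same 1.6 API; the URL, table and credentials are placeholders) to show they end up in the same place:

import java.util.Properties
import org.apache.spark.sql.SQLContext

def readJdbcBothWays(sqlContext: SQLContext) = {
  val url = "jdbc:postgresql://dbhost:5432/mydb"   // hypothetical connection string
  val table = "public.customers"                   // hypothetical table name

  // 1) generic DataFrameReader with an options map
  val viaOptions = sqlContext.read
    .format("jdbc")
    .options(Map("url" -> url, "dbtable" -> table, "user" -> "etl", "password" -> "secret"))
    .load()

  // 2) the jdbc(...) convenience method with java.util.Properties
  val props = new Properties()
  props.setProperty("user", "etl")
  props.setProperty("password", "secret")
  val viaJdbc = sqlContext.read.jdbc(url, table, props)

  // Both return equivalent DataFrames; jdbc(...) also has overloads that add
  // partitioning columns for parallel reads.
  (viaOptions, viaJdbc)
}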










WrappedArray to row of relational Db

2017-04-26 Thread vaibhavrtk
I have a nested structure which I read from an XML file using spark-xml. I want
to use Spark SQL to convert this nested structure into different relational
tables.

(WrappedArray([WrappedArray([[null,592006340,null],null,BA,M,1724]),N,2017-04-05T16:31:03,586257528),659925562)

which has a schema:
StructType(
  StructField(AirSegment,ArrayType(StructType(
    StructField(CodeshareDetails,ArrayType(StructType(
      StructField(Links,StructType(
        StructField(_VALUE,StringType,true),
        StructField(_mktSegmentID,LongType,true),
        StructField(_oprSegmentID,LongType,true)),true),
      StructField(_alphaSuffix,StringType,true),
      StructField(_carrierCode,StringType,true),
      StructField(_codeshareType,StringType,true),
      StructField(_flightNumber,StringType,true)),true),true),
    StructField(_adsIsDeleted,StringType,true),
    StructField(_adsLastUpdateTimestamp,StringType,true),
    StructField(_AirID,LongType,true)),true),true),
  StructField(flightId,LongType,true))


*Question: As you can see, CodeshareDetails is a WrappedArray inside a
WrappedArray. How can I extract these nested array rows along with the
_AirID column, so that I can insert these rows into the codeshare table
(SQLite DB), which has only the codeshare-related columns plus _AirID as a
foreign key used for joining back?*

*PS: I tried exploding, but if there are multiple rows in the AirSegment
array it doesn't work properly.*

My table structure is mentioned below:

Flight: containing flightId and other details
AirSegment: containing _AirID (PK), flightID (FK), and AirSegment details
CodeshareDetails: containing codeshare details as well as _AirID (FK)

Let me know if you need any more information
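
A sketch of one way to flatten this with two explodes (column names taken from the schema pasted above, `df` being the DataFrame produced by spark-xml): exploding AirSegment first while keeping flightId and _AirID means multiple AirSegment entries per flight each get their own row, and the second explode produces one codeshare row per entry, still carrying _AirID as the foreign key.

import org.apache.spark.sql.functions.{col, explode}

// One row per AirSegment entry, keeping the parent flightId.
val airSegments = df
  .select(col("flightId"), explode(col("AirSegment")).as("seg"))
  .select(
    col("flightId"),
    col("seg._AirID").as("_AirID"),
    col("seg._adsIsDeleted"),
    col("seg._adsLastUpdateTimestamp"),
    col("seg.CodeshareDetails").as("CodeshareDetails"))

// One row per CodeshareDetails entry, carrying _AirID for the foreign-key join back.
val codeshareDetails = airSegments
  .select(col("_AirID"), explode(col("CodeshareDetails")).as("cs"))
  .select(
    col("_AirID"),
    col("cs._carrierCode"),
    col("cs._flightNumber"),
    col("cs._codeshareType"),
    col("cs._alphaSuffix"))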



