Re: Spark job uses only one Worker

2016-01-07 Thread Annabel Melongo
Michael,
I don't know what your environment is, but if it's Cloudera, you should be able 
to see the link to your master in Hue.
Thanks 

On Thursday, January 7, 2016 5:03 PM, Michael Pisula 
 wrote:
 

  I had tried several parameters, including --total-executor-cores, no effect.
 As for the port, I tried 7077, but if I remember correctly I got some kind of 
error that suggested to try 6066, with which it worked just fine (apart from 
this issue here).
 
 Each worker has two cores. I also tried increasing cores, again no effect. I 
was able to increase the number of cores the job was using on one worker, but 
it would not use any other worker (and it would not start if the number of 
cores the job wanted was higher than the number available on one worker).
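 (For reference, a hedged sketch of the submission shape under discussion, combining the 7077 master port with an explicit core cap; the host name and core count are placeholders, not values from this thread.)

    spark/bin/spark-submit \
      --class demo.spark.StaticDataAnalysis \
      --master spark://<master-host>:7077 \
      --deploy-mode cluster \
      --total-executor-cores 6 \
      demo/Demo-1.0-SNAPSHOT-all.jar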
 
 On 07.01.2016 22:51, Igor Berman wrote:
  
 Read about --total-executor-cores. Not sure why you specify port 6066 for the 
master... usually it's 7077.
 Verify in the master UI (usually port 8080) how many cores are there (depends on 
other configs, but usually workers connect to the master with all their cores).
 On 7 January 2016 at 23:46, Michael Pisula  wrote:
 
  Hi,
 
 I start the cluster using the spark-ec2 scripts, so the cluster is in 
stand-alone mode.
 Here is how I submit my job:
 spark/bin/spark-submit --class demo.spark.StaticDataAnalysis --master 
spark://:6066 --deploy-mode cluster demo/Demo-1.0-SNAPSHOT-all.jar
 
 Cheers,
 Michael  
 
 On 07.01.2016 22:41, Igor Berman wrote:
  
 Share how you submit your job and what cluster you use (YARN, standalone).
 On 7 January 2016 at 23:24, Michael Pisula  wrote:
 
Hi there,
 
 I ran a simple Batch Application on a Spark Cluster on EC2. Despite having 3
 Worker Nodes, I could not get the application processed on more than one
 node, regardless of whether I submitted the Application in Cluster or Client mode.
 I also tried manually increasing the number of partitions in the code, no
 effect. I also pass the master into the application.
 I verified on the nodes themselves that only one node was active while the
 job was running.
 I pass enough data to make the job take 6 minutes to process.
 The job is simple enough, reading data from two S3 files, joining records on
 a shared field, filtering out some records and writing the result back to
 S3.
 
 Tried all kinds of stuff, but could not make it work. I did find similar
 questions, but had already tried the solutions that worked in those cases.
 Would be really happy about any pointers.
 
 Cheers,
 Michael
 
 
 
 --
 View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
-
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
  
  
 
-- 
Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
  
  
  
 
 -- 
Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
 

  

Re: Date Time Regression as Feature

2016-01-07 Thread Annabel Melongo
Or he can also transform the whole date into a string 

On Thursday, January 7, 2016 2:25 PM, Sujit Pal  
wrote:
 

 Hi Jorge,
Maybe extract things like dd, mm, day of week, time of day from the datetime 
string and use them as features?
-sujit
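
A hedged sketch of that extraction in Scala, assuming the "2015-12-10-10:00" format from the sample data; the feature choice is illustrative:

    import java.text.SimpleDateFormat
    import java.util.Calendar

    // parse "yyyy-MM-dd-HH:mm" and pull out simple numeric features
    def dateFeatures(s: String): Array[Double] = {
      val fmt = new SimpleDateFormat("yyyy-MM-dd-HH:mm")
      val cal = Calendar.getInstance()
      cal.setTime(fmt.parse(s))
      Array(
        cal.get(Calendar.MONTH) + 1,     // month, 1-12
        cal.get(Calendar.DAY_OF_MONTH),  // day of month
        cal.get(Calendar.DAY_OF_WEEK),   // day of week, 1-7
        cal.get(Calendar.HOUR_OF_DAY)    // hour of day, 0-23
      ).map(_.toDouble)
    }

    // dateFeatures("2015-12-10-10:00") gives Array(12.0, 10.0, 5.0, 10.0)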

On Thu, Jan 7, 2016 at 11:09 AM, Jorge Machado  
wrote:

Hello all,

I'm new to machine learning. I'm trying to predict some electric usage with a 
decision tree.
The data is :
2015-12-10-10:00, 1200
2015-12-11-10:00, 1150

My question is : What is the best way to turn date and time into feature on my 
Vector ?

Something like this :  Vector (1200, [2015,12,10,10,10] )?
I could not find any example with value prediction where features had dates in 
it.

Thanks

Jorge Machado

Jorge Machado
jo...@jmachado.me


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





  

Re: [Spark 1.6] Spark Streaming - java.lang.AbstractMethodError

2016-01-07 Thread Dibyendu Bhattacharya
Right... if you are using the GitHub version, just modify the ReceiverLauncher
and add that. I will fix it for Spark 1.6 and release a new version on
spark-packages for Spark 1.6.

Dibyendu
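
For reference, a hedged sketch of the two empty callbacks in question (event class names per SPARK-10900); the rest of the anonymous listener in ReceiverLauncher.java is assumed and elided:

    jsc.addStreamingListener(new org.apache.spark.streaming.scheduler.StreamingListener() {
        // ... the callbacks the class already implements stay as they are ...

        @Override
        public void onOutputOperationStarted(
                org.apache.spark.streaming.scheduler.StreamingListenerOutputOperationStarted started) {
            // no-op: present only so the Spark 1.6.0 listener bus can dispatch the new event
        }

        @Override
        public void onOutputOperationCompleted(
                org.apache.spark.streaming.scheduler.StreamingListenerOutputOperationCompleted completed) {
            // no-op
        }
    });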

On Thu, Jan 7, 2016 at 4:14 PM, Ted Yu  wrote:

> I cloned g...@github.com:dibbhatt/kafka-spark-consumer.git a moment ago.
>
> In ./src/main/java/consumer/kafka/ReceiverLauncher.java , I see:
>jsc.addStreamingListener(new StreamingListener() {
>
> There is no onOutputOperationStarted method implementation.
>
> Looks like it should be added for Spark 1.6.0
>
> Cheers
>
> On Thu, Jan 7, 2016 at 2:39 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> You are using low level spark kafka consumer . I am the author of the
>> same.
>>
>> Are you using the spark-packages version ? if yes which one ?
>>
>> Regards,
>> Dibyendu
>>
>> On Thu, Jan 7, 2016 at 4:07 PM, Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> Do you perhaps use custom StreamingListener?
>>> `StreamingListenerBus.scala:47` calls
>>> `StreamingListener.onOutputOperationStarted` that was added in
>>> [SPARK-10900] [STREAMING] Add output operation events to
>>> StreamingListener [1]
>>>
>>> The other guess could be that at runtime you still use Spark < 1.6.
>>>
>>> [1] https://issues.apache.org/jira/browse/SPARK-10900
>>>
>>> Pozdrawiam,
>>> Jacek
>>>
>>> Jacek Laskowski | https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark
>>> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>>
>>> On Thu, Jan 7, 2016 at 10:59 AM, Walid LEZZAR 
>>> wrote:
>>> > Hi,
>>> >
>>> > We have been using spark streaming for a little while now.
>>> >
>>> > Until now, we were running our spark streaming jobs in spark 1.5.1 and
>>> it
>>> > was working well. Yesterday, we upgraded to spark 1.6.0 without any
>>> changes
>>> > in the code. But our streaming jobs are not working any more. We are
>>> getting
>>> > an "AbstractMethodError". Please, find the stack trace at the end of
>>> the
>>> > mail. Can we have some hints on what this error means ? (we are using
>>> spark
>>> > to connect to kafka)
>>> >
>>> > The stack trace :
>>> > 16/01/07 10:44:39 INFO ZkState: Starting curator service
>>> > 16/01/07 10:44:39 INFO CuratorFrameworkImpl: Starting
>>> > 16/01/07 10:44:39 INFO ZooKeeper: Initiating client connection,
>>> > connectString=localhost:2181 sessionTimeout=12
>>> > watcher=org.apache.curator.ConnectionState@2e9fa23a
>>> > 16/01/07 10:44:39 INFO ClientCnxn: Opening socket connection to server
>>> > localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL
>>> > (unknown error)
>>> > 16/01/07 10:44:39 INFO ClientCnxn: Socket connection established to
>>> > localhost/127.0.0.1:2181, initiating session
>>> > 16/01/07 10:44:39 INFO ClientCnxn: Session establishment complete on
>>> server
>>> > localhost/127.0.0.1:2181, sessionid = 0x1521b6d262e0035, negotiated
>>> timeout
>>> > = 6
>>> > 16/01/07 10:44:39 INFO ConnectionStateManager: State change: CONNECTED
>>> > 16/01/07 10:44:40 INFO PartitionManager: Read partition information
>>> from:
>>> >
>>> /spark-kafka-consumer/StreamingArchiver/lbc.job.multiposting.input/partition_0
>>> > --> null
>>> > 16/01/07 10:44:40 INFO JobScheduler: Added jobs for time 145215988
>>> ms
>>> > 16/01/07 10:44:40 INFO JobScheduler: Starting job streaming job
>>> > 145215988 ms.0 from job set of time 145215988 ms
>>> > 16/01/07 10:44:40 ERROR Utils: uncaught error in thread
>>> > StreamingListenerBus, stopping SparkContext
>>> >
>>> > ERROR Utils: uncaught error in thread StreamingListenerBus, stopping
>>> > SparkContext
>>> > java.lang.AbstractMethodError
>>> > at
>>> >
>>> org.apache.spark.streaming.scheduler.StreamingListenerBus.onPostEvent(StreamingListenerBus.scala:47)
>>> > at
>>> >
>>> org.apache.spark.streaming.scheduler.StreamingListenerBus.onPostEvent(StreamingListenerBus.scala:26)
>>> > at
>>> > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>>> > at
>>> >
>>> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>>> > at
>>> >
>>> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>>> > at
>>> >
>>> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>>> > at
>>> >
>>> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>>> > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>>> > at
>>> >
>>> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>>> > at
>>> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
>>> >

Window Functions importing issue in Spark 1.4.0

2016-01-07 Thread satish chandra j
HI All,
Currently using Spark 1.4.0 version, I have a requirement to add a column
having Sequential Numbering to an existing DataFrame
I understand Window Function "rowNumber" serves my purpose
hence I have below import statements to include the same

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

But I am getting an error at the import statement itself such as
"object expressions is not a member of package org.apache.spark.sql"

"value rowNumber is not a member of object org.apache.spark.sql.functions"

Could anybody throw some light on how to fix the issue?

Regards,
Satish Chandra


Re: Window Functions importing issue in Spark 1.4.0

2016-01-07 Thread Jacek Laskowski
Ok, enuf! :) Leaving the room for now as I'm like a copycat :)

https://en.wiktionary.org/wiki/enuf

Pozdrawiam,
Jacek

Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski


On Thu, Jan 7, 2016 at 12:11 PM, Ted Yu  wrote:
> Please take a look at the following for sample on how rowNumber is used:
> https://github.com/apache/spark/pull/9050
>
> BTW 1.4.0 was an old release.
>
> Please consider upgrading.
>
> On Thu, Jan 7, 2016 at 3:04 AM, satish chandra j 
> wrote:
>>
>> HI All,
>> Currently using Spark 1.4.0 version, I have a requirement to add a column
>> having Sequential Numbering to an existing DataFrame
>> I understand Window Function "rowNumber" serves my purpose
>> hence I have below import statements to include the same
>>
>> import org.apache.spark.sql.expressions.Window
>> import org.apache.spark.sql.functions.rowNumber
>>
>> But I am getting an error at the import statement itself such as
>> "object expressions is not a member of package org.apache.spark.sql"
>>
>> "value rowNumber is not a member of object org.apache.spark.sql.functions"
>>
>> Could anybody throw some light if any to fix the issue
>>
>> Regards,
>> Satish Chandra
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [Spark 1.6] Spark Streaming - java.lang.AbstractMethodError

2016-01-07 Thread Dibyendu Bhattacharya
Some discussion is there in https://github.com/dibbhatt/kafka-spark-consumer
and some is mentioned in https://issues.apache.org/jira/browse/SPARK-11045

Let me know if those answer your question .

In short, the Direct Stream is a good choice if you need exactly-once semantics and
message ordering, but many use cases do not need exactly-once semantics or
message ordering. If you use the Direct Stream, the RDD
processing parallelism is limited to the number of Kafka partitions, and you need to store
offset details in an external store, as the checkpoint location is not reliable if
you modify the driver code.

Whereas in receiver-based mode, you need to enable the WAL for no data loss.
But the Spark receiver-based consumer from KafkaUtils, which uses the Kafka high-level
API, has serious issues, and thus if at all you need to switch to
receiver-based mode, this low-level consumer is a better choice.

Performance-wise I have not published any numbers yet, but from internal
testing and benchmarking I did (and validated by folks who use this
consumer), it performs much better than any existing consumer in Spark.

Regards,
Dibyendu

On Thu, Jan 7, 2016 at 4:28 PM, Jacek Laskowski  wrote:

> On Thu, Jan 7, 2016 at 11:39 AM, Dibyendu Bhattacharya
>  wrote:
> > You are using low level spark kafka consumer . I am the author of the
> same.
>
> If I may ask, what are the differences between this and the direct
> version shipped with spark? I've just started toying with it, and
> would appreciate some guidance. Thanks.
>
> Jacek
>


Re: Window Functions importing issue in Spark 1.4.0

2016-01-07 Thread Ted Yu
Please take a look at the following for sample on how rowNumber is used:
https://github.com/apache/spark/pull/9050

BTW 1.4.0 was an old release.

Please consider upgrading.
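
For reference, a hedged sketch of how rowNumber is typically used once the imports resolve; the partition/order column names are illustrative, and in the 1.4/1.5 line window functions generally require a HiveContext rather than a plain SQLContext:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.rowNumber

    val w = Window.partitionBy("groupCol").orderBy("orderCol")
    val numbered = df.withColumn("seq_no", rowNumber().over(w))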

On Thu, Jan 7, 2016 at 3:04 AM, satish chandra j 
wrote:

> HI All,
> Currently using Spark 1.4.0 version, I have a requirement to add a column
> having Sequential Numbering to an existing DataFrame
> I understand Window Function "rowNumber" serves my purpose
> hence I have below import statements to include the same
>
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.rowNumber
>
> But I am getting an error at the import statement itself such as
> "object expressions is not a member of package org.apache.spark.sql"
>
> "value rowNumber is not a member of object org.apache.spark.sql.functions"
>
> Could anybody throw some light if any to fix the issue
>
> Regards,
> Satish Chandra
>


How HiveContext can read subdirectories

2016-01-07 Thread Arkadiusz Bicz
Hi,

Can Spark using HiveContext External Tables read sub-directories?

Example:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql._

import sqlContext.implicits._

//prepare data and create subdirectories with parquet
val df = Seq("id1" -> 1, "id2" -> 4, "id3"-> 5).toDF("id", "value")
df.write.parquet("/tmp/df/1")
val df2 = Seq("id6"-> 6, "id7"-> 7, "id8"-> 8).toDF("id", "value")
df2.write.parquet("/tmp/df/2")
val dfall = sqlContext.read.load("/tmp/df/*/")
assert(dfall.count == 6)

//convert to HiveContext
val hc = new HiveContext(sqlContext.sparkContext)

hc.sql("SET hive.mapred.supports.subdirectories=true")
hc.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")

hc.sql("create external table testsubdirectories (id string, value
string) STORED AS PARQUET location '/tmp/df'")

val hcall = hc.sql("select * from testsubdirectories")

assert(hcall.count() == 6)  // should return 6, but it is 0 because it does not
// read from the subdirectories

Thanks,

Arkadiusz Bicz

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[Spark 1.6] Spark Streaming - java.lang.AbstractMethodError

2016-01-07 Thread Walid LEZZAR
Hi,

We have been using spark streaming for a little while now.

Until now, we were running our spark streaming jobs in spark 1.5.1 and it
was working well. Yesterday, we upgraded to spark 1.6.0 without any changes
in the code. But our streaming jobs are not working any more. We are
getting an "AbstractMethodError". Please, find the stack trace at the end
of the mail. Can we have some hints on what this error means ? (we are
using spark to connect to kafka)

The stack trace :
16/01/07 10:44:39 INFO ZkState: Starting curator service
16/01/07 10:44:39 INFO CuratorFrameworkImpl: Starting
16/01/07 10:44:39 INFO ZooKeeper: Initiating client connection,
connectString=localhost:2181 sessionTimeout=12
watcher=org.apache.curator.ConnectionState@2e9fa23a
16/01/07 10:44:39 INFO ClientCnxn: Opening socket connection to server
localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL
(unknown error)
16/01/07 10:44:39 INFO ClientCnxn: Socket connection established to
localhost/127.0.0.1:2181, initiating session
16/01/07 10:44:39 INFO ClientCnxn: Session establishment complete on server
localhost/127.0.0.1:2181, sessionid = 0x1521b6d262e0035, negotiated timeout
= 6
16/01/07 10:44:39 INFO ConnectionStateManager: State change: CONNECTED
16/01/07 10:44:40 INFO PartitionManager: Read partition information from:
/spark-kafka-consumer/StreamingArchiver/lbc.job.multiposting.input/partition_0
--> null
16/01/07 10:44:40 INFO JobScheduler: Added jobs for time 145215988 ms
16/01/07 10:44:40 INFO JobScheduler: Starting job streaming job
145215988 ms.0 from job set of time 145215988 ms
16/01/07 10:44:40 ERROR Utils: uncaught error in thread
StreamingListenerBus, stopping SparkContext

ERROR Utils: uncaught error in thread StreamingListenerBus, stopping
SparkContext
java.lang.AbstractMethodError
at
org.apache.spark.streaming.scheduler.StreamingListenerBus.onPostEvent(StreamingListenerBus.scala:47)
at
org.apache.spark.streaming.scheduler.StreamingListenerBus.onPostEvent(StreamingListenerBus.scala:26)
at
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
16/01/07 10:44:40 INFO JobScheduler: Finished job streaming job
145215988 ms.0 from job set of time 145215988 ms
16/01/07 10:44:40 INFO JobScheduler: Total delay: 0.074 s for time
145215988 ms (execution: 0.032 s)
16/01/07 10:44:40 ERROR JobScheduler: Error running job streaming job
145215988 ms.0
java.lang.IllegalStateException: SparkContext has been shutdown
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:920)
at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at
org.apache.spark.api.java.JavaRDDLike$class.foreachPartition(JavaRDDLike.scala:225)
at
org.apache.spark.api.java.AbstractJavaRDDLike.foreachPartition(JavaRDDLike.scala:46)
at
fr.leboncoin.morpheus.jobs.streaming.StreamingArchiver.lambda$run$ade930b4$1(StreamingArchiver.java:103)
at
org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$3.apply(JavaDStreamLike.scala:335)
at
org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$3.apply(JavaDStreamLike.scala:335)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at

Spark shell throws java.lang.RuntimeException

2016-01-07 Thread will
Hi, I wanted to try the 1.6.0 version of Spark, but when I run it into my
local machine, it throws me this exception :

java.lang.RuntimeException: java.lang.RuntimeException: The root scratch
dir: /tmp/hive on HDFS should be writable.

Thing is, this problem happened to me in the 1.5.1 version, and some people
even had it in the  1.5.0 version

  

Thanks for any help.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-shell-throws-java-lang-RuntimeException-tp25903.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Why is this job running since one hour?

2016-01-07 Thread Umesh Kacha
Hi, thanks for the response. Each job is processing around 5 GB of skewed
data, does a group by on multiple fields, does aggregation, and then does
coalesce(1) and saves a CSV file in gzip format. I think coalesce is causing the
problem, but the data is not that huge; I don't understand why it keeps on
running for an hour and prevents other jobs from running. Please guide.
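(For reference, a hedged sketch of the job shape described above; column and path names are made up for illustration, not taken from the actual job.)

    import org.apache.spark.sql.functions.sum

    val aggregated = df.groupBy("field1", "field2", "field3")
      .agg(sum("amount").as("total"))

    // coalesce(1) collapses the result into a single partition, so the final stage
    // runs as one task on one executor; with skewed input that lone task can run
    // for a long time while the rest of the cluster sits idle
    aggregated.coalesce(1).write.save("/path/to/output")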
On Jan 7, 2016 3:58 AM, "Jakob Odersky"  wrote:

> What is the job doing? How much data are you processing?
>
> On 6 January 2016 at 10:33, unk1102  wrote:
>
>> Hi I have one main Spark job which spawns multiple child spark jobs. One
>> of
>> the child spark job is running for an hour and it keeps on hanging there I
>> have taken snap shot please see
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25899/Screen_Shot_2016-01-06_at_11.jpg
>> >
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-this-job-running-since-one-hour-tp25899.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: 101 question on external metastore

2016-01-07 Thread Deenar Toraskar
I sorted this out. There were 2 different versions of Derby, and ensuring the
metastore and Spark used the same version of Derby made the problem go away.

Deenar
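
(A hedged, illustrative way to spot such a conflict; the paths below are placeholders.)

    # list the Derby jars visible to Spark and to the standalone Derby install
    find $SPARK_HOME/lib /path/to/db-derby-*-bin/lib -name 'derby*.jar' 2>/dev/null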

On 6 January 2016 at 02:55, Yana Kadiyska  wrote:

> Deenar, I have not resolved this issue. Why do you think it's from
> different versions of Derby? I was playing with this as a fun experiment
> and my setup was on a clean machine -- no other versions of
> hive/hadoop/etc...
>
> On Sun, Dec 20, 2015 at 12:17 AM, Deenar Toraskar <
> deenar.toras...@gmail.com> wrote:
>
>> apparently it is down to different versions of derby in the classpath,
>> but i am unsure where the other version is coming from. The setup worked
>> perfectly with spark 1.3.1.
>>
>> Deenar
>>
>> On 20 December 2015 at 04:41, Deenar Toraskar 
>> wrote:
>>
>>> Hi Yana/All
>>>
>>> I am getting the same exception. Did you make any progress?
>>>
>>> Deenar
>>>
>>> On 5 November 2015 at 17:32, Yana Kadiyska 
>>> wrote:
>>>
 Hi folks, trying experiment with a minimal external metastore.

 I am following the instructions here:
 https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode

 I grabbed Derby 10.12.1.1 and started an instance, verified I can
 connect via ij tool and that process is listening on 1527

 put the following hive-site.xml under conf
 ```
 
 
 
 
   javax.jdo.option.ConnectionURL
   jdbc:derby://localhost:1527/metastore_db;create=true
   JDBC connect string for a JDBC metastore
 
 
   javax.jdo.option.ConnectionDriverName
   org.apache.derby.jdbc.ClientDriver
   Driver class name for a JDBC metastore
 
 
 ```

 I then try to run spark-shell thusly:
 bin/spark-shell --driver-class-path
 /home/yana/db-derby-10.12.1.1-bin/lib/derbyclient.jar

 and I get an ugly stack trace like so...

 Caused by: java.lang.NoClassDefFoundError: Could not initialize class
 org.apache.derby.jdbc.EmbeddedDriver
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at java.lang.Class.newInstance(Class.java:379)
 at
 org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
 at
 org.datanucleus.store.rdbms.connectionpool.DBCPConnectionPoolFactory.createConnectionPool(DBCPConnectionPoolFactory.java:50)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
 at
 org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85)
 ... 114 more

 :10: error: not found: value sqlContext
import sqlContext.implicits._


 What am I doing wrong -- not sure why it's looking for Embedded
 anything, I'm specifically trying to not use the embedded server...but I
 know my hive-site is being read as starting witout --driver-class-path does
 say it can't load org.apache.derby.jdbc.ClientDriver

>>>
>>>
>>
>


Re: spark ui security

2016-01-07 Thread Ted Yu
According to https://spark.apache.org/docs/latest/security.html#web-ui ,
web UI is covered.

FYI

On Thu, Jan 7, 2016 at 6:35 AM, Kostiantyn Kudriavtsev <
kudryavtsev.konstan...@gmail.com> wrote:

> hi community,
>
> do I understand correctly that the spark.ui.filters property sets up filters
> only for the job UI? Is there any way to protect the Spark web UI in the same
> way?
>


Re: adding jars - hive on spark cdh 5.4.3

2016-01-07 Thread Prem Sure
did you try the --jars property in spark-submit? If your jar is of huge size,
you can pre-load the jar on all executors in a commonly available directory
to avoid network IO.
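
(For reference, a hedged example of passing extra jars at submit time; the class and paths are placeholders.)

    spark-submit --class your.Main \
      --jars /path/to/dep1.jar,/path/to/dep2.jar \
      your-app.jar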

On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion  wrote:

> I' trying to add jars before running a query using hive on spark on cdh
> 5.4.3.
> I've tried applying the patch in
> https://issues.apache.org/jira/browse/HIVE-12045 (manually as the patch
> is done on a different hive version) but still hasn't succeeded.
>
> did anyone manage to do ADD JAR successfully with CDH?
>
> Thanks,
> Ophir
>


Re: Date Time Regression as Feature

2016-01-07 Thread Yanbo Liang
First extracting year, month, day, time from the datetime.
Then you should decide which variables can be treated as category features
such as year/month/day and encode them to boolean form using OneHotEncoder.
At last using VectorAssembler to assemble the encoded output vector and the
other raw input into the features which can be feed into model trainer.

OneHotEncoder and VectorAssembler are feature transformers provided by
Spark ML, you can refer
https://spark.apache.org/docs/latest/ml-features.html

Thanks
Yanbo
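
A minimal sketch of that flow, assuming a DataFrame df with numeric columns "year", "month", "dayOfWeek", "hour" and a label "load", where "month" and "dayOfWeek" already hold category indices; the column names and the choice of which fields to encode are illustrative:

    import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}

    val monthEnc = new OneHotEncoder().setInputCol("month").setOutputCol("monthVec")
    val dowEnc   = new OneHotEncoder().setInputCol("dayOfWeek").setOutputCol("dowVec")

    val assembler = new VectorAssembler()
      .setInputCols(Array("year", "monthVec", "dowVec", "hour"))
      .setOutputCol("features")

    // the assembled "features" column plus the "load" label can then be fed to a regressor
    val prepared = assembler.transform(dowEnc.transform(monthEnc.transform(df)))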

2016-01-08 7:52 GMT+08:00 Annabel Melongo :

> Or he can also transform the whole date into a string
>
>
> On Thursday, January 7, 2016 2:25 PM, Sujit Pal 
> wrote:
>
>
> Hi Jorge,
>
> Maybe extract things like dd, mm, day of week, time of day from the
> datetime string and use them as features?
>
> -sujit
>
>
> On Thu, Jan 7, 2016 at 11:09 AM, Jorge Machado <
> jorge.w.mach...@hotmail.com> wrote:
>
> Hello all,
>
> I'm new to machine learning. I'm trying to predict some electric usage
> with a decision tree
> The data is :
> 2015-12-10-10:00, 1200
> 2015-12-11-10:00, 1150
>
> My question is : What is the best way to turn date and time into feature
> on my Vector ?
>
> Something like this :  Vector (1200, [2015,12,10,10,10] )?
> I could not find any example with value prediction where features had
> dates in it.
>
> Thanks
>
> Jorge Machado
>
> Jorge Machado
> jo...@jmachado.me
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>


Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-07 Thread Prem Sure
You may need to add a
createDataFrame (for Python, with inferSchema) call before registerTempTable.

Thanks,

Prem


On Thu, Jan 7, 2016 at 12:53 PM, Henrik Baastrup <
henrik.baast...@netscout.com> wrote:

> Hi All,
>
> I have a small Hadoop cluster where I have stored a lot of data in parquet 
> files. I have installed a Spark master service on one of the nodes and now 
> would like to query my parquet files from a Spark client. When I run the 
> following program from the spark-shell on the Spark Master node, everything 
> functions correctly:
>
> # val sqlCont = new org.apache.spark.sql.SQLContext(sc)
> # val reader = sqlCont.read
> # val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
> # dataFrame.registerTempTable("BICC")
> # val recSet = sqlCont.sql("SELECT 
> protocolCode,beginTime,endTime,called,calling FROM BICC WHERE 
> endTime>=14494218 AND endTime<=14494224 AND 
> calling='6287870642893' AND p_endtime=14494224")
> # recSet.show()
>
> But when I run the Java program below, from my client, I get:
>
> Exception in thread "main" java.lang.AssertionError: assertion failed: No 
> predefined schema found, and no Parquet data files or summary files found 
> under file:/user/hdfs/parquet-multi/BICC.
>
> The exception occurs at the line: DataFrame df = 
> reader.parquet("/user/hdfs/parquet-multi/BICC");
>
> On the Master node I can see the client connect when the SparkContext is 
> instanced, as I get the following lines in the Spark log:
>
> 16/01/07 18:27:47 INFO Master: Registering app SparkTest
> 16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID 
> app-20160107182747-00801
>
> If I create a local directory with the given path, my program goes in an 
> endless loop, with the following warning on the console:
>
> WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources
>
> To me it seems that my SQLContext does not connect to the Spark Master, but 
> tries to work locally on the client, where the requested files do not exist.
>
> Java program:
>   SparkConf conf = new SparkConf()
>   .setAppName("SparkTest")
>   .setMaster("spark://172.27.13.57:7077");
>   JavaSparkContext sc = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sc);
>   
>   DataFrameReader reader = sqlContext.read();
>   DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
>   DataFrame filtered = df.filter("endTime>=14494218 AND 
> endTime<=14494224 AND calling='6287870642893' AND 
> p_endtime=14494224");
>   filtered.show();
>
> Are there someone there can help me?
>
> Henrik
>
>
>


Re: Question in rdd caching in memory using persist

2016-01-07 Thread Prem Sure
Are you running standalone in local mode or cluster mode? Executor and
driver existence differ based on the setup type. A snapshot of your environment UI would
be helpful to say more.

On Thu, Jan 7, 2016 at 11:51 AM,  wrote:

> Hi,
>
>
>
> After I called rdd.persist(*MEMORY_ONLY_SER*), I see the driver listed as one 
> of the ‘executors’ participating in holding the partitions of the rdd in 
> memory, the memory usage shown against the driver is 0. This I see in the 
> storage tab of the spark ui.
>
> Why is the driver shown on the ui ? Will it ever hold rdd partitions when
> caching.
>
>
>
>
>
> *-regards*
>
> *Seemanto Barua*
>
>
>
>
>
> PLEASE READ: This message is for the named person's use only. It may
> contain confidential, proprietary or legally privileged information. No
> confidentiality or privilege is waived or lost by any mistransmission. If
> you receive this message in error, please delete it and all copies from
> your system, destroy any hard copies and notify the sender. You must not,
> directly or indirectly, use, disclose, distribute, print, or copy any part
> of this message if you are not the intended recipient. Nomura Holding
> America Inc., Nomura Securities International, Inc, and their respective
> subsidiaries each reserve the right to monitor all e-mail communications
> through its networks. Any views expressed in this message are those of the
> individual sender, except where the message states otherwise and the sender
> is authorized to state the views of such entity. Unless otherwise stated,
> any pricing information in this message is indicative only, is subject to
> change and does not constitute an offer to deal at any price quoted. Any
> reference to the terms of executed transactions should be treated as
> preliminary only and subject to our formal written confirmation.
>


Spark streaming routing

2016-01-07 Thread Lin Zhao
I have a need to route the dstream through the streaming pipeline by some key, 
such that data with the same key always goes through the same executor.

There doesn't seem to be a way to do manual routing with Spark Streaming. The 
closest I can come up with is:

stream.foreachRDD {rdd =>
  rdd.groupBy(rdd.key).flatMap { line =>...}.map(...).map(...)
}

Does this do what I expect? How about between batches? Does it guarantee the 
same key goes to the same executor in all batches?

Thanks,

Lin


Re: Spark streaming routing

2016-01-07 Thread Lin Zhao
Thanks for the reply, Tathagata. Our pipeline has a rather fat state, and that's 
why we have custom failure handling that kills all executors and goes back to a 
certain point in time in the past.

On a separate but related note, I noticed that in a chained map job, the entire 
pipeline runs on the same thread.

Say I have

dstream.map(func1).map(func2).map(func3)

For the same input func1, func2 and func3 run on the same thread. Is there a 
way to configure Spark such that they run in the same executor but different 
threads? Our pipeline has a high memory footprint and needs a low CPU:memory 
ratio.


From: Tathagata Das >
Date: Thursday, January 7, 2016 at 1:56 PM
To: Lin Zhao >
Cc: user >
Subject: Re: Spark streaming routing

You cannot guarantee that each key will forever be on the same executor. That 
is a flawed approach to designing an application if you have to ensure 
fault tolerance toward executor failures.

On Thu, Jan 7, 2016 at 9:34 AM, Lin Zhao 
> wrote:
I have a need to route the dstream through the streaming pipeline by some key, 
such that data with the same key always goes through the same executor.

There doesn't seem to be a way to do manual routing with Spark Streaming. The 
closest I can come up with is:

stream.foreachRDD {rdd =>
  rdd.groupBy(rdd.key).flatMap { line =>...}.map(...).map(...)
}

Does this do what I expect? How about between batches? Does it guarantee the 
same key goes to the same executor in all batches?

Thanks,

Lin



Re: Large scale ranked recommendation

2016-01-07 Thread xenocyon
(following up a rather old thread:)

Hi Christopher,

I understand how you might use nearest neighbors for item-item
recommendations, but how do you use it for top N items per user?

Thanks!

Apu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p25913.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Predictive Modelling in sparkR

2016-01-07 Thread Chandan Verma
Hi yanbo, 

I was able to successfully perform logistic regression on my data and also 
performed the cross validation and it all worked fine.
Thanks 

Sent from my Sony Xperia™ smartphone

 Yanbo Liang wrote 

>Hi Chandan,
>
>
>Do you mean to run your own LR algorithm based on SparkR?
>
>Actually, SparkR provide the ability to run the distributed Spark MLlib LR and 
>the interface is similar with the R GLM.
>
>For your reference: 
>https://spark.apache.org/docs/latest/sparkr.html#binomial-glm-model 
>
>
>2016-01-07 2:45 GMT+08:00 Chandan Verma :
>
>Has anyone tried building a logistic regression model in SparkR? Is it 
>recommended? Does it take longer to process than what can be done in 
>plain R?
>
> 
>
>===
> DISCLAIMER: The information contained in this message (including any 
>attachments) is confidential and may be privileged. If you have received it by 
>mistake please notify the sender by return e-mail and permanently delete this 
>message and any attachments from your system. Any dissemination, use, review, 
>distribution, printing or copying of this message in whole or in part is 
>strictly prohibited. Please note that e-mails are susceptible to change. 
>CitiusTech shall not be liable for the improper or incomplete transmission of 
>the information contained in this communication nor for any delay in its 
>receipt or damage to your system. CitiusTech does not guarantee that the 
>integrity of this communication has been maintained or that this communication 
>is free of viruses, interceptions or interferences. 
>
> 
>
>






Re: [Spark-SQL] Custom aggregate function for GrouppedData

2016-01-07 Thread Abhishek Gayakwad
Thanks Michael for replying. Aggregator/UDAF is exactly what I am looking
for, but we are still on 1.4 and it's going to take time to get to 1.6.

On Wed, Jan 6, 2016 at 10:32 AM, Michael Armbrust 
wrote:

> In Spark 1.6 GroupedDataset
> 
>  has
> mapGroups, which sounds like what you are looking for.  You can also write
> a custom Aggregator
> 
>
> On Tue, Jan 5, 2016 at 8:14 PM, Abhishek Gayakwad 
> wrote:
>
>> Hello Hivemind,
>>
>> Referring to this thread -
>> https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html.
>> I have learnt that we can not do much with groupped data apart from using
>> existing aggregate functions. This blog post was written in may 2015, I
>> don't know if things are changes from that point of time. I am using 1.4
>> version of spark.
>>
>> What I am trying to achieve is something very similar to collectset in
>> hive (actually unique ordered concated values.) e.g.
>>
>> 1,2
>> 1,3
>> 2,4
>> 2,5
>> 2,4
>>
>> to
>> 1, "2,3"
>> 2, "4,5"
>>
>> Currently I am achieving this by converting dataframe to RDD, do the
>> required operations and convert it back to dataframe as shown below.
>>
>> public class AvailableSizes implements Serializable {
>>
>> public DataFrame calculate(SQLContext ssc, DataFrame salesDataFrame) {
>> final JavaRDD<Row> rowJavaRDD = salesDataFrame.toJavaRDD();
>>
>> JavaPairRDD<String, Row> pairs = rowJavaRDD.mapToPair(
>> (PairFunction<Row, String, Row>) row -> {
>> final Object[] objects = {row.getAs(0), row.getAs(1), row.getAs(3)};
>> return new Tuple2<>(row.getAs(SalesColumns.STYLE.name()),
>> new GenericRowWithSchema(objects, SalesColumns.getOutputSchema()));
>> });
>>
>> JavaPairRDD<String, Row> withSizeList = pairs.reduceByKey(new Function2<Row, Row, Row>() {
>> @Override
>> public Row call(Row aRow, Row bRow) {
>> final String uniqueCommaSeparatedSizes = uniqueSizes(aRow, bRow);
>> final Object[] objects = {aRow.getAs(0), aRow.getAs(1), uniqueCommaSeparatedSizes};
>> return new GenericRowWithSchema(objects, SalesColumns.getOutputSchema());
>> }
>>
>> private String uniqueSizes(Row aRow, Row bRow) {
>> final SortedSet<String> allSizes = new TreeSet<>();
>> final List<String> aSizes = Arrays.asList(((String) aRow.getAs(String.valueOf(SalesColumns.SIZE))).split(","));
>> final List<String> bSizes = Arrays.asList(((String) bRow.getAs(String.valueOf(SalesColumns.SIZE))).split(","));
>> allSizes.addAll(aSizes);
>> allSizes.addAll(bSizes);
>> return csvFormat(allSizes);
>> }
>> });
>>
>> final JavaRDD<Row> values = withSizeList.values();
>>
>> return ssc.createDataFrame(values, SalesColumns.getOutputSchema());
>>
>> }
>>
>> public String csvFormat(Collection<String> collection) {
>> return collection.stream().map(Object::toString).collect(Collectors.joining(","));
>> }
>> }
>>
>> Please suggest if there is a better way of doing this.
>>
>> Regards,
>> Abhishek
>>
>
>


Re: Newbie question

2016-01-07 Thread Deepak Sharma
Yes, you can do it unless the method is marked static/final.
Most of the methods in SparkContext are marked static, so you definitely can't
override those; otherwise an override will usually work.

Thanks
Deepak
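
(A hedged sketch of the subclassing approach; the method picked here is purely illustrative, and whether a given method can actually be overridden depends on how it is declared in your Spark version.)

    import org.apache.spark.{SparkConf, SparkContext}

    class MySparkContext(conf: SparkConf) extends SparkContext(conf) {
      override def setJobDescription(value: String): Unit = {
        println(s"setting job description: $value")  // added behaviour goes here
        super.setJobDescription(value)
      }
    }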

On Fri, Jan 8, 2016 at 12:06 PM, yuliya Feldman  wrote:

> Hello,
>
> I am new to Spark and have a most likely basic question - can I override a
> method from SparkContext?
>
> Thanks
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Newbie question

2016-01-07 Thread censj
You can try it.  
> On Jan 8, 2016, at 14:44, yuliya Feldman wrote: 
> 
> invoked



Recommendations using Spark

2016-01-07 Thread anjali gautam
Hi,

Can anybody please guide me on how we can generate recommendations for
a user using Spark?

Regards,
Anjali Gautam


Re: Recommendations using Spark

2016-01-07 Thread dEEPU
Use the Spark MLlib KMeans algorithm to generate recommendations
On Jan 8, 2016 12:41 PM, anjali gautam  wrote:
Hi,

Can anybody please guide me how can we create generate recommendations for
a user using spark?

Regards,
Anjali Gautam


Re: Date Time Regression as Feature

2016-01-07 Thread dEEPU
Maybe you want to convert the date to a duration in the form of a number of hours/days 
and then do calculations on it.
On Jan 8, 2016 12:39 AM, Jorge Machado  wrote:
Hello all,

I'm new to machine learning. I'm trying to predict some electric usage with a 
decision tree.
The data is :
2015-12-10-10:00, 1200
2015-12-11-10:00, 1150

My question is : What is the best way to turn date and time into feature on my 
Vector ?

Something like this :  Vector (1200, [2015,12,10,10,10] )?
I could not find any example with value prediction where features had dates in 
it.

Thanks

Jorge Machado

Jorge Machado
jo...@jmachado.me


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Newbie question

2016-01-07 Thread yuliya Feldman
Hello,
I am new to Spark and have a most likely basic question - can I override a 
method from SparkContext?
Thanks

Re: Newbie question

2016-01-07 Thread yuliya Feldman
For example, to add some functionality there.
I understand I can extend SparkContext with an implicit class to add new 
methods that can be invoked on SparkContext, but I want to see if I can 
override an existing one.

 

  From: censj 
 To: yuliya Feldman  
Cc: "user@spark.apache.org" 
 Sent: Thursday, January 7, 2016 10:38 PM
 Subject: Re: Newbie question
   
Why do you want to override a method from SparkContext?
On Jan 8, 2016, at 14:36, yuliya Feldman wrote:
Hello,
I am new to Spark and have a most likely basic question - can I override a 
method from SparkContext?
Thanks



  

RE: How to split a huge rdd and broadcast it by turns?

2016-01-07 Thread LINChen
Hi kdmxen, you want to delete the broadcast variables on the executors to avoid 
the executor-lost failures, right? Have you tried the unpersist method instead? Like 
this: itemSplitBroadcast.destroy(true); => 
itemSplitBroadcast.unpersist(true); 
LIN Chen

Date: Thu, 7 Jan 2016 22:01:27 +0800
Subject: How to split a huge rdd and broadcast it by turns?
From: kdm...@gmail.com
To: user@spark.apache.org

Description:
Our Spark version is 1.4.1.
We want to join two huge RDDs, one of them with skewed data, so the RDD join 
operation may lead to memory problems. We try to split the smaller one into 
several pieces and then broadcast them in batches. On each broadcast turn, we 
collect one part of the smaller RDD to the driver, save it into a HashMap, and 
broadcast the HashMap. Each executor uses the broadcast value to do a map 
operation with the bigger RDD. We implement our skewed-data join this way.
But when we process the broadcast value in each turn, we find that we cannot 
destroy the broadcast value after processing. If we use broadcast.destroy(), the 
next turn's processing triggers errors like this: java.io.IOException: 
org.apache.spark.SparkException: Attempted to use Broadcast(6) after it was 
destroyed (destroy at xxx.java:369)
We have looked at the Spark source code and found that this problem is caused by 
the RDD dependency relationship. If rdd3 -> rdd2 -> rdd1 (the arrow shows the 
dependency), and rdd1 was produced using a broadcast variable named b1 and rdd2 
used b2, then when producing rdd3, the source code shows it needs to serialize b1 
and b2. If b1 or b2 is destroyed before rdd3 is produced, it causes the failure 
listed above.
Question:
Is there a way to let rdd3 forget its dependency so that it does not require b1 
and b2, only rdd2, during its production?
Or is there a way to deal with the skewed-join problem?
By the way, we have set a checkpoint for each turn and set spark.cleaner.ttl to 
600, but the problem is still there. If we don't destroy the broadcast variables, 
executors are lost in the 5th turn.
Our code is like this:
for (int i = 0; i < times; i++) {
    JavaPairRDD, Double> prevItemPairRdd = curItemPairRdd;
    List> itemSplit = itemZippedRdd
        .filter(new FilterByHashFunction(times, i)).collect();

    Map itemSplitMap = new HashMap();
    for (Tuple2 item : itemSplit) {
        itemSplitMap.put(item._1(), item._2());
    }
    Broadcast> itemSplitBroadcast = jsc.broadcast(itemSplitMap);

    curItemPairRdd = prevItemPairRdd
        .mapToPair(new NormalizeScoreFunction(itemSplitBroadcast))
        .persist(StorageLevel.DISK_ONLY());
    curItemPairRdd.count();

    itemSplitBroadcast.destroy(true);
    itemSplit.clear();
}

Re: Newbie question

2016-01-07 Thread yuliya Feldman
Thank you
 

  From: Deepak Sharma 
 To: yuliya Feldman  
Cc: "user@spark.apache.org" 
 Sent: Thursday, January 7, 2016 10:41 PM
 Subject: Re: Newbie question
   
Yes , you can do it unless the method is marked static/final.Most of the 
methods in SparkContext are marked static so you can't over ride them 
definitely , else over ride would work usually.
ThanksDeepak
On Fri, Jan 8, 2016 at 12:06 PM, yuliya Feldman  
wrote:

Hello,
I am new to Spark and have a most likely basic question - can I override a 
method from SparkContext?
Thanks



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

   

RE: Recommendations using Spark

2016-01-07 Thread Singh, Abhijeet
The question itself is very vague.

You might want to use this slide as a starting point 
http://www.slideshare.net/CasertaConcepts/analytics-week-recommendations-on-spark.

From: anjali gautam [mailto:anjali.gauta...@gmail.com]
Sent: Friday, January 08, 2016 12:42 PM
To: user@spark.apache.org
Subject: Recommendations using Spark

Hi,

Can anybody please guide me how can we create generate recommendations for a 
user using spark?

Regards,
Anjali Gautam


Spark Context not getting initialized in local mode

2016-01-07 Thread Rahul Kumar
Hi all,
I am trying to start Solr with a custom plugin which uses the Spark library. I
am trying to initialize a SparkContext in local mode. I have made a fat jar
for this plugin using the Maven Shade plugin and put it in the lib directory. *While
starting Solr it is not able to initialize the SparkContext.* It says it got a
ClassNotFoundException for AkkaRpcEnvFactory. Can anyone please help.

*It gives the following error:*

3870 [coreLoadExecutor-4-thread-1] ERROR org.apache.spark.SparkContext
 – Error initializing SparkContext.
java.lang.ClassNotFoundException:org.apache.spark.rpc.akka.AkkaRpcEnvFactory

*Here is the detailed error*

java -jar start.jar
0    [main] INFO  org.eclipse.jetty.server.Server  – jetty-8.1.10.v20130312
27   [main] INFO  org.eclipse.jetty.deploy.providers.ScanningAppProvider  – Deployment monitor /home/rahul/solr-4.7.2/example/contexts at interval 0
40   [main] INFO  org.eclipse.jetty.deploy.DeploymentManager  – Deployable added: /home/rahul/solr-4.7.2/example/contexts/solr-jetty-context.xml
1095 [main] INFO  org.eclipse.jetty.webapp.StandardDescriptorProcessor  – NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet
1155 [main] INFO  org.apache.solr.servlet.SolrDispatchFilter  – SolrDispatchFilter.init()
1189 [main] INFO  org.apache.solr.core.SolrResourceLoader  – JNDI not configured for solr (NoInitialContextEx)
1190 [main] INFO  org.apache.solr.core.SolrResourceLoader  – solr home defaulted to 'solr/' (could not find system property or JNDI)
1190 [main] INFO  org.apache.solr.core.SolrResourceLoader  – new SolrResourceLoader for directory: 'solr/'
1280 [main] INFO  org.apache.solr.core.ConfigSolr  – Loading container configuration from /home/rahul/solr-4.7.2/example/solr/solr.xml
1458 [main] INFO  org.apache.solr.core.CoresLocator  – Config-defined core root directory: /home/rahul/solr-4.7.2/example/solr
1465 [main] INFO  org.apache.solr.core.CoreContainer  – New CoreContainer 602710225
...
3870 [coreLoadExecutor-4-thread-1] ERROR org.apache.spark.SparkContext
 – Error initializing SparkContext.
java.lang.ClassNotFoundException: org.apache.spark.rpc.akka.AkkaRpcEnvFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at 
org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:430)
at 
org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
at org.apache.spark.rpc.RpcEnv$.getRpcEnvFactory(RpcEnv.scala:42)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:53)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:252)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:276)
at org.apache.spark.SparkContext.(SparkContext.scala:441)
at 
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
at 
com.snapdeal.search.spark.SparkLoadModel.loadModel(SparkLoadModel.java:11)
at 
com.snapdeal.search.valuesource.parser.RankingModelValueSourceParser.init(RankingModelValueSourceParser.java:29)
at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:591)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2191)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2185)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2218)
at org.apache.solr.core.SolrCore.initValueSourceParsers(SolrCore.java:2130)
at org.apache.solr.core.SolrCore.(SolrCore.java:765)
at org.apache.solr.core.SolrCore.(SolrCore.java:630)
at 
org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:562)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:597)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:258)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:250)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
3880 [coreLoadExecutor-4-thread-1] INFO  org.apache.spark.SparkContext  – Successfully stopped SparkContext





Rahul Kumar
*Software Engineer- I (Search Snapdeal)*

*M*: +91 9023542950*EXT: *14226
362-363, ASF CENTRE , UDYOG VIHAR , PHASE - IV , GURGAON 122 016 , 

Re: Newbie question

2016-01-07 Thread dEEPU
If the method is not final or static, then you can.
On Jan 8, 2016 12:07 PM, yuliya Feldman  wrote:
Hello,
I am new to Spark and have a most likely basic question - can I override a 
method from SparkContext?
Thanks


Re: Recommendations using Spark

2016-01-07 Thread Stephen Boesch
Alternating least squares takes  an RDD of (user/product/ratings) tuples
and the resulting Model provides predict(user, product) or predictProducts
methods among others.
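
For reference, a hedged MLlib sketch of that flow; the rank, iteration count, lambda, and IDs are arbitrary placeholders, and ratings is assumed to be an existing RDD[Rating]:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)

    // score a single (user, product) pair, or pull the top-N products for a user
    val score = model.predict(42, 17)
    val topN  = model.recommendProducts(42, 10)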


Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
Hi, we made the change because the partitioning discovery logic was too
flexible and it introduced problems that were very confusing to users. To
make your case work, we have introduced a new data source option called
basePath. You can use

DataFrame df = hiveContext.read().format("orc").option("basePath", "
path/to/table/").load("path/to/table/entity=xyz")

So, the partitioning discovery logic will understand that the base
path is path/to/table/
and your DataFrame will have the column "entity".

You can find the doc at the end of partitioning discovery section of the
sql programming guide (
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
).

Thanks,

Yin

On Thu, Jan 7, 2016 at 7:34 AM, unk1102  wrote:

> Hi from Spark 1.6 onwards as per this  doc
> <
> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
> >
> We cant add specific hive partitions to DataFrame
>
> spark 1.5 the following used to work and the following dataframe will have
> entity column
>
> DataFrame df =
> hiveContext.read().format("orc").load("path/to/table/entity=xyz")
>
> But in Spark 1.6 above does not work and I have to give base path like the
> following but it does not contain entity column which I want in DataFrame
>
> DataFrame df = hiveContext.read().format("orc").load("path/to/table/")
>
> How do I load a specific Hive partition in a DataFrame? What was the driver
> behind removing this feature, which I believe was efficient? Now the above Spark
> 1.6 code loads all partitions, and if I filter for specific partitions it is
> not efficient; it hits memory and throws a GC error because thousands of
> partitions get loaded into memory instead of the specific one. Please guide.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-specific-Hive-partition-in-DataFrame-Spark-1-6-tp25904.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Umesh Kacha
Hi Yin, thanks much your answer solved my problem. Really appreciate it!

Regards


On Fri, Jan 8, 2016 at 1:26 AM, Yin Huai  wrote:

> Hi, we made the change because the partitioning discovery logic was too
> flexible and it introduced problems that were very confusing to users. To
> make your case work, we have introduced a new data source option called
> basePath. You can use
>
> DataFrame df = hiveContext.read().format("orc").option("basePath", "
> path/to/table/").load("path/to/table/entity=xyz")
>
> So, the partitioning discovery logic will understand that the base path is 
> path/to/table/
> and your dataframe will has the column "entity".
>
> You can find the doc at the end of partitioning discovery section of the
> sql programming guide (
> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
> ).
>
> Thanks,
>
> Yin
>
> On Thu, Jan 7, 2016 at 7:34 AM, unk1102  wrote:
>
>> Hi from Spark 1.6 onwards as per this  doc
>> <
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>> >
>> We cant add specific hive partitions to DataFrame
>>
>> spark 1.5 the following used to work and the following dataframe will have
>> entity column
>>
>> DataFrame df =
>> hiveContext.read().format("orc").load("path/to/table/entity=xyz")
>>
>> But in Spark 1.6 above does not work and I have to give base path like the
>> following but it does not contain entity column which I want in DataFrame
>>
>> DataFrame df = hiveContext.read().format("orc").load("path/to/table/")
>>
>> How do I load specific hive partition in a dataframe? What was the driver
>> behind removing this feature which was efficient I believe now above Spark
>> 1.6 code load all partitions and if I filter for specific partitions it is
>> not efficient it hits memory and throws GC error because of thousands of
>> partitions get loaded into memory and not the specific one please guide.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-specific-Hive-partition-in-DataFrame-Spark-1-6-tp25904.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
No problem! Glad it helped!

On Thu, Jan 7, 2016 at 12:05 PM, Umesh Kacha  wrote:

> Hi Yin, thanks much your answer solved my problem. Really appreciate it!
>
> Regards
>
>
> On Fri, Jan 8, 2016 at 1:26 AM, Yin Huai  wrote:
>
>> Hi, we made the change because the partitioning discovery logic was too
>> flexible and it introduced problems that were very confusing to users. To
>> make your case work, we have introduced a new data source option called
>> basePath. You can use
>>
>> DataFrame df = hiveContext.read().format("orc").option("basePath", "path/to/table/").load("path/to/table/entity=xyz")
>>
>> So, the partition discovery logic will understand that the base path
>> is path/to/table/ and your DataFrame will have the column "entity".
>>
>> You can find the doc at the end of partitioning discovery section of the
>> sql programming guide (
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>> ).
>>
>> Thanks,
>>
>> Yin
>>
>> On Thu, Jan 7, 2016 at 7:34 AM, unk1102  wrote:
>>
>>> Hi from Spark 1.6 onwards as per this  doc
>>> <
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>>> >
>>> We cant add specific hive partitions to DataFrame
>>>
>>> spark 1.5 the following used to work and the following dataframe will
>>> have
>>> entity column
>>>
>>> DataFrame df =
>>> hiveContext.read().format("orc").load("path/to/table/entity=xyz")
>>>
>>> But in Spark 1.6 above does not work and I have to give base path like
>>> the
>>> following but it does not contain entity column which I want in DataFrame
>>>
>>> DataFrame df = hiveContext.read().format("orc").load("path/to/table/")
>>>
>>> How do I load specific hive partition in a dataframe? What was the driver
>>> behind removing this feature which was efficient I believe now above
>>> Spark
>>> 1.6 code load all partitions and if I filter for specific partitions it
>>> is
>>> not efficient it hits memory and throws GC error because of thousands of
>>> partitions get loaded into memory and not the specific one please guide.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-specific-Hive-partition-in-DataFrame-Spark-1-6-tp25904.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Date Time Regression as Feature

2016-01-07 Thread Jorge Machado
Hello all, 

I'm new to machine learning. I'm trying to predict some electric usage with a
decision tree.
The data is : 
2015-12-10-10:00, 1200
2015-12-11-10:00, 1150

My question is: what is the best way to turn the date and time into features in my
vector?

Something like this: Vector(1200, [2015, 12, 10, 10, 10])?
I could not find any example of value prediction where the features included dates.
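One common approach is to expand the timestamp into calendar components (year,
month, day, hour, day-of-week) so a tree can split on them; the sketch below is
only an illustration, and the timestamp format and feature choice are assumptions,
not something from this thread:

import java.text.SimpleDateFormat
import java.util.Calendar
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// "2015-12-10-10:00, 1200" -> LabeledPoint(1200.0, [year, month, day, hour, dayOfWeek])
def toLabeledPoint(line: String): LabeledPoint = {
  val Array(ts, usage) = line.split(",").map(_.trim)
  val cal = Calendar.getInstance()
  cal.setTime(new SimpleDateFormat("yyyy-MM-dd-HH:mm").parse(ts))
  val features = Array(
    cal.get(Calendar.YEAR),
    cal.get(Calendar.MONTH) + 1,        // Calendar months are 0-based
    cal.get(Calendar.DAY_OF_MONTH),
    cal.get(Calendar.HOUR_OF_DAY),
    cal.get(Calendar.DAY_OF_WEEK)       // cyclic signals like day-of-week are often the useful part
  ).map(_.toDouble)
  LabeledPoint(usage.toDouble, Vectors.dense(features))
}

An RDD of such LabeledPoints can then go into something like
DecisionTree.trainRegressor; a single raw epoch value tends to be much less
useful to a tree than these separate components.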

Thanks 

Jorge Machado 
jo...@jmachado.me


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



"impossible to get artifacts " error when using sbt to build 1.6.0 for scala 2.11

2016-01-07 Thread Lin Zhao
I tried to build 1.6.0 for YARN and Scala 2.11, but got an error. Any help is
appreciated.


[warn] Strategy 'first' was applied to 2 files

[info] Assembly up to date: 
/Users/lin/git/spark/network/yarn/target/scala-2.11/spark-network-yarn-1.6.0-hadoop2.7.1.jar

java.lang.IllegalStateException: impossible to get artifacts when data has not 
been loaded. IvyNode = org.slf4j#slf4j-log4j12;1.7.6

at org.apache.ivy.core.resolve.IvyNode.getArtifacts(IvyNode.java:809)

at org.apache.ivy.core.resolve.IvyNode.getSelectedArtifacts(IvyNode.java:786)

at 
org.apache.ivy.core.report.ResolveReport.setDependencies(ResolveReport.java:235)

at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:235)

at org.apache.ivy.Ivy.resolve(Ivy.java:517)

at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:266)

at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)

at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)

at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)

at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)

at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)

at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)

at sbt.IvySbt$$anon$4.call(Ivy.scala:64)

at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)

at 
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)

at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)

at xsbt.boot.Using$.withResource(Using.scala:10)

at xsbt.boot.Using$.apply(Using.scala:9)

at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)

at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)

at xsbt.boot.Locks$.apply0(Locks.scala:31)

at xsbt.boot.Locks$.apply(Locks.scala:28)

at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)

at sbt.IvySbt.withIvy(Ivy.scala:123)

at sbt.IvySbt.withIvy(Ivy.scala:120)

at sbt.IvySbt$Module.withModule(Ivy.scala:151)

at sbt.IvyActions$.updateEither(IvyActions.scala:157)

at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1318)

Command I ran to build:

> git checkout v1.6.0
> ./dev/change-scala-version.sh 2.11
> build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -Phadoop-provided assembly



Problems with reading data from parquet files in a HDFS remotely

2016-01-07 Thread Henrik Baastrup
Hi All,

I have a small Hadoop cluster where I have stored a lot of data in parquet 
files. I have installed a Spark master service on one of the nodes and now 
would like to query my parquet files from a Spark client. When I run the
following program from the spark-shell on the Spark master node, everything
works correctly:

# val sqlCont = new org.apache.spark.sql.SQLContext(sc)
# val reader = sqlCont.read
# val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
# dataFrame.registerTempTable("BICC")
# val recSet = sqlCont.sql("SELECT 
protocolCode,beginTime,endTime,called,calling FROM BICC WHERE 
endTime>=14494218 AND endTime<=14494224 AND 
calling='6287870642893' AND p_endtime=14494224")
# recSet.show()  

But when I run the Java program below, from my client, I get: 

Exception in thread "main" java.lang.AssertionError: assertion failed: No 
predefined schema found, and no Parquet data files or summary files found under 
file:/user/hdfs/parquet-multi/BICC.

The exception occurs at the line: DataFrame df = 
reader.parquet("/user/hdfs/parquet-multi/BICC");

On the Master node I can see the client connect when the SparkContext is 
instanced, as I get the following lines in the Spark log:

16/01/07 18:27:47 INFO Master: Registering app SparkTest
16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID 
app-20160107182747-00801

If I create a local directory with the given path, my program goes into an
endless loop, with the following warning on the console:

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; 
check your cluster UI to ensure that workers are registered and have sufficient 
resources

To me it seems that my SQLContext does not connect to the Spark master, but
tries to work locally on the client, where the requested files do not exist.

Java program:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.SQLContext;

// Connect to the standalone master and create a SQL context
SparkConf conf = new SparkConf()
    .setAppName("SparkTest")
    .setMaster("spark://172.27.13.57:7077");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

// Read the parquet files and apply the filter
DataFrameReader reader = sqlContext.read();
DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
DataFrame filtered = df.filter(
    "endTime>=14494218 AND endTime<=14494224 AND calling='6287870642893' AND p_endtime=14494224");
filtered.show();
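One detail that may matter: the assertion message resolves the path as
file:/user/hdfs/parquet-multi/BICC, i.e. against the client's local filesystem,
which fits the suspicion above that the client works locally. A minimal
spark-shell sketch of the same read with an explicit HDFS URI (the namenode host
and port are placeholders, not from this thread):

// run on the client; replace namenode-host:8020 with the cluster's fs.defaultFS
val sqlCont = new org.apache.spark.sql.SQLContext(sc)
val df = sqlCont.read.parquet("hdfs://namenode-host:8020/user/hdfs/parquet-multi/BICC")
df.registerTempTable("BICC")
sqlCont.sql("SELECT protocolCode, beginTime, endTime, called, calling FROM BICC").show()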

Is there someone who can help me?

Henrik



Re: Spark job uses only one Worker

2016-01-07 Thread Michael Pisula
Hi,

I start the cluster using the spark-ec2 scripts, so the cluster is in
stand-alone mode.
Here is how I submit my job:
spark/bin/spark-submit --class demo.spark.StaticDataAnalysis --master
spark://:6066 --deploy-mode cluster demo/Demo-1.0-SNAPSHOT-all.jar

Cheers,
Michael

On 07.01.2016 22:41, Igor Berman wrote:
> share how you submit your job
> what cluster(yarn, standalone)
>
> On 7 January 2016 at 23:24, Michael Pisula  > wrote:
>
> Hi there,
>
> I ran a simple Batch Application on a Spark Cluster on EC2.
> Despite having 3
> Worker Nodes, I could not get the application processed on more
> than one
> node, regardless if I submitted the Application in Cluster or
> Client mode.
> I also tried manually increasing the number of partitions in the
> code, no
> effect. I also pass the master into the application.
> I verified on the nodes themselves that only one node was active
> while the
> job was running.
> I pass enough data to make the job take 6 minutes to process.
> The job is simple enough, reading data from two S3 files, joining
> records on
> a shared field, filtering out some records and writing the result
> back to
> S3.
>
> Tried all kinds of stuff, but could not make it work. I did find
> similar
> questions, but had already tried the solutions that worked in
> those cases.
> Would be really happy about any pointers.
>
> Cheers,
> Michael
>
>
>
> --
> View this message in context:
> 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> 
> For additional commands, e-mail: user-h...@spark.apache.org
> 
>
>

-- 
Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082



Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
read about *--total-executor-cores*
not sure why you specify port 6066 in master...usually it's 7077
verify in master ui(usually port 8080) how many cores are there(depends on
other configs, but usually workers connect to master with all their cores)

On 7 January 2016 at 23:46, Michael Pisula 
wrote:

> Hi,
>
> I start the cluster using the spark-ec2 scripts, so the cluster is in
> stand-alone mode.
> Here is how I submit my job:
> spark/bin/spark-submit --class demo.spark.StaticDataAnalysis --master
> spark://:6066 --deploy-mode cluster demo/Demo-1.0-SNAPSHOT-all.jar
>
> Cheers,
> Michael
>
>
> On 07.01.2016 22:41, Igor Berman wrote:
>
> share how you submit your job
> what cluster(yarn, standalone)
>
> On 7 January 2016 at 23:24, Michael Pisula 
> wrote:
>
>> Hi there,
>>
>> I ran a simple Batch Application on a Spark Cluster on EC2. Despite
>> having 3
>> Worker Nodes, I could not get the application processed on more than one
>> node, regardless if I submitted the Application in Cluster or Client mode.
>> I also tried manually increasing the number of partitions in the code, no
>> effect. I also pass the master into the application.
>> I verified on the nodes themselves that only one node was active while the
>> job was running.
>> I pass enough data to make the job take 6 minutes to process.
>> The job is simple enough, reading data from two S3 files, joining records
>> on
>> a shared field, filtering out some records and writing the result back to
>> S3.
>>
>> Tried all kinds of stuff, but could not make it work. I did find similar
>> questions, but had already tried the solutions that worked in those cases.
>> Would be really happy about any pointers.
>>
>> Cheers,
>> Michael
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
> --
> Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>
>


Re: spark ui security

2016-01-07 Thread Kostiantyn Kudriavtsev
can I do it without kerberos and hadoop?
ideally using filters as for job UI

On Jan 7, 2016, at 1:22 PM, Prem Sure  wrote:

> you can refer more on https://searchcode.com/codesearch/view/97658783/
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SecurityManager.scala
> 
> spark.authenticate = true
> spark.ui.acls.enable = true
> spark.ui.view.acls = user1,user2
> spark.ui.filters = 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter
> spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/mybox@MYDOMAIN,kerberos.keytab=/some/keytab"
> 
> 
> 
> 
> On Thu, Jan 7, 2016 at 10:35 AM, Kostiantyn Kudriavtsev 
>  wrote:
> I’m afraid I missed where this property must be specified? I added it to 
> spark-xxx.conf which is basically configurable per job, so I assume to 
> protect WebUI the different place must be used, isn’t it?
> 
> On Jan 7, 2016, at 10:28 AM, Ted Yu  wrote:
> 
>> According to https://spark.apache.org/docs/latest/security.html#web-ui , web 
>> UI is covered.
>> 
>> FYI
>> 
>> On Thu, Jan 7, 2016 at 6:35 AM, Kostiantyn Kudriavtsev 
>>  wrote:
>> hi community,
>> 
>> do I understand correctly that spark.ui.filters property sets up filters 
>> only for jobui interface? is it any way to protect spark web ui in the same 
>> way?
>> 
> 
> 



Re: Spark streaming routing

2016-01-07 Thread Tathagata Das
You cannot guarantee that each key will forever be on the same executor.
That is a flawed approach to designing an application if you have to
ensure fault tolerance toward executor failures.
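For completeness, a sketch (not from this thread) of the closest approximation:
partitioning by key inside each batch puts all records for a key into the same
partition for that batch, but which executor hosts that partition is still the
scheduler's choice, so it is not stable across batches or failures:

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// `stream` stands for the keyed DStream from the question; the types are illustrative
def handle(stream: DStream[(String, String)]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.partitionBy(new HashPartitioner(8))     // same key -> same partition, within this batch
       .mapPartitions { records =>
         // records that share a key arrive together here, but the hosting
         // executor can change between batches or after a failure
         records.map { case (key, value) => (key, value.length) }
       }
       .count()                                 // an action to trigger the batch
  }
}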

On Thu, Jan 7, 2016 at 9:34 AM, Lin Zhao  wrote:

> I have a need to route the DStream through the streaming pipeline by some
> key, such that data with the same key always goes through the same
> executor.
>
> There doesn't seem to be a way to do manual routing with Spark Streaming.
> The closest I can come up with is:
>
> stream.foreachRDD {rdd =>
>   rdd.groupBy(_.key).flatMap { line => … }.map(…).map(…)
> }
>
> Does this do what I expect? How about between batches? Does it guarantee
> the same key goes to the same executor in all batches?
>
> Thanks,
>
> Lin
>


Re: spark ui security

2016-01-07 Thread Ted Yu
Without kerberos you don't have true security.

Cheers

On Thu, Jan 7, 2016 at 1:56 PM, Kostiantyn Kudriavtsev <
kudryavtsev.konstan...@gmail.com> wrote:

> can I do it without kerberos and hadoop?
> ideally using filters as for job UI
>
> On Jan 7, 2016, at 1:22 PM, Prem Sure  wrote:
>
> you can refer more on https://searchcode.com/codesearch/view/97658783/
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SecurityManager.scala
>
> spark.authenticate = true
> spark.ui.acls.enable = true
> spark.ui.view.acls = user1,user2
> spark.ui.filters =
> org.apache.hadoop.security.authentication.server.AuthenticationFilter
>
> spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/mybox@MYDOMAIN
> ,kerberos.keytab=/some/keytab"
>
>
>
>
> On Thu, Jan 7, 2016 at 10:35 AM, Kostiantyn Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
>> I’m afraid I missed where this property must be specified? I added it to
>> spark-xxx.conf which is basically configurable per job, so I assume to
>> protect WebUI the different place must be used, isn’t it?
>>
>> On Jan 7, 2016, at 10:28 AM, Ted Yu  wrote:
>>
>> According to https://spark.apache.org/docs/latest/security.html#web-ui ,
>> web UI is covered.
>>
>> FYI
>>
>> On Thu, Jan 7, 2016 at 6:35 AM, Kostiantyn Kudriavtsev <
>> kudryavtsev.konstan...@gmail.com> wrote:
>>
>>> hi community,
>>>
>>> do I understand correctly that spark.ui.filters property sets up
>>> filters only for jobui interface? is it any way to protect spark web ui in
>>> the same *way?*
>>>
>>
>>
>>
>
>


RE: Question in rdd caching in memory using persist

2016-01-07 Thread seemanto.barua
Attached is a screenshot of the Storage tab details for the cached RDD. The
host highlighted at the end of the list is the driver machine.



-regards
Seemanto Barua


From: Barua, Seemanto (US)
Sent: Thursday, January 07, 2016 12:43 PM
To: 'premsure...@gmail.com'
Cc: 'user@spark.apache.org'
Subject: Re: Question in rdd caching in memory using persist

I have a standalone cluster. spark version is 1.3.1


From: Prem Sure [mailto:premsure...@gmail.com]
Sent: Thursday, January 07, 2016 12:32 PM
To: Barua, Seemanto (US)
Cc: spark users >
Subject: Re: Question in rdd caching in memory using persist

Are you running standalone - local mode or cluster mode? Executor and driver
existence differ based on the setup type. A snapshot of your environment UI
would be helpful to say more.

On Thu, Jan 7, 2016 at 11:51 AM, 
> wrote:
Hi,


After I called rdd.persist(MEMORY_ONLY_SER), I see the driver listed as one of
the ‘executors’ participating in holding the partitions of the RDD in memory,
but the memory usage shown against the driver is 0. I see this in the Storage
tab of the Spark UI.
Why is the driver shown on the UI? Will it ever hold RDD partitions when
caching?


-regards
Seemanto Barua









Re: Spark job uses only one Worker

2016-01-07 Thread Michael Pisula
I had tried several parameters, including --total-executor-cores, no effect.
As for the port, I tried 7077, but if I remember correctly I got some
kind of error that suggested to try 6066, with which it worked just fine
(apart from this issue here).

Each worker has two cores. I also tried increasing cores, again no
effect. I was able to increase the number of cores the job was using on
one worker, but it would not use any other worker (and it would not
start if the number of cores the job wanted was higher than the number
available on one worker).

On 07.01.2016 22:51, Igor Berman wrote:
> read about *--total-executor-cores*
> not sure why you specify port 6066 in master...usually it's 7077
> verify in master ui(usually port 8080) how many cores are
> there(depends on other configs, but usually workers connect to master
> with all their cores)
>
> On 7 January 2016 at 23:46, Michael Pisula  > wrote:
>
> Hi,
>
> I start the cluster using the spark-ec2 scripts, so the cluster is
> in stand-alone mode.
> Here is how I submit my job:
> spark/bin/spark-submit --class demo.spark.StaticDataAnalysis
> --master spark://:6066 --deploy-mode cluster
> demo/Demo-1.0-SNAPSHOT-all.jar
>
> Cheers,
> Michael
>
>
> On 07.01.2016 22:41, Igor Berman wrote:
>> share how you submit your job
>> what cluster(yarn, standalone)
>>
>> On 7 January 2016 at 23:24, Michael Pisula
>> >
>> wrote:
>>
>> Hi there,
>>
>> I ran a simple Batch Application on a Spark Cluster on EC2.
>> Despite having 3
>> Worker Nodes, I could not get the application processed on
>> more than one
>> node, regardless if I submitted the Application in Cluster or
>> Client mode.
>> I also tried manually increasing the number of partitions in
>> the code, no
>> effect. I also pass the master into the application.
>> I verified on the nodes themselves that only one node was
>> active while the
>> job was running.
>> I pass enough data to make the job take 6 minutes to process.
>> The job is simple enough, reading data from two S3 files,
>> joining records on
>> a shared field, filtering out some records and writing the
>> result back to
>> S3.
>>
>> Tried all kinds of stuff, but could not make it work. I did
>> find similar
>> questions, but had already tried the solutions that worked in
>> those cases.
>> Would be really happy about any pointers.
>>
>> Cheers,
>> Michael
>>
>>
>>
>> --
>> View this message in context:
>> 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> 
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
>>
>>
>
> -- 
> Michael Pisula * michael.pis...@tngtech.com 
>  * +49-174-3180084 
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>
>

-- 
Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082



Re: spark ui security

2016-01-07 Thread Kostiantyn Kudriavtsev
I know, but I only need to hide/protect the web UI, at least with the servlet/filter API
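Not from this thread, but for reference: spark.ui.filters accepts any standard
javax.servlet Filter, so a small HTTP Basic auth filter can guard a UI without
Kerberos. A sketch (the class name and hard-coded credentials are purely
illustrative; java.util.Base64 needs Java 8):

import javax.servlet._
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}
import java.util.Base64

class BasicAuthUIFilter extends Filter {
  // Hard-coded for illustration only -- load real credentials from a safe place
  private val expected =
    "Basic " + Base64.getEncoder.encodeToString("admin:secret".getBytes("UTF-8"))

  def init(config: FilterConfig): Unit = {}

  def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    val request  = req.asInstanceOf[HttpServletRequest]
    val response = res.asInstanceOf[HttpServletResponse]
    if (expected == request.getHeader("Authorization")) {
      chain.doFilter(req, res)  // credentials match, let the request through
    } else {
      response.setHeader("WWW-Authenticate", "Basic realm=\"Spark UI\"")
      response.sendError(HttpServletResponse.SC_UNAUTHORIZED)
    }
  }

  def destroy(): Unit = {}
}

It is registered with spark.ui.filters=<fully qualified filter class name> on the
process whose UI should be protected, and the compiled class has to be on that
process's classpath.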

On Jan 7, 2016, at 4:59 PM, Ted Yu  wrote:

> Without kerberos you don't have true security.
> 
> Cheers
> 
> On Thu, Jan 7, 2016 at 1:56 PM, Kostiantyn Kudriavtsev 
>  wrote:
> can I do it without kerberos and hadoop?
> ideally using filters as for job UI
> 
> On Jan 7, 2016, at 1:22 PM, Prem Sure  wrote:
> 
>> you can refer more on https://searchcode.com/codesearch/view/97658783/
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SecurityManager.scala
>> 
>> spark.authenticate = true
>> spark.ui.acls.enable = true
>> spark.ui.view.acls = user1,user2
>> spark.ui.filters = 
>> org.apache.hadoop.security.authentication.server.AuthenticationFilter
>> spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/mybox@MYDOMAIN,kerberos.keytab=/some/keytab"
>> 
>> 
>> 
>> 
>> On Thu, Jan 7, 2016 at 10:35 AM, Kostiantyn Kudriavtsev 
>>  wrote:
>> I’m afraid I missed where this property must be specified? I added it to 
>> spark-xxx.conf which is basically configurable per job, so I assume to 
>> protect WebUI the different place must be used, isn’t it?
>> 
>> On Jan 7, 2016, at 10:28 AM, Ted Yu  wrote:
>> 
>>> According to https://spark.apache.org/docs/latest/security.html#web-ui , 
>>> web UI is covered.
>>> 
>>> FYI
>>> 
>>> On Thu, Jan 7, 2016 at 6:35 AM, Kostiantyn Kudriavtsev 
>>>  wrote:
>>> hi community,
>>> 
>>> do I understand correctly that spark.ui.filters property sets up filters 
>>> only for jobui interface? is it any way to protect spark web ui in the same 
>>> way?
>>> 
>> 
>> 
> 
> 



Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
Do you see in the master UI that the workers are connected to the master, and
that before you run your app there are 2 available cores per worker in the
master UI? I understand that there are 2 cores on each worker - the question is
whether they got registered with the master.

Regarding the port it's very strange; please post what the problem is when
connecting to 7077.

Use *--total-executor-cores 4* in your submit.

If you can, post the master UI screen after you have submitted your app.
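For concreteness, the submit command from earlier in the thread with both
suggestions applied would look roughly like this (master host elided as in the
original; whether port 7077 accepts --deploy-mode cluster is exactly what the
error message should tell us):

spark/bin/spark-submit --class demo.spark.StaticDataAnalysis \
  --master spark://<master-host>:7077 --deploy-mode cluster \
  --total-executor-cores 4 demo/Demo-1.0-SNAPSHOT-all.jar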


On 8 January 2016 at 00:02, Michael Pisula 
wrote:

> I had tried several parameters, including --total-executor-cores, no
> effect.
> As for the port, I tried 7077, but if I remember correctly I got some kind
> of error that suggested to try 6066, with which it worked just fine (apart
> from this issue here).
>
> Each worker has two cores. I also tried increasing cores, again no effect.
> I was able to increase the number of cores the job was using on one worker,
> but it would not use any other worker (and it would not start if the number
> of cores the job wanted was higher than the number available on one worker).
>
>
> On 07.01.2016 22:51, Igor Berman wrote:
>
> read about *--total-executor-cores*
> not sure why you specify port 6066 in master...usually it's 7077
> verify in master ui(usually port 8080) how many cores are there(depends on
> other configs, but usually workers connect to master with all their cores)
>
> On 7 January 2016 at 23:46, Michael Pisula 
> wrote:
>
>> Hi,
>>
>> I start the cluster using the spark-ec2 scripts, so the cluster is in
>> stand-alone mode.
>> Here is how I submit my job:
>> spark/bin/spark-submit --class demo.spark.StaticDataAnalysis --master
>> spark://:6066 --deploy-mode cluster demo/Demo-1.0-SNAPSHOT-all.jar
>>
>> Cheers,
>> Michael
>>
>>
>> On 07.01.2016 22:41, Igor Berman wrote:
>>
>> share how you submit your job
>> what cluster(yarn, standalone)
>>
>> On 7 January 2016 at 23:24, Michael Pisula < 
>> michael.pis...@tngtech.com> wrote:
>>
>>> Hi there,
>>>
>>> I ran a simple Batch Application on a Spark Cluster on EC2. Despite
>>> having 3
>>> Worker Nodes, I could not get the application processed on more than one
>>> node, regardless if I submitted the Application in Cluster or Client
>>> mode.
>>> I also tried manually increasing the number of partitions in the code, no
>>> effect. I also pass the master into the application.
>>> I verified on the nodes themselves that only one node was active while
>>> the
>>> job was running.
>>> I pass enough data to make the job take 6 minutes to process.
>>> The job is simple enough, reading data from two S3 files, joining
>>> records on
>>> a shared field, filtering out some records and writing the result back to
>>> S3.
>>>
>>> Tried all kinds of stuff, but could not make it work. I did find similar
>>> questions, but had already tried the solutions that worked in those
>>> cases.
>>> Would be really happy about any pointers.
>>>
>>> Cheers,
>>> Michael
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: 
>>> user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: 
>>> user-h...@spark.apache.org
>>>
>>>
>>
>> --
>> Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>>
>>
>
> --
> Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>
>


Re: Spark job uses only one Worker

2016-01-07 Thread Michael Pisula
All the workers were connected, I even saw the job being processed on
different workers, so that was working fine.
I will fire up the cluster again tomorrow and post the results of
connecting to 7077 and using --total-executor-cores 4.

Thanks for the help

On 07.01.2016 23:10, Igor Berman wrote:
> do you see in master ui that workers connected to master & before you
> are running your app there are 2 available cores in master ui per each
> worker?
> I understand that there are 2 cores on each worker - the question is
> do they got registered under master
>
> regarding port it's very strange, please post what is problem
> connecting to 7077
>
> use *--total-executor-cores 4 in your submit*
> if you can post master ui screen after you submitted your app
>
>
> On 8 January 2016 at 00:02, Michael Pisula  > wrote:
>
> I had tried several parameters, including --total-executor-cores,
> no effect.
> As for the port, I tried 7077, but if I remember correctly I got
> some kind of error that suggested to try 6066, with which it
> worked just fine (apart from this issue here).
>
> Each worker has two cores. I also tried increasing cores, again no
> effect. I was able to increase the number of cores the job was
> using on one worker, but it would not use any other worker (and it
> would not start if the number of cores the job wanted was higher
> than the number available on one worker).
>
>
> On 07.01.2016 22:51, Igor Berman wrote:
>> read about *--total-executor-cores*
>> not sure why you specify port 6066 in master...usually it's 7077
>> verify in master ui(usually port 8080) how many cores are
>> there(depends on other configs, but usually workers connect to
>> master with all their cores)
>>
>> On 7 January 2016 at 23:46, Michael Pisula
>> >
>> wrote:
>>
>> Hi,
>>
>> I start the cluster using the spark-ec2 scripts, so the
>> cluster is in stand-alone mode.
>> Here is how I submit my job:
>> spark/bin/spark-submit --class demo.spark.StaticDataAnalysis
>> --master spark://:6066 --deploy-mode cluster
>> demo/Demo-1.0-SNAPSHOT-all.jar
>>
>> Cheers,
>> Michael
>>
>>
>> On 07.01.2016 22:41, Igor Berman wrote:
>>> share how you submit your job
>>> what cluster(yarn, standalone)
>>>
>>> On 7 January 2016 at 23:24, Michael Pisula
>>> >> > wrote:
>>>
>>> Hi there,
>>>
>>> I ran a simple Batch Application on a Spark Cluster on
>>> EC2. Despite having 3
>>> Worker Nodes, I could not get the application processed
>>> on more than one
>>> node, regardless if I submitted the Application in
>>> Cluster or Client mode.
>>> I also tried manually increasing the number of
>>> partitions in the code, no
>>> effect. I also pass the master into the application.
>>> I verified on the nodes themselves that only one node
>>> was active while the
>>> job was running.
>>> I pass enough data to make the job take 6 minutes to
>>> process.
>>> The job is simple enough, reading data from two S3
>>> files, joining records on
>>> a shared field, filtering out some records and writing
>>> the result back to
>>> S3.
>>>
>>> Tried all kinds of stuff, but could not make it work. I
>>> did find similar
>>> questions, but had already tried the solutions that
>>> worked in those cases.
>>> Would be really happy about any pointers.
>>>
>>> Cheers,
>>> Michael
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
>>> Sent from the Apache Spark User List mailing list
>>> archive at Nabble.com.
>>>
>>> 
>>> -
>>> To unsubscribe, e-mail:
>>> user-unsubscr...@spark.apache.org
>>> 
>>> For additional commands, e-mail:
>>> user-h...@spark.apache.org
>>> 
>>>
>>>
>>
>> -- 
>> Michael Pisula * michael.pis...@tngtech.com 
>>  * +49-174-3180084 
>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>> 

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
share how you submit your job
what cluster(yarn, standalone)

On 7 January 2016 at 23:24, Michael Pisula 
wrote:

> Hi there,
>
> I ran a simple Batch Application on a Spark Cluster on EC2. Despite having
> 3
> Worker Nodes, I could not get the application processed on more than one
> node, regardless if I submitted the Application in Cluster or Client mode.
> I also tried manually increasing the number of partitions in the code, no
> effect. I also pass the master into the application.
> I verified on the nodes themselves that only one node was active while the
> job was running.
> I pass enough data to make the job take 6 minutes to process.
> The job is simple enough, reading data from two S3 files, joining records
> on
> a shared field, filtering out some records and writing the result back to
> S3.
>
> Tried all kinds of stuff, but could not make it work. I did find similar
> questions, but had already tried the solutions that worked in those cases.
> Would be really happy about any pointers.
>
> Cheers,
> Michael
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


adding jars - hive on spark cdh 5.4.3

2016-01-07 Thread Ophir Etzion
I'm trying to add jars before running a query using Hive on Spark on CDH
5.4.3.
I've tried applying the patch in
https://issues.apache.org/jira/browse/HIVE-12045 (manually, as the patch is
against a different Hive version) but still haven't succeeded.

did anyone manage to do ADD JAR successfully with CDH?

Thanks,
Ophir


Spark job uses only one Worker

2016-01-07 Thread Michael Pisula
Hi there,

I ran a simple Batch Application on a Spark Cluster on EC2. Despite having 3
Worker Nodes, I could not get the application processed on more than one
node, regardless if I submitted the Application in Cluster or Client mode.
I also tried manually increasing the number of partitions in the code, no
effect. I also pass the master into the application.
I verified on the nodes themselves that only one node was active while the
job was running.
I pass enough data to make the job take 6 minutes to process.
The job is simple enough, reading data from two S3 files, joining records on
a shared field, filtering out some records and writing the result back to
S3.

Tried all kinds of stuff, but could not make it work. I did find similar
questions, but had already tried the solutions that worked in those cases.
Would be really happy about any pointers.

Cheers,
Michael



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkContext SyntaxError: invalid syntax

2016-01-07 Thread weineran
Hello,

When I try to submit a python job using spark-submit (using --master yarn
--deploy-mode cluster), I get the following error:

Traceback (most recent call last):
  File "loss_rate_by_probe.py", line 15, in ?
from pyspark import SparkContext
  File
"/scratch5/hadoop/yarn/local/usercache//filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
line 41, in ?
  File
"/scratch5/hadoop/yarn/local/usercache//filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
line 219
with SparkContext._lock:
^
SyntaxError: invalid syntax

This is very similar to this post from 2014, but unlike that person I am using
Python 2.7.8.

Here is what I'm using:
Spark 1.3.1
Hadoop 2.4.0.2.1.5.0-695
Python 2.7.8

Another clue:  I also installed Spark 1.6.0 and tried to submit the same
job.  I got a similar error:

Traceback (most recent call last):
  File "loss_rate_by_probe.py", line 15, in ?
from pyspark import SparkContext
  File
"/scratch5/hadoop/yarn/local/usercache//appcache/application_1450370639491_0119/container_1450370639491_0119_01_01/pyspark.zip/pyspark/__init__.py",
line 61
indent = ' ' * (min(len(m) for m in indents) if indents else 0)
  ^
SyntaxError: invalid syntax
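One observation, not stated in the thread: the "in ?" frames in both tracebacks
are what Python prints before 2.6, which suggests the interpreter YARN launches
on the cluster nodes is older than the 2.7.8 used on the submitting machine. A
hedged sketch of pointing PySpark at a specific interpreter (the path is a
placeholder):

spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/path/to/python2.7 \
  --conf spark.executorEnv.PYSPARK_PYTHON=/path/to/python2.7 \
  loss_rate_by_probe.py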

Any thoughts?

Andrew



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org