Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
 Sweet - I'll have to play with this then! :)
On Fri, Apr 3, 2015 at 19:43 Reynold Xin  wrote:

> There is already an explode function on DataFrame btw
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712
>
> I think something like this would work. You might need to play with the
> type.
>
> df.explode("arrayBufferColumn") { x => x }
>
>
>
> On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee  wrote:
>
>> Thanks Dean - fun hack :)
>>
>> On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler 
>> wrote:
>>
>>> A hack workaround is to use flatMap:
>>>
>>> rdd.flatMap{ case (date, array) => for (x <- array) yield (date, x) }
>>>
>>> For those of you who don't know Scala, the for comprehension iterates
>>> through the ArrayBuffer, named "array" and yields new tuples with the date
>>> and each element. The case expression to the left of the => pattern matches
>>> on the input tuples.
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>>  (O'Reilly)
>>> Typesafe 
>>> @deanwampler 
>>> http://polyglotprogramming.com
>>>
>>> On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee 
>>> wrote:
>>>
 Thanks Michael - that was it!  I was drawing a blank on this one for
 some reason - much appreciated!


 On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust 
 wrote:

> A lateral view explode using HiveQL.  I'm hoping to add explode
> shorthand directly to the df API in 1.4.
>
> On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee 
> wrote:
>
>> Quick question - the output of a dataframe is in the format of:
>>
>> [2015-04, ArrayBuffer(A, B, C, D)]
>>
>> and I'd like to return it as:
>>
>> 2015-04, A
>> 2015-04, B
>> 2015-04, C
>> 2015-04, D
>>
>> What's the best way to do this?
>>
>> Thanks in advance!
>>
>>
>>
>
>>>
>


Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Reynold Xin
There is already an explode function on DataFrame btw

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712

I think something like this would work. You might need to play with the
type.

df.explode("arrayBufferColumn") { x => x }
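For reference, a slightly fuller sketch of that call. The column names and the Seq[String] element type here are assumptions, and the type annotation on the lambda is usually the part that needs adjusting:

    // Minimal sketch: df is assumed to have columns (month: String, items: Seq[String]).
    // The two-argument form of explode names the output column and emits one row per element.
    val exploded = df.explode("items", "item") { items: Seq[String] => items }
    exploded.select("month", "item").show()
    // 2015-04  A
    // 2015-04  B
    // ...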



On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee  wrote:

> Thanks Dean - fun hack :)
>
> On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler  wrote:
>
>> A hack workaround is to use flatMap:
>>
>> rdd.flatMap{ case (date, array) => for (x <- array) yield (date, x) }
>>
>> For those of you who don't know Scala, the for comprehension iterates
>> through the ArrayBuffer, named "array" and yields new tuples with the date
>> and each element. The case expression to the left of the => pattern matches
>> on the input tuples.
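For readers coming from other languages, the for comprehension above desugars to a map nested inside flatMap. A minimal equivalent sketch, assuming rdd holds (date, ArrayBuffer) pairs:

    // One (date, element) pair per element of the buffer; same result as the for/yield version.
    val flattened = rdd.flatMap { case (date, array) => array.map(x => (date, x)) }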
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee  wrote:
>>
>>> Thanks Michael - that was it!  I was drawing a blank on this one for
>>> some reason - much appreciated!
>>>
>>>
>>> On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust 
>>> wrote:
>>>
 A lateral view explode using HiveQL.  I'm hoping to add explode
 shorthand directly to the df API in 1.4.

 On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee 
 wrote:

> Quick question - the output of a dataframe is in the format of:
>
> [2015-04, ArrayBuffer(A, B, C, D)]
>
> and I'd like to return it as:
>
> 2015-04, A
> 2015-04, B
> 2015-04, C
> 2015-04, D
>
> What's the best way to do this?
>
> Thanks in advance!
>
>
>

>>


Re: kmeans|| in Spark is not real paralleled?

2015-04-03 Thread Xi Shen
Hi Xingrui,

I have created JIRA https://issues.apache.org/jira/browse/SPARK-6706 and
attached the sample code. But I could not attach the test data. I will
update the bug once I find a place to host the test data.


Thanks,
David


On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng  wrote:

> This PR updated the k-means|| initialization:
> https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
> which was included in 1.3.0. It should fix kmeans|| initialization with
> large k. Please create a JIRA for this issue and send me the code and the
> dataset to produce this problem. Thanks! -Xiangrui
>
> On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen  wrote:
>
>> Hi,
>>
>> I have opened a couple of threads asking about a k-means performance
>> problem in Spark. I think I have made a little progress.
>>
>> Previously I used the simplest form, KMeans.train(rdd, k, maxIterations).
>> It uses the "kmeans||" initialization algorithm, which is supposed to be a
>> faster version of kmeans++ and to give better results in general.
>>
>> But I observed that if the k is very large, the initialization step takes
>> a long time. From the CPU utilization chart, it looks like only one thread
>> is working. Please see
>> https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
>> .
>>
>> I read the paper,
>> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, and it
>> points out kmeans++ initialization algorithm will suffer if k is large.
>> That's why the paper contributed the kmeans|| algorithm.
>>
>>
>> If I invoke KMeans.train with the random initialization algorithm, I
>> do not observe this problem, even with very large k, like k=5000. This
>> makes me suspect that kmeans|| in Spark is not properly implemented and
>> does not use a parallel implementation.
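For reference, a minimal sketch of the two initialization modes being compared (the input path and parameters are placeholders; the builder API is MLlib's KMeans as of Spark 1.2/1.3):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Hypothetical input: one whitespace-separated vector per line.
    val data = sc.textFile("hdfs:///kmeans-input.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // "k-means||" is the default initializer (the slow step reported here for large k);
    // "random" skips the expensive initialization step entirely.
    val modelParallel = new KMeans().setK(5000).setMaxIterations(20)
      .setInitializationMode("k-means||").run(data)
    val modelRandom = new KMeans().setK(5000).setMaxIterations(20)
      .setInitializationMode("random").run(data)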
>>
>>
>> I have also tested my code and data set with Spark 1.3.0, and I still
>> observe this problem. I quickly checked the PR covering the KMeans
>> changes from 1.2.0 to 1.3.0. It seems to contain only code improvements
>> and polish, not changes or improvements to the algorithm.
>>
>>
>> I originally worked in a Windows 64-bit environment, and I also tested in
>> a Linux 64-bit environment. I can provide the code and data set if anyone
>> wants to reproduce this problem.
>>
>>
>> I hope a Spark developer can comment on this problem and help
>> identify whether it is a bug.
>>
>>
>> Thanks,
>>
>> Xi Shen
>> about.me/davidshen
>>
>
>


Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
Thanks Mohammed,

I was aware of Calliope, but haven't used it since the
spark-cassandra-connector project got released.  I was not aware of
CalliopeServer2; cool, thanks for sharing that one.

I would appreciate it if you could let me know how you decide to proceed with this;
I can see this coming up on my radar in the next few months. Thanks.

-Todd

On Fri, Apr 3, 2015 at 5:53 PM, Mohammed Guller 
wrote:

>  Thanks, Todd.
>
>
>
> It is an interesting idea; worth trying.
>
>
>
> I think the cash project is old. The tuplejump guy has created another
> project called CalliopeServer2, which works like a charm with BI tools that
> use JDBC, but unfortunately Tableau throws an error when it connects to it.
>
>
>
> Mohammed
>
>
>
> *From:* Todd Nist [mailto:tsind...@gmail.com]
> *Sent:* Friday, April 3, 2015 11:39 AM
> *To:* pawan kumar
> *Cc:* Mohammed Guller; user@spark.apache.org
>
> *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra
>
>
>
> Hi Mohammed,
>
>
>
> Not sure if you have tried this or not.  You could try using the below api
> to start the thriftserver with an existing context.
>
>
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42
>
> The one thing that Michael Armbrust @ Databricks recommended was this:
>
> You can start a JDBC server with an existing context.  See my answer here:
> http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html
>
> So something like this based on example from Cheng Lian:
>
>
> *Server*
>
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.catalyst.types._
>
> val sparkContext = sc
> import sparkContext._
> val sqlContext = new HiveContext(sparkContext)
> import sqlContext._
>
> makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
> // replace the above with the C* + spark-cassandra-connector to generate a SchemaRDD and registerTempTable
>
> import org.apache.spark.sql.hive.thriftserver._
> HiveThriftServer2.startWithContext(sqlContext)
>
>   Then Startup
>
> ./bin/beeline -u jdbc:hive2://localhost:1/default
>
> 0: jdbc:hive2://localhost:1/default> select * from t;
>
>
>
>   I have not tried this yet from Tableau.   My understanding is that the
> tempTable is only valid as long as the sqlContext is, so if one terminates
> the code representing the *Server*, and then restarts the standard thrift
> server, sbin/start-thriftserver ..., the table won't be available.
>
>
>
> Another possibility is to perhaps use the tuplejump cash project,
> https://github.com/tuplejump/cash.
>
>
>
> HTH.
>
>
>
> -Todd
>
>
>
> On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar  wrote:
>
> Thanks mohammed. Will give it a try today. We would also need the
> sparksSQL piece as we are migrating our data store from oracle to C* and it
> would be easier to maintain all the reports rather recreating each one from
> scratch.
>
> Thanks,
> Pawan Venugopal.
>
> On Apr 3, 2015 7:59 AM, "Mohammed Guller"  wrote:
>
> Hi Todd,
>
>
>
> We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
> with C* using the ODBC driver, but now would like to add Spark SQL to the
> mix. I haven’t been able to find any documentation for how to make this
> combination work.
>
>
>
> We are using the Spark-Cassandra-Connector in our applications, but
> haven’t been able to figure out how to get the Spark SQL Thrift Server to
> use it and connect to C*. That is the missing piece. Once we solve that
> piece of the puzzle then Tableau should be able to see the tables in C*.
>
>
>
> Hi Pawan,
>
> Tableau + C* is pretty straightforward, especially if you are using DSE.
> Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once
> you connect, Tableau allows you to use the C* keyspace as a schema and the
> column families as tables.
>
>
>
> Mohammed
>
>
>
> *From:* pawan kumar [mailto:pkv...@gmail.com]
> *Sent:* Friday, April 3, 2015 7:41 AM
> *To:* Todd Nist
> *Cc:* user@spark.apache.org; Mohammed Guller
> *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra
>
>
>
> Hi Todd,
>
> Thanks for the link. I would be interested in this solution. I am using
> DSE for cassandra. Would you provide me with info on connecting with DSE
> either through Tableau or zeppelin. The goal here is query cassandra
> through spark sql so that I could perform joins and groupby on my queries.
> Are you able to perform spark sql queries with tableau?
>
> Thanks,
> Pawan Venugopal
>
> On Apr 3, 2015 5:03 AM, "Todd Nist"  wrote:
>
> What version of Cassandra are you using?  Are you using DSE or the stock
> Apache Cassandra version?  I have connected it with DSE, but have not
> attempted it with the standard Apache Cassandra version.
>
>
>
> FWIW,
> http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
> provides an ODBC driver for accessing C* from Tableau. Granted it does not
> provide all the goodness of Spark. Are you attempting to leverage the
> spark-cassandra-connector for this?

Migrating from Spark 0.8.0 to Spark 1.3.0

2015-04-03 Thread Ritesh Kumar Singh
Hi,

Are there any tutorials that explain all the changes between Spark
0.8.0 and Spark 1.3.0, and how should we approach this migration?


Re: spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Ankur Chauhan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

Thanks! I'll add the JIRA. I'll also try to work on a patch this weekend
.

- -- Ankur Chauhan

On 03/04/2015 13:23, Tim Chen wrote:
> Hi Ankur,
> 
> There isn't a way to do that yet, but it's simple to add.
> 
> Can you create a JIRA in Spark for this?
> 
> Thanks!
> 
> Tim
> 
> On Fri, Apr 3, 2015 at 1:08 PM, Ankur Chauhan
> mailto:achau...@brightcove.com>> wrote:
> 
> Hi,
> 
> I am trying to figure out if there is a way to tell the mesos 
> scheduler in spark to isolate the workers to a set of mesos slaves 
> that have a given attribute such as `tachyon:true`.
> 
> Anyone knows if that is possible or how I could achieve such a
> behavior.
> 
> Thanks! -- Ankur Chauhan
> 
> 
> 

- -- 
- -- Ankur Chauhan
-BEGIN PGP SIGNATURE-

iQEcBAEBAgAGBQJVHxMDAAoJEOSJAMhvLp3LEPAH/1T7Ywu2W2vEZR/f6KbP+xbd
CiECqbgy1lMw0TxK3jyoiGttTL0uDcgoqev5kjaUFaGgcpsbzZg2jiaqM5RagJRv
55HvGXtSXKQ3l5NlRyMsbmRGVu8qoV2qv2qrCQHLKhVc0ipXEQgSjrkDGx9yP397
Dz1tFMsY/bgvQL0nMAm/HwJokv701IDGeFXFNI4GXhLGcARYDHou4bY0nzZq+w8t
V9vEFji4jyroJmacHdX0np3KsA6tzVItD6Wi9tLKr0+UWDw2Fb1HfYK0CPYX+FK8
dEgZ/hKwNolAzfIF6kHyNKEIf6H6GKihdLxaB23Im7QojvgGNBTqfGV4tGoJLPc=
=KyHk
-END PGP SIGNATURE-




Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Tathagata Das
Yes, it definitely can be added. Just haven't gotten around to doing it :)
There is a proposed patch for this that you can try -
https://github.com/apache/spark/pull/2765/files . Have you reviewed it at
some point?

On Fri, Apr 3, 2015 at 1:08 PM, Adam Ritter  wrote:

> That doesn't seem like a good solution unfortunately as I would be needing
> this to work in a production environment.  Do you know why the limitation
> exists for FileInputDStream in the first place?  Unless I'm missing
> something important about how some of the internals work I don't see why
> this feature could be added in at some point.
>
> On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das 
> wrote:
>
>> A sort-of-hacky workaround is to use a queueStream where you can manually
>> create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
>> that this is for testing only, as queueStream does not work with driver
>> fault recovery.
>>
>> TD
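A rough sketch of that workaround, for illustration only. It assumes an existing StreamingContext named ssc, and, as noted above, queueStream is not recoverable after a driver failure:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    val rddQueue = new mutable.Queue[RDD[String]]()
    val lines = ssc.queueStream(rddQueue)   // DStream fed from whatever RDDs get enqueued

    // Elsewhere (e.g. a scheduled thread), build RDDs over the nested paths yourself and enqueue them:
    rddQueue += ssc.sparkContext.textFile("s3n://mybucket/2015/03/02/*log")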
>>
>> On Fri, Apr 3, 2015 at 12:23 PM, adamgerst  wrote:
>>
>>> So after pulling my hair out for a bit trying to convert one of my
>>> standard
>>> spark jobs to streaming I found that FileInputDStream does not support
>>> nested folders (see the brief mention here
>>>
>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
>>> the fileStream method returns a FileInputDStream).  So before, for my
>>> standard job, I was reading from say
>>>
>>> s3n://mybucket/2015/03/02/*log
>>>
>>> And could also modify it to simply get an entire month's worth of logs.
>>> Since the logs are split up based upon their date, when the batch ran for
>>> the day, I simply passed in a parameter of the date to make sure I was
>>> reading the correct data
>>>
>>> But since I want to turn this job into a streaming job I need to simply
>>> do
>>> something like
>>>
>>> s3n://mybucket/*log
>>>
>>> This would totally work fine if it were a standard spark application, but
>>> fails for streaming.  Is there any way I can get around this limitation?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>>
>>
>


Re: WordCount example

2015-04-03 Thread Tathagata Das
What does the Spark Standalone UI at port 8080 say about number of cores?

On Fri, Apr 3, 2015 at 2:53 PM, Mohit Anchlia 
wrote:

> [ec2-user@ip-10-241-251-232 s_lib]$ cat /proc/cpuinfo |grep process
> processor   : 0
> processor   : 1
> processor   : 2
> processor   : 3
> processor   : 4
> processor   : 5
> processor   : 6
> processor   : 7
>
> On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das  wrote:
>
>> How many cores are present in the workers allocated to the standalone
>> cluster spark://ip-10-241-251-232:7077 ?
>>
>>
>> On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia 
>> wrote:
>>
>>> If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
>>> seems to work. I don't understand why though because when I
>>> give spark://ip-10-241-251-232:7077 application seem to bootstrap
>>> successfully, just doesn't create a socket on port ?
>>>
>>>
>>> On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia 
>>> wrote:
>>>
 I checked the ports using netstat and don't see any connections
 established on that port. Logs show only this:

 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
 app-20150327135048-0002

 Spark ui shows:

 Running Applications
 ID: app-20150327135048-0002 | Name: NetworkWordCount | Cores: 0 | Memory per Node: 512.0 MB | Submitted Time: 2015/03/27 13:50:48 | User: ec2-user | State: WAITING | Duration: 33 s
 The code being executed looks like this:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077

 *public* *static* *void* doWork(String masterUrl){

 SparkConf conf = *new* SparkConf().setMaster(masterUrl).setAppName(
 "NetworkWordCount");

 JavaStreamingContext *jssc* = *new* JavaStreamingContext(conf,
 Durations.*seconds*(1));

 JavaReceiverInputDStream lines = jssc.socketTextStream(
 "localhost", );

 System.*out*.println("Successfully created connection");

 *mapAndReduce*(lines);

  jssc.start(); // Start the computation

 jssc.awaitTermination(); // Wait for the computation to terminate

 }

 *public* *static* *void* main(String ...args){

 *doWork*(args[0]);

 }
 And output of the java program after submitting the task:

 java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to:
 ec2-user
 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(ec2-user);
 users with modify permissions: Set(ec2-user)
 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
 15/03/27 13:50:46 INFO Remoting: Starting remoting
 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on
 addresses
 :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
 :60184]
 15/03/27 13:50:47 INFO Utils: Successfully started service
 'sparkDriver' on port 60184.
 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20150327135047-5399
 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity
 3.5 GB
 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
 server' on port 57955.
 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
 port 4040.
 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
 http://ip-10-241-251-232.us-west-2.compute.internal:4040
 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
 spark://ip-10-241-251-232:7077...
 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
 cluster with app ID app-20150327135048-0002
 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on
 58358
 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register
 BlockManager
 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block
 manager ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
 BlockManagerId(, ip

Re: WordCount example

2015-04-03 Thread Mohit Anchlia
[ec2-user@ip-10-241-251-232 s_lib]$ cat /proc/cpuinfo |grep process
processor   : 0
processor   : 1
processor   : 2
processor   : 3
processor   : 4
processor   : 5
processor   : 6
processor   : 7

On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das  wrote:

> How many cores are present in the workers allocated to the standalone
> cluster spark://ip-10-241-251-232:7077 ?
>
>
> On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia 
> wrote:
>
>> If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
>> seems to work. I don't understand why though because when I
>> give spark://ip-10-241-251-232:7077 application seem to bootstrap
>> successfully, just doesn't create a socket on port ?
>>
>>
>> On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia 
>> wrote:
>>
>>> I checked the ports using netstat and don't see any connections
>>> established on that port. Logs show only this:
>>>
>>> 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
>>> 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
>>> app-20150327135048-0002
>>>
>>> Spark ui shows:
>>>
>>> Running Applications
>>> ID: app-20150327135048-0002 | Name: NetworkWordCount | Cores: 0 | Memory per Node: 512.0 MB | Submitted Time: 2015/03/27 13:50:48 | User: ec2-user | State: WAITING | Duration: 33 s
>>> The code being executed looks like this:
>>>
>>> java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
>>>
>>> *public* *static* *void* doWork(String masterUrl){
>>>
>>> SparkConf conf = *new* SparkConf().setMaster(masterUrl).setAppName(
>>> "NetworkWordCount");
>>>
>>> JavaStreamingContext *jssc* = *new* JavaStreamingContext(conf,
>>> Durations.*seconds*(1));
>>>
>>> JavaReceiverInputDStream lines = jssc.socketTextStream(
>>> "localhost", );
>>>
>>> System.*out*.println("Successfully created connection");
>>>
>>> *mapAndReduce*(lines);
>>>
>>>  jssc.start(); // Start the computation
>>>
>>> jssc.awaitTermination(); // Wait for the computation to terminate
>>>
>>> }
>>>
>>> *public* *static* *void* main(String ...args){
>>>
>>> *doWork*(args[0]);
>>>
>>> }
>>> And output of the java program after submitting the task:
>>>
>>> java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
>>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>> 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
>>> 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to: ec2-user
>>> 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
>>> disabled; ui acls disabled; users with view permissions: Set(ec2-user);
>>> users with modify permissions: Set(ec2-user)
>>> 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
>>> 15/03/27 13:50:46 INFO Remoting: Starting remoting
>>> 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on
>>> addresses
>>> :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
>>> :60184]
>>> 15/03/27 13:50:47 INFO Utils: Successfully started service 'sparkDriver'
>>> on port 60184.
>>> 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
>>> 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
>>> 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
>>> /tmp/spark-local-20150327135047-5399
>>> 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity
>>> 3.5 GB
>>> 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>> 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
>>> /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
>>> 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
>>> 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
>>> server' on port 57955.
>>> 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
>>> port 4040.
>>> 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
>>> http://ip-10-241-251-232.us-west-2.compute.internal:4040
>>> 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
>>> spark://ip-10-241-251-232:7077...
>>> 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
>>> cluster with app ID app-20150327135048-0002
>>> 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on 58358
>>> 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register
>>> BlockManager
>>> 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block
>>> manager ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
>>> BlockManagerId(, ip-10-241-251-232.us-west-2.compute.internal,
>>> 58358)
>>> 15/03/27 13:50:48 INFO BlockManagerMaster: Registered BlockManager
>>> 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is
>>> ready for scheduling beginning after reached minRegisteredResour

RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Mohammed Guller
Thanks, Todd.

It is an interesting idea; worth trying.

I think the cash project is old. The tuplejump guy has created another project 
called CalliopeServer2, which works like a charm with BI tools that use JDBC, 
but unfortunately Tableau throws an error when it connects to it.

Mohammed

From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Friday, April 3, 2015 11:39 AM
To: pawan kumar
Cc: Mohammed Guller; user@spark.apache.org
Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra

Hi Mohammed,

Not sure if you have tried this or not.  You could try using the below api to 
start the thriftserver with an existing context.

https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

The one thing that Michael Armbrust @ Databricks recommended was this:
You can start a JDBC server with an existing context.  See my answer here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

So something like this based on example from Cheng Lian:

Server

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.catalyst.types._

val sparkContext = sc
import sparkContext._
val sqlContext = new HiveContext(sparkContext)
import sqlContext._

makeRDD((1, "hello") :: (2, "world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
// replace the above with the C* + spark-cassandra-connector to generate a SchemaRDD and registerTempTable

import org.apache.spark.sql.hive.thriftserver._
HiveThriftServer2.startWithContext(sqlContext)
Then Startup

./bin/beeline -u jdbc:hive2://localhost:1/default

0: jdbc:hive2://localhost:1/default> select * from t;


I have not tried this yet from Tableau.   My understanding is that the 
tempTable is only valid as long as the sqlContext is, so if one terminates the 
code representing the Server, and then restarts the standard thrift server, 
sbin/start-thriftserver ..., the table won't be available.

Another possibility is to perhaps use the tuplejump cash project, 
https://github.com/tuplejump/cash.

HTH.

-Todd

On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar 
mailto:pkv...@gmail.com>> wrote:

Thanks Mohammed. Will give it a try today. We would also need the Spark SQL
piece as we are migrating our data store from Oracle to C*, and it would be
easier to maintain all the reports rather than recreating each one from scratch.

Thanks,
Pawan Venugopal.
On Apr 3, 2015 7:59 AM, "Mohammed Guller" 
mailto:moham...@glassbeam.com>> wrote:
Hi Todd,

We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly with C* 
using the ODBC driver, but now would like to add Spark SQL to the mix. I 
haven’t been able to find any documentation for how to make this combination 
work.

We are using the Spark-Cassandra-Connector in our applications, but haven’t 
been able to figure out how to get the Spark SQL Thrift Server to use it and 
connect to C*. That is the missing piece. Once we solve that piece of the 
puzzle then Tableau should be able to see the tables in C*.
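For what it's worth, a speculative sketch of that missing piece. It assumes the spark-cassandra-connector is on the driver classpath, that spark.cassandra.connection.host is already set on the conf, and a hypothetical keyspace ks with a table users(id int, name text); it has not been verified against Tableau:

    import com.datastax.spark.connector._
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    case class User(id: Int, name: String)

    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Pull the C* table in through the connector and register it as a temp table...
    val users = sc.cassandraTable[User]("ks", "users").toDF()
    users.cache().registerTempTable("users")

    // ...then expose this context over JDBC so Tableau/beeline can query "users".
    HiveThriftServer2.startWithContext(sqlContext)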

Hi Pawan,
Tableau + C* is pretty straightforward, especially if you are using DSE.
Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once you
connect, Tableau allows you to use the C* keyspace as a schema and the column
families as tables.

Mohammed

From: pawan kumar [mailto:pkv...@gmail.com]
Sent: Friday, April 3, 2015 7:41 AM
To: Todd Nist
Cc: user@spark.apache.org; Mohammed Guller
Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra


Hi Todd,

Thanks for the link. I would be interested in this solution. I am using DSE for
Cassandra. Would you provide me with info on connecting with DSE either through
Tableau or Zeppelin? The goal here is to query Cassandra through Spark SQL so that
I can perform joins and group-bys in my queries. Are you able to perform Spark
SQL queries with Tableau?

Thanks,
Pawan Venugopal
On Apr 3, 2015 5:03 AM, "Todd Nist" 
mailto:tsind...@gmail.com>> wrote:
What version of Cassandra are you using?  Are you using DSE or the stock Apache 
Cassandra version?  I have connected it with DSE, but have not attempted it 
with the standard Apache Cassandra version.

FWIW, 
http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
 provides an ODBC driver for accessing C* from Tableau.  Granted it does not 
provide all the goodness of Spark.  Are you attempting to leverage the 
spark-cassandra-connector for this?



On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Hi –

Is anybody using Tableau to analyze data in Cassandra through the Spark SQL 
Thrift Server?

Thanks!

Mohammed





Re: Regarding MLLIB sparse and dense matrix

2015-04-03 Thread Joseph Bradley
If you can examine your data matrix and know that about < 1/6 or so of the
values are non-zero (so > 5/6 are zeros), then it's probably worth using
sparse vectors.  (1/6 is a rough estimate.)

There is support for L1 and L2 regularization.  You can look at the guide
here:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression
and the API docs linked from the menu.
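To make that concrete, a small sketch using the MLlib 1.3-era API (the vector size, indices and parameters below are placeholders):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.L1Updater
    import org.apache.spark.mllib.regression.LabeledPoint

    // Sparse encoding: size 6, non-zeros only at indices 0 and 3 (worthwhile when most entries are zero).
    val example = LabeledPoint(1.0, Vectors.sparse(6, Array(0, 3), Array(0.9, 0.4)))

    // Regularized logistic regression: keep LogisticRegressionWithSGD, but configure its optimizer.
    val lr = new LogisticRegressionWithSGD()
    lr.optimizer
      .setNumIterations(100)
      .setRegParam(0.1)
      .setUpdater(new L1Updater)   // or new SquaredL2Updater for L2
    // val model = lr.run(trainingData)   // trainingData: RDD[LabeledPoint], assumed to exist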

On Fri, Apr 3, 2015 at 1:24 PM, Jeetendra Gangele 
wrote:

> Hi All
> I am building a logistic regression model for matching person data. Say
> two person objects are given with their attributes and we need to compute a
> match score: on one side you have 10 million records and on the other side
> we have 1 record, and we need to tell which record matches with the highest score.
>
> I am storing the scores of the similarity algorithms in a dense matrix and
> considering them as features. I will apply many similarity algorithms to each attribute.
>
> Should I use sparse or dense? What happens with dense when a score is null or
> when some of the attributes are missing?
>
> Is there any support for regularized logistic regression? Currently I am
> using LogisticRegressionWithSGD.
>
> Regards
> jeetendra
>


Re: WordCount example

2015-04-03 Thread Tathagata Das
How many cores are present in the workers allocated to the standalone cluster
spark://ip-10-241-251-232:7077 ?


On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia 
wrote:

> If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
> seems to work. I don't understand why though because when I
> give spark://ip-10-241-251-232:7077 application seem to bootstrap
> successfully, just doesn't create a socket on port ?
>
>
> On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia 
> wrote:
>
>> I checked the ports using netstat and don't see any connections
>> established on that port. Logs show only this:
>>
>> 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
>> 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
>> app-20150327135048-0002
>>
>> Spark ui shows:
>>
>> Running Applications
>> ID: app-20150327135048-0002 | Name: NetworkWordCount | Cores: 0 | Memory per Node: 512.0 MB | Submitted Time: 2015/03/27 13:50:48 | User: ec2-user | State: WAITING | Duration: 33 s
>> The code being executed looks like this:
>>
>> java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
>>
>> *public* *static* *void* doWork(String masterUrl){
>>
>> SparkConf conf = *new* SparkConf().setMaster(masterUrl).setAppName(
>> "NetworkWordCount");
>>
>> JavaStreamingContext *jssc* = *new* JavaStreamingContext(conf, Durations.
>> *seconds*(1));
>>
>> JavaReceiverInputDStream lines = jssc.socketTextStream(
>> "localhost", );
>>
>> System.*out*.println("Successfully created connection");
>>
>> *mapAndReduce*(lines);
>>
>>  jssc.start(); // Start the computation
>>
>> jssc.awaitTermination(); // Wait for the computation to terminate
>>
>> }
>>
>> *public* *static* *void* main(String ...args){
>>
>> *doWork*(args[0]);
>>
>> }
>> And output of the java program after submitting the task:
>>
>> java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
>> Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>> 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
>> 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to: ec2-user
>> 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users with view permissions: Set(ec2-user);
>> users with modify permissions: Set(ec2-user)
>> 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
>> 15/03/27 13:50:46 INFO Remoting: Starting remoting
>> 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
>> :60184]
>> 15/03/27 13:50:47 INFO Utils: Successfully started service 'sparkDriver'
>> on port 60184.
>> 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
>> 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
>> 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
>> /tmp/spark-local-20150327135047-5399
>> 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity 3.5
>> GB
>> 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
>> /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
>> 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
>> 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
>> server' on port 57955.
>> 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
>> port 4040.
>> 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
>> http://ip-10-241-251-232.us-west-2.compute.internal:4040
>> 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
>> spark://ip-10-241-251-232:7077...
>> 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
>> cluster with app ID app-20150327135048-0002
>> 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on 58358
>> 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register BlockManager
>> 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block manager
>> ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
>> BlockManagerId(, ip-10-241-251-232.us-west-2.compute.internal,
>> 58358)
>> 15/03/27 13:50:48 INFO BlockManagerMaster: Registered BlockManager
>> 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is
>> ready for scheduling beginning after reached minRegisteredResourcesRatio:
>> 0.0
>> 15/03/27 13:50:48 INFO ReceiverTracker: ReceiverTracker started
>> 15/03/27 13:50:48 INFO ForEachDStream: metadataCleanupDelay = -1
>> 15/03/27 13:50:48 INFO ShuffledDStream: metadataCleanupDelay = -1
>> 15/03/27 13:50:48 INFO MappedDStream: metadataCleanupDelay = -1
>> 15/03/27 13:50:48 INFO FlatMappedDStream: metadataCleanupDelay = -1
>> 15/03/27 13:50:48 INFO SocketInputDStream: 

Re: MLlib: save models to HDFS?

2015-04-03 Thread Xiangrui Meng
In 1.3, you can use model.save(sc, "hdfs path"). You can check the
code examples here:
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples.
-Xiangrui
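A minimal sketch of that, assuming an existing ratings: RDD[Rating] and an HDFS path of your choosing:

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

    val model = ALS.train(ratings, 10, 10, 0.01)            // rank = 10, iterations = 10, lambda = 0.01
    model.save(sc, "hdfs:///models/als-cf")                 // writes the factors plus metadata
    val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///models/als-cf")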

On Fri, Apr 3, 2015 at 2:17 PM, Justin Yip  wrote:
> Hello Zhou,
>
> You can look at the recommendation template of PredictionIO. PredictionIO is
> built on top of Spark, and this template illustrates how you can save
> the ALS model to HDFS and then reload it later.
>
> Justin
>
>
> On Fri, Apr 3, 2015 at 9:16 AM, S. Zhou  wrote:
>>
>> I am new to MLlib so I have a basic question: is it possible to save MLlib
>> models (particularly CF models) to HDFS and then reload them later? If yes,
>> could you share some sample code (I could not find it in the MLlib tutorial).
>> Thanks!
>
>




Re: WordCount example

2015-04-03 Thread Mohit Anchlia
If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
seems to work. I don't understand why though, because when I
give spark://ip-10-241-251-232:7077 the application seems to bootstrap
successfully; it just doesn't create a socket on port ?


On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia 
wrote:

> I checked the ports using netstat and don't see any connections
> established on that port. Logs show only this:
>
> 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount
> 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID
> app-20150327135048-0002
>
> Spark ui shows:
>
> Running Applications
> ID: app-20150327135048-0002 | Name: NetworkWordCount | Cores: 0 | Memory per Node: 512.0 MB | Submitted Time: 2015/03/27 13:50:48 | User: ec2-user | State: WAITING | Duration: 33 s
> The code being executed looks like this:
>
> java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
>
> *public* *static* *void* doWork(String masterUrl){
>
> SparkConf conf = *new* SparkConf().setMaster(masterUrl).setAppName(
> "NetworkWordCount");
>
> JavaStreamingContext *jssc* = *new* JavaStreamingContext(conf, Durations.
> *seconds*(1));
>
> JavaReceiverInputDStream lines = jssc.socketTextStream("localhost",
> );
>
> System.*out*.println("Successfully created connection");
>
> *mapAndReduce*(lines);
>
>  jssc.start(); // Start the computation
>
> jssc.awaitTermination(); // Wait for the computation to terminate
>
> }
>
> *public* *static* *void* main(String ...args){
>
> *doWork*(args[0]);
>
> }
> And output of the java program after submitting the task:
>
> java -cp .:* org.spark.test.WordCount spark://ip-10-241-251-232:7077
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 15/03/27 13:50:46 INFO SecurityManager: Changing view acls to: ec2-user
> 15/03/27 13:50:46 INFO SecurityManager: Changing modify acls to: ec2-user
> 15/03/27 13:50:46 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(ec2-user);
> users with modify permissions: Set(ec2-user)
> 15/03/27 13:50:46 INFO Slf4jLogger: Slf4jLogger started
> 15/03/27 13:50:46 INFO Remoting: Starting remoting
> 15/03/27 13:50:47 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkdri...@ip-10-241-251-232.us-west-2.compute.internal
> :60184]
> 15/03/27 13:50:47 INFO Utils: Successfully started service 'sparkDriver'
> on port 60184.
> 15/03/27 13:50:47 INFO SparkEnv: Registering MapOutputTracker
> 15/03/27 13:50:47 INFO SparkEnv: Registering BlockManagerMaster
> 15/03/27 13:50:47 INFO DiskBlockManager: Created local directory at
> /tmp/spark-local-20150327135047-5399
> 15/03/27 13:50:47 INFO MemoryStore: MemoryStore started with capacity 3.5
> GB
> 15/03/27 13:50:47 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 15/03/27 13:50:47 INFO HttpFileServer: HTTP File server directory is
> /tmp/spark-7e26df49-1520-4c77-b411-c837da59fa5b
> 15/03/27 13:50:47 INFO HttpServer: Starting HTTP Server
> 15/03/27 13:50:47 INFO Utils: Successfully started service 'HTTP file
> server' on port 57955.
> 15/03/27 13:50:47 INFO Utils: Successfully started service 'SparkUI' on
> port 4040.
> 15/03/27 13:50:47 INFO SparkUI: Started SparkUI at
> http://ip-10-241-251-232.us-west-2.compute.internal:4040
> 15/03/27 13:50:47 INFO AppClient$ClientActor: Connecting to master
> spark://ip-10-241-251-232:7077...
> 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: Connected to Spark
> cluster with app ID app-20150327135048-0002
> 15/03/27 13:50:48 INFO NettyBlockTransferService: Server created on 58358
> 15/03/27 13:50:48 INFO BlockManagerMaster: Trying to register BlockManager
> 15/03/27 13:50:48 INFO BlockManagerMasterActor: Registering block manager
> ip-10-241-251-232.us-west-2.compute.internal:58358 with 3.5 GB RAM,
> BlockManagerId(, ip-10-241-251-232.us-west-2.compute.internal,
> 58358)
> 15/03/27 13:50:48 INFO BlockManagerMaster: Registered BlockManager
> 15/03/27 13:50:48 INFO SparkDeploySchedulerBackend: SchedulerBackend is
> ready for scheduling beginning after reached minRegisteredResourcesRatio:
> 0.0
> 15/03/27 13:50:48 INFO ReceiverTracker: ReceiverTracker started
> 15/03/27 13:50:48 INFO ForEachDStream: metadataCleanupDelay = -1
> 15/03/27 13:50:48 INFO ShuffledDStream: metadataCleanupDelay = -1
> 15/03/27 13:50:48 INFO MappedDStream: metadataCleanupDelay = -1
> 15/03/27 13:50:48 INFO FlatMappedDStream: metadataCleanupDelay = -1
> 15/03/27 13:50:48 INFO SocketInputDStream: metadataCleanupDelay = -1
> 15/03/27 13:50:48 INFO SocketInputDStream: Slide time = 1000 ms
> 15/03/27 13:50:48 INFO SocketInputDStream: Storage level =
> StorageLevel(false, false, false, false, 1)
> 15/03/27 13:50:48 INFO SocketInputDStream: Checkpoint interval = null
> 15/03/27 13:5

Re: MLlib: save models to HDFS?

2015-04-03 Thread Justin Yip
Hello Zhou,

You can look at the recommendation template
of PredictionIO. PredictionIO is built on top of Spark, and this
template illustrates how you can save the ALS model to HDFS and then reload
it later.

Justin


On Fri, Apr 3, 2015 at 9:16 AM, S. Zhou  wrote:

> I am new to MLlib so I have a basic question: is it possible to save MLlib
> models (particularly CF models) to HDFS and then reload it later? If yes,
> could u share some sample code (I could not find it in MLlib tutorial).
> Thanks!
>


Re: variant record by case classes in shell fails?

2015-04-03 Thread Michael Albert
My apologies for following up on my own post, but a friend just pointed out that if I
use kryo with reference counting AND copy-and-paste, this runs.
However, if I try to "load ", this fails as described below.
I thought load was supposed to be equivalent?
Thanks!
-Mike

  From: Michael Albert 
 To: User  
 Sent: Friday, April 3, 2015 2:45 PM
 Subject: variant record by case classes in shell fails?
   
Greetings!
For me, the code below fails from the shell. However, I can do essentially the
same from compiled code, exporting the jar.
If I use default serialization or kryo with reference tracking, the error
message tells me it can't find the constructor for "A". If I use kryo with
reference tracking, I get a stack overflow.
I'm using Spark 1.2.1 on AWS EMR (hadoop 2.4).
I've also tried putting this code inside an object.
Is this just me?  Am I overlooking something obvious?
Thanks!
-Mike
:paste
sealed class AorB
case class A(i: Int) extends AorB
case class B(i: Int, j: Int) extends AorB

sc.parallelize(0.until(1)).map { _ =>
  val x = A(1)
  x
}.collect()
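For reference, one way to spell out the kryo settings being discussed, applied to the conf before the context starts. Explicitly registering the case classes is an extra step the original post does not mention, and whether any of this helps :load is exactly the open question here:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.referenceTracking", "true")           // the "reference counting/tracking" mentioned above
      .registerKryoClasses(Array(classOf[A], classOf[B]))    // register the case classes explicitly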



  

Regarding MLLIB sparse and dense matrix

2015-04-03 Thread Jeetendra Gangele
Hi All
I am building a logistic regression model for matching person data. Say
two person objects are given with their attributes and we need to compute a
match score: on one side you have 10 million records and on the other side
we have 1 record, and we need to tell which record matches with the highest score.

I am storing the scores of the similarity algorithms in a dense matrix and
considering them as features. I will apply many similarity algorithms to each attribute.

Should I use sparse or dense? What happens with dense when a score is null or
when some of the attributes are missing?

Is there any support for regularized logistic regression? Currently I am
using LogisticRegressionWithSGD.

Regards
jeetendra


Re: spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Tim Chen
Hi Ankur,

There isn't a way to do that yet, but it's simple to add.

Can you create a JIRA in Spark for this?

Thanks!

Tim

On Fri, Apr 3, 2015 at 1:08 PM, Ankur Chauhan 
wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> Hi,
>
> I am trying to figure out if there is a way to tell the mesos
> scheduler in spark to isolate the workers to a set of mesos slaves
> that have a given attribute such as `tachyon:true`.
>
> Anyone knows if that is possible or how I could achieve such a behavior.
>
> Thanks!
> - -- Ankur Chauhan
> -BEGIN PGP SIGNATURE-
>
> iQEcBAEBAgAGBQJVHvMlAAoJEOSJAMhvLp3LaV0H/jtX+KQDyorUESLIKIxFV9KM
> QjyPtVquwuZYcwLqCfQbo62RgE/LeTjjxzifTzMM5D6cf4ULBH1TcS3Is2EdOhSm
> UTMfJyvK06VFvYMLiGjqN4sBG3DFdamQif18qUJoKXX/Z9cUQO9SaSjIezSq2gd8
> 0lM3NLEQjsXY5uRJyl9GYDxcFsXPVzt1crXAdrtVsIYAlFmhcrm1n/5+Peix89Oh
> vgK1J7e0ei7Rc4/3BR2xr8f9us+Jfqym/xe+45h1YYZxZWrteCa48NOGixuUJjJe
> zb1MxNrTFZhPrKFT7pz9kCUZXl7DW5hzoQCH07CXZZI3B7kFS+5rjuEIB9qZXPE=
> =cadl
> -END PGP SIGNATURE-
>
>
>


Re: Spark + Kinesis

2015-04-03 Thread Daniil Osipov
Assembly settings have an option to exclude jars. You need something
similar to:
assemblyExcludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
  val excludes = Set("minlog-1.2.jar")
  cp filter { jar => excludes(jar.data.getName) }
}

in your build file (may need to be refactored into a .scala file)
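Alternatively, a hedged sketch in the same older sbt-assembly style as above: a merge strategy can keep one copy of the duplicated minlog class instead of excluding the whole jar:

    // Sketch: the mergeStrategy key maps each entry path (a String) to a strategy.
    mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
      {
        case path if path.contains("com/esotericsoftware/minlog") => MergeStrategy.first
        case path => old(path)
      }
    }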

On Fri, Apr 3, 2015 at 12:57 PM, Vadim Bichutskiy <
vadim.bichuts...@gmail.com> wrote:

> Remove provided and got the following error:
>
> [error] (*:assembly) deduplicate: different file contents found in the
> following:
>
> [error]
> /Users/vb/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:com/esotericsoftware/minlog/Log$Logger.class
>
> [error]
> /Users/vb/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:com/esotericsoftware/minlog/Log$Logger.class
>
> On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das  wrote:
>
>> Just remove "provided" for spark-streaming-kinesis-asl
>>
>> libraryDependencies += "org.apache.spark" %%
>> "spark-streaming-kinesis-asl" % "1.3.0"
>>
>> On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy <
>> vadim.bichuts...@gmail.com> wrote:
>>
>>> Thanks. So how do I fix it?
>>>
>>> On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan 
>>> wrote:
>>>
   spark-streaming-kinesis-asl is not part of the Spark distribution on
 your cluster, so you cannot have it be just a "provided" dependency.  This
 is also why the KCL and its dependencies were not included in the assembly
 (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy 
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly 
 Cc: "user@spark.apache.org" 
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it
 into an "uber jar" following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 and example
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 .

  However when I try to spark-submit it I get the following exception:

  *Exception in thread "main" java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider*

  Do I need to include KCL dependency in *build.sbt*, here's what it
 looks like currently:

  import AssemblyKeys._
 name := "Kinesis Consumer"
 version := "1.0"
 organization := "com.myconsumer"
 scalaVersion := "2.11.5"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
 "provided"
 libraryDependencies += "org.apache.spark" %% "spark-streaming" %
 "1.3.0" % "provided"
 libraryDependencies += "org.apache.spark" %%
 "spark-streaming-kinesis-asl" % "1.3.0" % "provided"

  assemblySettings
 jarName in assembly :=  "consumer-assembly.jar"
 assemblyOption in assembly := (assemblyOption in
 assembly).value.copy(includeScala=false)

  Any help appreciated.

  Thanks,
 Vadim

 On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan 
 wrote:

>  It looks like you're attempting to mix Scala versions, so that's
> going to cause some problems.  If you really want to use Scala 2.11.5, you
> must also use Spark package versions built for Scala 2.11 rather than
> 2.10.  Anyway, that's not quite the correct way to specify Scala
> dependencies in build.sbt.  Instead of placing the Scala version after the
> artifactId (like "spark-core_2.10"), what you actually want is to use just
> "spark-core" with two percent signs before it.  Using two percent signs
> will make it use the version of Scala that matches your declared
> scalaVersion.  For example:
>
>  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
> % "provided"
>
>  libraryDependencies += "org.apache.spark" %% "spark-streaming" %
> "1.3.0" % "provided"
>
>  libraryDependencies += "org.apache.spark" %%
> "spark-streaming-kinesis-asl" % "1.3.0"
>
>  I think that may get you a little closer, though I think you're
> probably going to run into the same problems I ran into in this thread:
> https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
> never really got an answer for that, and I temporarily moved on to other
> things for now.
>
>
>  ~ Jonathan Kelly
>
>   From: 'Vadim Bichutskiy' 
> Date: Thursday, April 2, 2015 at 9:53 AM
> To: "user@spark.apache.org" 
> Subject: Spark + Kinesis
>
>   Hi all,
>
>  I am trying to write an Amazon Kinesis consumer Scala app that
> processes data in the
> Kinesis stream. Is this the correct way to specify *build.sbt*:
>
>  ---
> *import AssemblyKeys._*
> *name := "Kinesis Consumer"*
>
>
>
>
>
>
> *version := "1.0"

Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Adam Ritter
That doesn't seem like a good solution, unfortunately, as I would need
this to work in a production environment.  Do you know why the limitation
exists for FileInputDStream in the first place?  Unless I'm missing
something important about how some of the internals work, I don't see why
this feature couldn't be added at some point.

On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das  wrote:

> A sort-of-hacky workaround is to use a queueStream where you can manually
> create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
> that this is for testing only, as queueStream does not work with driver
> fault recovery.
>
> TD
>
> On Fri, Apr 3, 2015 at 12:23 PM, adamgerst  wrote:
>
>> So after pulling my hair out for a bit trying to convert one of my
>> standard
>> spark jobs to streaming I found that FileInputDStream does not support
>> nested folders (see the brief mention here
>>
>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
>> the fileStream method returns a FileInputDStream).  So before, for my
>> standard job, I was reading from say
>>
>> s3n://mybucket/2015/03/02/*log
>>
>> And could also modify it to simply get an entire month's worth of logs.
>> Since the logs are split up based upon their date, when the batch ran for
>> the day, I simply passed in a parameter of the date to make sure I was
>> reading the correct data
>>
>> But since I want to turn this job into a streaming job I need to simply do
>> something like
>>
>> s3n://mybucket/*log
>>
>> This would totally work fine if it were a standard spark application, but
>> fails for streaming.  Is there any way I can get around this limitation?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>


spark mesos deployment : starting workers based on attributes

2015-04-03 Thread Ankur Chauhan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

I am trying to figure out if there is a way to tell the mesos
scheduler in spark to isolate the workers to a set of mesos slaves
that have a given attribute such as `tachyon:true`.

Anyone knows if that is possible or how I could achieve such a behavior.

Thanks!
- -- Ankur Chauhan
-BEGIN PGP SIGNATURE-

iQEcBAEBAgAGBQJVHvMlAAoJEOSJAMhvLp3LaV0H/jtX+KQDyorUESLIKIxFV9KM
QjyPtVquwuZYcwLqCfQbo62RgE/LeTjjxzifTzMM5D6cf4ULBH1TcS3Is2EdOhSm
UTMfJyvK06VFvYMLiGjqN4sBG3DFdamQif18qUJoKXX/Z9cUQO9SaSjIezSq2gd8
0lM3NLEQjsXY5uRJyl9GYDxcFsXPVzt1crXAdrtVsIYAlFmhcrm1n/5+Peix89Oh
vgK1J7e0ei7Rc4/3BR2xr8f9us+Jfqym/xe+45h1YYZxZWrteCa48NOGixuUJjJe
zb1MxNrTFZhPrKFT7pz9kCUZXl7DW5hzoQCH07CXZZI3B7kFS+5rjuEIB9qZXPE=
=cadl
-END PGP SIGNATURE-




Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
Remove provided and got the following error:

[error] (*:assembly) deduplicate: different file contents found in the
following:

[error]
/Users/vb/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:com/esotericsoftware/minlog/Log$Logger.class

[error]
/Users/vb/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:com/esotericsoftware/minlog/Log$Logger.class

On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das  wrote:

> Just remove "provided" for spark-streaming-kinesis-asl
>
> libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl"
> % "1.3.0"
>
> On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy <
> vadim.bichuts...@gmail.com> wrote:
>
>> Thanks. So how do I fix it?
>>
>> On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan 
>> wrote:
>>
>>>   spark-streaming-kinesis-asl is not part of the Spark distribution on
>>> your cluster, so you cannot have it be just a "provided" dependency.  This
>>> is also why the KCL and its dependencies were not included in the assembly
>>> (but yes, they should be).
>>>
>>>
>>>  ~ Jonathan Kelly
>>>
>>>   From: Vadim Bichutskiy 
>>> Date: Friday, April 3, 2015 at 12:26 PM
>>> To: Jonathan Kelly 
>>> Cc: "user@spark.apache.org" 
>>> Subject: Re: Spark + Kinesis
>>>
>>>   Hi all,
>>>
>>>  Good news! I was able to create a Kinesis consumer and assemble it
>>> into an "uber jar" following
>>> http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>> and example
>>> https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
>>> .
>>>
>>>  However when I try to spark-submit it I get the following exception:
>>>
>>>  *Exception in thread "main" java.lang.NoClassDefFoundError:
>>> com/amazonaws/auth/AWSCredentialsProvider*
>>>
>>>  Do I need to include KCL dependency in *build.sbt*, here's what it
>>> looks like currently:
>>>
>>>  import AssemblyKeys._
>>> name := "Kinesis Consumer"
>>> version := "1.0"
>>> organization := "com.myconsumer"
>>> scalaVersion := "2.11.5"
>>>
>>>  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
>>> "provided"
>>> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0"
>>> % "provided"
>>> libraryDependencies += "org.apache.spark" %%
>>> "spark-streaming-kinesis-asl" % "1.3.0" % "provided"
>>>
>>>  assemblySettings
>>> jarName in assembly :=  "consumer-assembly.jar"
>>> assemblyOption in assembly := (assemblyOption in
>>> assembly).value.copy(includeScala=false)
>>>
>>>  Any help appreciated.
>>>
>>>  Thanks,
>>> Vadim
>>>
>>> On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan 
>>> wrote:
>>>
  It looks like you're attempting to mix Scala versions, so that's
 going to cause some problems.  If you really want to use Scala 2.11.5, you
 must also use Spark package versions built for Scala 2.11 rather than
 2.10.  Anyway, that's not quite the correct way to specify Scala
 dependencies in build.sbt.  Instead of placing the Scala version after the
 artifactId (like "spark-core_2.10"), what you actually want is to use just
 "spark-core" with two percent signs before it.  Using two percent signs
 will make it use the version of Scala that matches your declared
 scalaVersion.  For example:

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
 "provided"

  libraryDependencies += "org.apache.spark" %% "spark-streaming" %
 "1.3.0" % "provided"

  libraryDependencies += "org.apache.spark" %%
 "spark-streaming-kinesis-asl" % "1.3.0"

  I think that may get you a little closer, though I think you're
 probably going to run into the same problems I ran into in this thread:
 https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
 never really got an answer for that, and I temporarily moved on to other
 things for now.


  ~ Jonathan Kelly

   From: 'Vadim Bichutskiy' 
 Date: Thursday, April 2, 2015 at 9:53 AM
 To: "user@spark.apache.org" 
 Subject: Spark + Kinesis

   Hi all,

  I am trying to write an Amazon Kinesis consumer Scala app that
 processes data in the
 Kinesis stream. Is this the correct way to specify *build.sbt*:

  ---
 import AssemblyKeys._
 name := "Kinesis Consumer"
 version := "1.0"
 organization := "com.myconsumer"
 scalaVersion := "2.11.5"

 libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
 "org.apache.spark" % "spark-streaming_2.10" % "1.3.0"
 "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

 assemblySettings
 jarName in assembly :=  "consumer-assembly.jar"
 assemblyOption in assembly := (assemblyOption in
 assembly).value.copy(includeScala=false)
 

  In *project/assembly.sbt* I have only the following line:

>>>

Re: Spark + Kinesis

2015-04-03 Thread Kelly, Jonathan
Just remove "provided" from the end of the line where you specify the 
spark-streaming-kinesis-asl dependency.  That will cause that package and all 
of its transitive dependencies (including the KCL, the AWS Java SDK libraries 
and other transitive dependencies) to be included in your "uber jar".  They all 
must be in there because they are not part of the Spark distribution in your 
cluster.

However, as I mentioned before, I think making this change might cause you to 
run into the same problems I spoke of in the thread I linked below 
(https://www.mail-archive.com/user@spark.apache.org/msg23891.html), and 
unfortunately I haven't solved that yet.
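For what it's worth, the "deduplicate" assembly errors that can follow (for example the kryo/minlog Log$Logger.class conflict shown near the top of this digest) are commonly worked around with a merge strategy in build.sbt. A minimal sketch only, assuming the newer sbt-assembly key names (0.12+); with the older assemblySettings style the key is "mergeStrategy in assembly", and the specific cases are illustrative and should be adapted to the actual conflicts reported:

assemblyMergeStrategy in assembly := {
  // Keep a single copy of the conflicting minlog/kryo logging classes.
  case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first
  // Everything else falls back to the default strategy.
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}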

~ Jonathan Kelly

From: Vadim Bichutskiy 
mailto:vadim.bichuts...@gmail.com>>
Date: Friday, April 3, 2015 at 12:45 PM
To: Jonathan Kelly mailto:jonat...@amazon.com>>
Cc: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: Re: Spark + Kinesis

Thanks. So how do I fix it?


On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan 
mailto:jonat...@amazon.com>> wrote:
spark-streaming-kinesis-asl is not part of the Spark distribution on your 
cluster, so you cannot have it be just a "provided" dependency.  This is also 
why the KCL and its dependencies were not included in the assembly (but yes, 
they should be).

~ Jonathan Kelly

From: Vadim Bichutskiy 
mailto:vadim.bichuts...@gmail.com>>
Date: Friday, April 3, 2015 at 12:26 PM
To: Jonathan Kelly mailto:jonat...@amazon.com>>
Cc: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: Re: Spark + Kinesis

Hi all,

Good news! I was able to create a Kinesis consumer and assemble it into an 
"uber jar" following 
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and 
example 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala.

However when I try to spark-submit it I get the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: 
com/amazonaws/auth/AWSCredentialsProvider

Do I need to include KCL dependency in build.sbt, here's what it looks like 
currently:

import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % 
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % 
"1.3.0" % "provided"

assemblySettings
jarName in assembly :=  "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in 
assembly).value.copy(includeScala=false)

Any help appreciated.

Thanks,
Vadim

On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan 
mailto:jonat...@amazon.com>> wrote:
It looks like you're attempting to mix Scala versions, so that's going to cause 
some problems.  If you really want to use Scala 2.11.5, you must also use Spark 
package versions built for Scala 2.11 rather than 2.10.  Anyway, that's not 
quite the correct way to specify Scala dependencies in build.sbt.  Instead of 
placing the Scala version after the artifactId (like "spark-core_2.10"), what 
you actually want is to use just "spark-core" with two percent signs before it. 
 Using two percent signs will make it use the version of Scala that matches 
your declared scalaVersion.  For example:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % 
"provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % 
"1.3.0"

I think that may get you a little closer, though I think you're probably going 
to run into the same problems I ran into in this thread: 
https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I never 
really got an answer for that, and I temporarily moved on to other things for 
now.

~ Jonathan Kelly

From: 'Vadim Bichutskiy' 
mailto:vadim.bichuts...@gmail.com>>
Date: Thursday, April 2, 2015 at 9:53 AM
To: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: Spark + Kinesis

Hi all,

I am trying to write an Amazon Kinesis consumer Scala app that processes data 
in the
Kinesis stream. Is this the correct way to specify build.sbt:

---
import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % 
"provided",
"org.apache.spark" % "spark-streaming_2.10" % "1.3.0"
"org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

assemblySettings
jarName in assembly :=  "consumer-assembly.jar"
assemblyOption in assembly := (

Re: Spark + Kinesis

2015-04-03 Thread Tathagata Das
Just remove "provided" for spark-streaming-kinesis-asl

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl"
% "1.3.0"

On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy <
vadim.bichuts...@gmail.com> wrote:

> Thanks. So how do I fix it?
>
> On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan 
> wrote:
>
>>   spark-streaming-kinesis-asl is not part of the Spark distribution on
>> your cluster, so you cannot have it be just a "provided" dependency.  This
>> is also why the KCL and its dependencies were not included in the assembly
>> (but yes, they should be).
>>
>>
>>  ~ Jonathan Kelly
>>
>>   From: Vadim Bichutskiy 
>> Date: Friday, April 3, 2015 at 12:26 PM
>> To: Jonathan Kelly 
>> Cc: "user@spark.apache.org" 
>> Subject: Re: Spark + Kinesis
>>
>>   Hi all,
>>
>>  Good news! I was able to create a Kinesis consumer and assemble it into
>> an "uber jar" following
>> http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>> and example
>> https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
>> .
>>
>>  However when I try to spark-submit it I get the following exception:
>>
>>  *Exception in thread "main" java.lang.NoClassDefFoundError:
>> com/amazonaws/auth/AWSCredentialsProvider*
>>
>>  Do I need to include KCL dependency in *build.sbt*, here's what it
>> looks like currently:
>>
>>  import AssemblyKeys._
>> name := "Kinesis Consumer"
>> version := "1.0"
>> organization := "com.myconsumer"
>> scalaVersion := "2.11.5"
>>
>>  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
>> "provided"
>> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0"
>> % "provided"
>> libraryDependencies += "org.apache.spark" %%
>> "spark-streaming-kinesis-asl" % "1.3.0" % "provided"
>>
>>  assemblySettings
>> jarName in assembly :=  "consumer-assembly.jar"
>> assemblyOption in assembly := (assemblyOption in
>> assembly).value.copy(includeScala=false)
>>
>>  Any help appreciated.
>>
>>  Thanks,
>> Vadim
>>
>> On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan 
>> wrote:
>>
>>>  It looks like you're attempting to mix Scala versions, so that's going
>>> to cause some problems.  If you really want to use Scala 2.11.5, you must
>>> also use Spark package versions built for Scala 2.11 rather than 2.10.
>>> Anyway, that's not quite the correct way to specify Scala dependencies in
>>> build.sbt.  Instead of placing the Scala version after the artifactId (like
>>> "spark-core_2.10"), what you actually want is to use just "spark-core" with
>>> two percent signs before it.  Using two percent signs will make it use the
>>> version of Scala that matches your declared scalaVersion.  For example:
>>>
>>>  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
>>> "provided"
>>>
>>>  libraryDependencies += "org.apache.spark" %% "spark-streaming" %
>>> "1.3.0" % "provided"
>>>
>>>  libraryDependencies += "org.apache.spark" %%
>>> "spark-streaming-kinesis-asl" % "1.3.0"
>>>
>>>  I think that may get you a little closer, though I think you're
>>> probably going to run into the same problems I ran into in this thread:
>>> https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
>>> never really got an answer for that, and I temporarily moved on to other
>>> things for now.
>>>
>>>
>>>  ~ Jonathan Kelly
>>>
>>>   From: 'Vadim Bichutskiy' 
>>> Date: Thursday, April 2, 2015 at 9:53 AM
>>> To: "user@spark.apache.org" 
>>> Subject: Spark + Kinesis
>>>
>>>   Hi all,
>>>
>>>  I am trying to write an Amazon Kinesis consumer Scala app that
>>> processes data in the
>>> Kinesis stream. Is this the correct way to specify *build.sbt*:
>>>
>>>  ---
>>> import AssemblyKeys._
>>> name := "Kinesis Consumer"
>>> version := "1.0"
>>> organization := "com.myconsumer"
>>> scalaVersion := "2.11.5"
>>>
>>> libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
>>> "org.apache.spark" % "spark-streaming_2.10" % "1.3.0"
>>> "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")
>>>
>>> assemblySettings
>>> jarName in assembly :=  "consumer-assembly.jar"
>>> assemblyOption in assembly := (assemblyOption in
>>> assembly).value.copy(includeScala=false)
>>> 
>>>
>>>  In *project/assembly.sbt* I have only the following line:
>>>
>>>  *addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")*
>>>
>>>  I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark
>>> book.
>>>
>>>  Thanks,
>>> Vadim
>>>
>>>
>>
>


Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Tathagata Das
A sort-of-hacky workaround is to use a queueStream, where you can manually
create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
that this is for testing only, as queueStream does not work with driver
fault recovery.

TD
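
A rough sketch of that workaround, for testing only (queueStream provides no driver fault recovery); the paths and batch interval are made up, and sc.textFile stands in for sparkContext.hadoopFile for simplicity:

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))

// The DStream polls this queue, pulling one RDD per batch.
val rddQueue = new mutable.Queue[RDD[String]]()
val logs = ssc.queueStream(rddQueue)
logs.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))
ssc.start()

// Feed the queue manually, e.g. one nested S3 prefix per batch.
for (day <- Seq("2015/03/01", "2015/03/02")) {
  rddQueue.synchronized {
    rddQueue += sc.textFile(s"s3n://mybucket/$day/*log")
  }
}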

On Fri, Apr 3, 2015 at 12:23 PM, adamgerst  wrote:

> So after pulling my hair out for a bit trying to convert one of my standard
> spark jobs to streaming I found that FileInputDStream does not support
> nested folders (see the brief mention here
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
> the fileStream method returns a FileInputDStream).  So before, for my
> standard job, I was reading from say
>
> s3n://mybucket/2015/03/02/*log
>
> And could also modify it to simply get an entire month's worth of logs.
> Since the logs are split up based upon their date, when the batch ran for
> the day, I simply passed in a parameter of the date to make sure I was
> reading the correct data.
>
> But since I want to turn this job into a streaming job I need to simply do
> something like
>
> s3n://mybucket/*log
>
> This would totally work fine if it were a standard spark application, but
> fails for streaming.  Is there any way I can get around this limitation?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark TeraSort source request

2015-04-03 Thread Tom
Hi all,

As we all know, Spark has set the record for sorting data, as published on:
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.

Here at our group, we would love to verify these results and compare
machines using this benchmark. We've spent quite some time trying to find the
TeraSort source code that was used, but cannot find it anywhere.

We did find two candidates: 

A version posted by Reynold [1], the poster of the message above. This
version is stuck at "// TODO: Add partition-local (external) sorting
using TeraSortRecordOrdering", and only generates data.

Here, Ewan noticed that "it didn't appear to be similar to Hadoop TeraSort"
[2]. After this he created a version of his own [3]. With this version, we
noticed problems with TeraValidate on datasets above ~10G (as mentioned by
others at [4]). When examining the raw input and output files, it actually
appears that the input data is sorted and the output data unsorted in both
cases.

Because of this, we believe we have not yet found the source code that was
actually used. I've searched the Spark User forum archives and have seen
requests from other people, indicating a demand, but did not succeed in
finding the actual source code.

My question:
Could you please make the source code of the TeraSort program that was used,
preferably with its settings, available? If not, what are the reasons that
this seems to be withheld?

Thanks for any help,

Tom Hubregtsen 

[1]
https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
[2]
http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
[3] https://github.com/ehiggs/spark-terasort
[4]
http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-TeraSort-source-request-tp22371.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
Thanks. So how do I fix it?

On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan  wrote:

>   spark-streaming-kinesis-asl is not part of the Spark distribution on
> your cluster, so you cannot have it be just a "provided" dependency.  This
> is also why the KCL and its dependencies were not included in the assembly
> (but yes, they should be).
>
>
>  ~ Jonathan Kelly
>
>   From: Vadim Bichutskiy 
> Date: Friday, April 3, 2015 at 12:26 PM
> To: Jonathan Kelly 
> Cc: "user@spark.apache.org" 
> Subject: Re: Spark + Kinesis
>
>   Hi all,
>
>  Good news! I was able to create a Kinesis consumer and assemble it into
> an "uber jar" following
> http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
> and example
> https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
> .
>
>  However when I try to spark-submit it I get the following exception:
>
>  *Exception in thread "main" java.lang.NoClassDefFoundError:
> com/amazonaws/auth/AWSCredentialsProvider*
>
>  Do I need to include KCL dependency in *build.sbt*, here's what it looks
> like currently:
>
>  import AssemblyKeys._
> name := "Kinesis Consumer"
> version := "1.0"
> organization := "com.myconsumer"
> scalaVersion := "2.11.5"
>
>  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl"
> % "1.3.0" % "provided"
>
>  assemblySettings
> jarName in assembly :=  "consumer-assembly.jar"
> assemblyOption in assembly := (assemblyOption in
> assembly).value.copy(includeScala=false)
>
>  Any help appreciated.
>
>  Thanks,
> Vadim
>
> On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan 
> wrote:
>
>>  It looks like you're attempting to mix Scala versions, so that's going
>> to cause some problems.  If you really want to use Scala 2.11.5, you must
>> also use Spark package versions built for Scala 2.11 rather than 2.10.
>> Anyway, that's not quite the correct way to specify Scala dependencies in
>> build.sbt.  Instead of placing the Scala version after the artifactId (like
>> "spark-core_2.10"), what you actually want is to use just "spark-core" with
>> two percent signs before it.  Using two percent signs will make it use the
>> version of Scala that matches your declared scalaVersion.  For example:
>>
>>  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
>> "provided"
>>
>>  libraryDependencies += "org.apache.spark" %% "spark-streaming" %
>> "1.3.0" % "provided"
>>
>>  libraryDependencies += "org.apache.spark" %%
>> "spark-streaming-kinesis-asl" % "1.3.0"
>>
>>  I think that may get you a little closer, though I think you're
>> probably going to run into the same problems I ran into in this thread:
>> https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
>> never really got an answer for that, and I temporarily moved on to other
>> things for now.
>>
>>
>>  ~ Jonathan Kelly
>>
>>   From: 'Vadim Bichutskiy' 
>> Date: Thursday, April 2, 2015 at 9:53 AM
>> To: "user@spark.apache.org" 
>> Subject: Spark + Kinesis
>>
>>   Hi all,
>>
>>  I am trying to write an Amazon Kinesis consumer Scala app that
>> processes data in the
>> Kinesis stream. Is this the correct way to specify *build.sbt*:
>>
>>  ---
>> import AssemblyKeys._
>> name := "Kinesis Consumer"
>> version := "1.0"
>> organization := "com.myconsumer"
>> scalaVersion := "2.11.5"
>>
>> libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
>> "org.apache.spark" % "spark-streaming_2.10" % "1.3.0"
>> "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")
>>
>> assemblySettings
>> jarName in assembly :=  "consumer-assembly.jar"
>> assemblyOption in assembly := (assemblyOption in
>> assembly).value.copy(includeScala=false)
>> 
>>
>>  In *project/assembly.sbt* I have only the following line:
>>
>>  *addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")*
>>
>>  I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book.
>>
>>  Thanks,
>> Vadim
>>
>>
>


Re: Spark + Kinesis

2015-04-03 Thread Kelly, Jonathan
spark-streaming-kinesis-asl is not part of the Spark distribution on your 
cluster, so you cannot have it be just a "provided" dependency.  This is also 
why the KCL and its dependencies were not included in the assembly (but yes, 
they should be).

~ Jonathan Kelly

From: Vadim Bichutskiy 
mailto:vadim.bichuts...@gmail.com>>
Date: Friday, April 3, 2015 at 12:26 PM
To: Jonathan Kelly mailto:jonat...@amazon.com>>
Cc: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: Re: Spark + Kinesis

Hi all,

Good news! I was able to create a Kinesis consumer and assemble it into an 
"uber jar" following 
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and 
example 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala.

However when I try to spark-submit it I get the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: 
com/amazonaws/auth/AWSCredentialsProvider

Do I need to include KCL dependency in build.sbt, here's what it looks like 
currently:

import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % 
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % 
"1.3.0" % "provided"

assemblySettings
jarName in assembly :=  "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in 
assembly).value.copy(includeScala=false)

Any help appreciated.

Thanks,
Vadim


On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan 
mailto:jonat...@amazon.com>> wrote:
It looks like you're attempting to mix Scala versions, so that's going to cause 
some problems.  If you really want to use Scala 2.11.5, you must also use Spark 
package versions built for Scala 2.11 rather than 2.10.  Anyway, that's not 
quite the correct way to specify Scala dependencies in build.sbt.  Instead of 
placing the Scala version after the artifactId (like "spark-core_2.10"), what 
you actually want is to use just "spark-core" with two percent signs before it. 
 Using two percent signs will make it use the version of Scala that matches 
your declared scalaVersion.  For example:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % 
"provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % 
"1.3.0"

I think that may get you a little closer, though I think you're probably going 
to run into the same problems I ran into in this thread: 
https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I never 
really got an answer for that, and I temporarily moved on to other things for 
now.

~ Jonathan Kelly

From: 'Vadim Bichutskiy' 
mailto:vadim.bichuts...@gmail.com>>
Date: Thursday, April 2, 2015 at 9:53 AM
To: "user@spark.apache.org" 
mailto:user@spark.apache.org>>
Subject: Spark + Kinesis

Hi all,

I am trying to write an Amazon Kinesis consumer Scala app that processes data 
in the
Kinesis stream. Is this the correct way to specify build.sbt:

---
import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % 
"provided",
"org.apache.spark" % "spark-streaming_2.10" % "1.3.0"
"org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

assemblySettings
jarName in assembly :=  "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in 
assembly).value.copy(includeScala=false)


In project/assembly.sbt I have only the following line:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book.

Thanks,
Vadim




Re: Simple but faster data streaming

2015-04-03 Thread Tathagata Das
I am afraid not. The whole point of Spark Streaming is to make it easy to
do complicated processing on streaming data while interoperating with core
Spark, MLlib, and SQL without the operational overhead of maintaining 4 different
systems. As a slight cost of achieving that unification, there may be some
overhead compared to specialized systems that are designed to do one specific
thing.

If you have to do something simple that could have been done using Flume,
then the resources needed by the Spark Streaming program shouldn't be too
high. Can you provide more details?

TD

On Thu, Apr 2, 2015 at 11:51 AM, Harut Martirosyan <
harut.martiros...@gmail.com> wrote:

> Hi guys.
>
> Is there a more lightweight way of stream processing with Spark? What we
> want is a simpler way, preferably with no scheduling, which just streams
> the data to multiple destinations.
>
> We extensively use Spark Core, SQL, Streaming, and GraphX, so it's our main
> tool and we don't want to add new things to the stack like Storm or Flume. On
> the other hand, it really takes far more resources for the same streaming
> workload than our previous setup with Flume, especially if we have multiple
> destinations (which triggers multiple actions/scheduling).
>
>
> --
> RGRDZ Harut
>


Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
@Todd,

I had looked at it yesterday. All the dependencies explained there are added on
the DSE node. Do I need to include the Spark and DSE dependencies on the
Zeppelin node?

I built Zeppelin with no Spark and no Hadoop. To my understanding, Zeppelin
will send a request to a remote master at
spark://ec2-54-163-181-25.compute-1.amazonaws.com:7077, which is a DSE box. This
box has all the dependencies. Do I need to install all the dependencies on the
Zeppelin box?

Thanks,
Pawan Venugopal

On Fri, Apr 3, 2015 at 12:08 PM, Todd Nist  wrote:

> @Pawan,
>
> So it's been a couple of months since I have had a chance to do anything
> with Zeppelin, but here is a link to a post on what I did to get it working
> https://groups.google.com/forum/#!topic/zeppelin-developers/mCNdyOXNikI.
> This may or may not work with the newer releases from Zeppelin.
>
> -Todd
>
> On Fri, Apr 3, 2015 at 3:02 PM, pawan kumar  wrote:
>
>> Hi Todd,
>>
>> Thanks for the help. So i was able to get the DSE working with tableau as
>> per the link provided by Mohammed. Now i trying to figure out if i could
>> write sparksql queries from tableau and get data from DSE. My end goal is
>> to get a web based tool where i could write sql queries which will pull
>> data from cassandra.
>>
>> With Zeppelin I was able to build and run it in EC2 but not sure if
>> configurations are right. I am pointing to a spark master which is a remote
>> DSE node and all spark and sparksql dependencies are in the remote node. I
>> am not sure if i need to install spark and its dependencies in the webui
>> (zepplene) node.
>>
>> I am not sure talking about zepplelin in this thread is right.
>>
>> Thanks once again for all the help.
>>
>> Thanks,
>> Pawan Venugopal
>>
>>
>> On Fri, Apr 3, 2015 at 11:48 AM, Todd Nist  wrote:
>>
>>> @Pawan
>>>
>>> Not sure if you have seen this or not, but here is a good example by
>>> Jonathan Lacefield of Datastax's on hooking up sparksql with DSE, adding
>>> Tableau is as simple as Mohammed stated with DSE.
>>> https://github.com/jlacefie/sparksqltest.
>>>
>>> HTH,
>>> Todd
>>>
>>> On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist  wrote:
>>>
 Hi Mohammed,

 Not sure if you have tried this or not.  You could try using the below
 api to start the thriftserver with an existing context.


 https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

 The one thing that Michael Ambrust @ databrick recommended was this:

> You can start a JDBC server with an existing context.  See my answer
> here:
> http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

 So something like this based on example from Cheng Lian:

 *Server*

 import  org.apache.spark.sql.hive.HiveContext
 import  org.apache.spark.sql.catalyst.types._

 val  sparkContext  =  sc
 import  sparkContext._
 val  sqlContext  =  new  HiveContext(sparkContext)
 import  sqlContext._
 makeRDD((1,"hello") :: (2,"world") 
 ::Nil).toSchemaRDD.cache().registerTempTable("t")
 // replace the above with the C* + spark-casandra-connectore to generate 
 SchemaRDD and registerTempTable

 import  org.apache.spark.sql.hive.thriftserver._
 HiveThriftServer2.startWithContext(sqlContext)

 Then Startup

 ./bin/beeline -u jdbc:hive2://localhost:1/default
 0: jdbc:hive2://localhost:1/default> select * from t;


 I have not tried this yet from Tableau.   My understanding is that the
 tempTable is only valid as long as the sqlContext is, so if one terminates
 the code representing the *Server*, and then restarts the standard
 thrift server, sbin/start-thriftserver ..., the table won't be available.

 Another possibility is to perhaps use the tuplejump cash project,
 https://github.com/tuplejump/cash.

 HTH.

 -Todd

 On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar  wrote:

> Thanks mohammed. Will give it a try today. We would also need the
> sparksSQL piece as we are migrating our data store from oracle to C* and 
> it
> would be easier to maintain all the reports rather recreating each one 
> from
> scratch.
>
> Thanks,
> Pawan Venugopal.
> On Apr 3, 2015 7:59 AM, "Mohammed Guller" 
> wrote:
>
>>  Hi Todd,
>>
>>
>>
>> We are using Apache C* 2.1.3, not DSE. We got Tableau to work
>> directly with C* using the ODBC driver, but now would like to add Spark 
>> SQL
>> to the mix. I haven’t been able to find any documentation for how to make
>> this combination work.
>>
>>
>>
>> We are using the Spark-Cassandra-Connector in our applications, but
>> haven’t been able to figure out how to get the Spark SQL Thrift Server to
>> use it and connect to C*. That is the missing pie

Re: Spark + Kinesis

2015-04-03 Thread Vadim Bichutskiy
Hi all,

Good news! I was able to create a Kinesis consumer and assemble it into an
"uber jar" following
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and
example
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
.

However when I try to spark-submit it I get the following exception:

*Exception in thread "main" java.lang.NoClassDefFoundError:
com/amazonaws/auth/AWSCredentialsProvider*

Do I need to include the KCL dependency in *build.sbt*? Here's what it looks
like currently:

import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl"
% "1.3.0" % "provided"

assemblySettings
jarName in assembly :=  "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in
assembly).value.copy(includeScala=false)

Any help appreciated.

Thanks,
Vadim
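
As the replies in this thread point out, the NoClassDefFoundError above comes from marking spark-streaming-kinesis-asl as "provided". Purely as a sketch, the dependency section then ends up looking roughly like this, with Spark core/streaming still provided by the cluster and the Kinesis connector (and, transitively, the KCL and AWS SDK) bundled into the assembly:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                  % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming"             % "1.3.0" % "provided",
  // not "provided": must ship inside the uber jar together with its
  // transitive KCL / AWS SDK dependencies
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
)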

On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan  wrote:

>   It looks like you're attempting to mix Scala versions, so that's going
> to cause some problems.  If you really want to use Scala 2.11.5, you must
> also use Spark package versions built for Scala 2.11 rather than 2.10.
> Anyway, that's not quite the correct way to specify Scala dependencies in
> build.sbt.  Instead of placing the Scala version after the artifactId (like
> "spark-core_2.10"), what you actually want is to use just "spark-core" with
> two percent signs before it.  Using two percent signs will make it use the
> version of Scala that matches your declared scalaVersion.  For example:
>
>  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
> "provided"
>
>  libraryDependencies += "org.apache.spark" %% "spark-streaming" %
> "1.3.0" % "provided"
>
>  libraryDependencies += "org.apache.spark" %%
> "spark-streaming-kinesis-asl" % "1.3.0"
>
>  I think that may get you a little closer, though I think you're probably
> going to run into the same problems I ran into in this thread:
> https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I never
> really got an answer for that, and I temporarily moved on to other things
> for now.
>
>
>  ~ Jonathan Kelly
>
>   From: 'Vadim Bichutskiy' 
> Date: Thursday, April 2, 2015 at 9:53 AM
> To: "user@spark.apache.org" 
> Subject: Spark + Kinesis
>
>   Hi all,
>
>  I am trying to write an Amazon Kinesis consumer Scala app that processes
> data in the
> Kinesis stream. Is this the correct way to specify *build.sbt*:
>
>  ---
> import AssemblyKeys._
> name := "Kinesis Consumer"
> version := "1.0"
> organization := "com.myconsumer"
> scalaVersion := "2.11.5"
>
> libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
> "org.apache.spark" % "spark-streaming_2.10" % "1.3.0"
> "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")
>
> assemblySettings
> jarName in assembly :=  "consumer-assembly.jar"
> assemblyOption in assembly := (assemblyOption in
> assembly).value.copy(includeScala=false)
> 
>
>  In *project/assembly.sbt* I have only the following line:
>
>  *addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")*
>
>  I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book.
>
>  Thanks,
> Vadim
>
>


Spark Streaming FileStream Nested File Support

2015-04-03 Thread adamgerst
So after pulling my hair out for a bit trying to convert one of my standard
spark jobs to streaming I found that FileInputDStream does not support
nested folders (see the brief mention here
http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
the fileStream method returns a FileInputDStream).  So before, for my
standard job, I was reading from say

s3n://mybucket/2015/03/02/*log

And could also modify it to simply get an entire month's worth of logs.
Since the logs are split up based upon their date, when the batch ran for
the day, I simply passed in a parameter of the date to make sure I was
reading the correct data.

But since I want to turn this job into a streaming job I need to simply do
something like

s3n://mybucket/*log

This would totally work fine if it were a standard spark application, but
fails for streaming.  Is there any way I can get around this limitation?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
@Pawan,

So it's been a couple of months since I have had a chance to do anything
with Zeppelin, but here is a link to a post on what I did to get it working
https://groups.google.com/forum/#!topic/zeppelin-developers/mCNdyOXNikI.
This may or may not work with the newer releases from Zeppelin.

-Todd

On Fri, Apr 3, 2015 at 3:02 PM, pawan kumar  wrote:

> Hi Todd,
>
> Thanks for the help. So i was able to get the DSE working with tableau as
> per the link provided by Mohammed. Now i trying to figure out if i could
> write sparksql queries from tableau and get data from DSE. My end goal is
> to get a web based tool where i could write sql queries which will pull
> data from cassandra.
>
> With Zeppelin I was able to build and run it in EC2 but not sure if
> configurations are right. I am pointing to a spark master which is a remote
> DSE node and all spark and sparksql dependencies are in the remote node. I
> am not sure if i need to install spark and its dependencies in the webui
> (zepplene) node.
>
> I am not sure talking about zepplelin in this thread is right.
>
> Thanks once again for all the help.
>
> Thanks,
> Pawan Venugopal
>
>
> On Fri, Apr 3, 2015 at 11:48 AM, Todd Nist  wrote:
>
>> @Pawan
>>
>> Not sure if you have seen this or not, but here is a good example by
>> Jonathan Lacefield of Datastax's on hooking up sparksql with DSE, adding
>> Tableau is as simple as Mohammed stated with DSE.
>> https://github.com/jlacefie/sparksqltest.
>>
>> HTH,
>> Todd
>>
>> On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist  wrote:
>>
>>> Hi Mohammed,
>>>
>>> Not sure if you have tried this or not.  You could try using the below
>>> api to start the thriftserver with an existing context.
>>>
>>>
>>> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42
>>>
>>> The one thing that Michael Ambrust @ databrick recommended was this:
>>>
 You can start a JDBC server with an existing context.  See my answer
 here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html
>>>
>>> So something like this based on example from Cheng Lian:
>>>
>>> *Server*
>>>
>>> import  org.apache.spark.sql.hive.HiveContext
>>> import  org.apache.spark.sql.catalyst.types._
>>>
>>> val  sparkContext  =  sc
>>> import  sparkContext._
>>> val  sqlContext  =  new  HiveContext(sparkContext)
>>> import  sqlContext._
>>> makeRDD((1,"hello") :: (2,"world") 
>>> ::Nil).toSchemaRDD.cache().registerTempTable("t")
>>> // replace the above with the C* + spark-casandra-connectore to generate 
>>> SchemaRDD and registerTempTable
>>>
>>> import  org.apache.spark.sql.hive.thriftserver._
>>> HiveThriftServer2.startWithContext(sqlContext)
>>>
>>> Then Startup
>>>
>>> ./bin/beeline -u jdbc:hive2://localhost:1/default
>>> 0: jdbc:hive2://localhost:1/default> select * from t;
>>>
>>>
>>> I have not tried this yet from Tableau.   My understanding is that the
>>> tempTable is only valid as long as the sqlContext is, so if one terminates
>>> the code representing the *Server*, and then restarts the standard
>>> thrift server, sbin/start-thriftserver ..., the table won't be available.
>>>
>>> Another possibility is to perhaps use the tuplejump cash project,
>>> https://github.com/tuplejump/cash.
>>>
>>> HTH.
>>>
>>> -Todd
>>>
>>> On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar  wrote:
>>>
 Thanks mohammed. Will give it a try today. We would also need the
 sparksSQL piece as we are migrating our data store from oracle to C* and it
 would be easier to maintain all the reports rather recreating each one from
 scratch.

 Thanks,
 Pawan Venugopal.
 On Apr 3, 2015 7:59 AM, "Mohammed Guller" 
 wrote:

>  Hi Todd,
>
>
>
> We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
> with C* using the ODBC driver, but now would like to add Spark SQL to the
> mix. I haven’t been able to find any documentation for how to make this
> combination work.
>
>
>
> We are using the Spark-Cassandra-Connector in our applications, but
> haven’t been able to figure out how to get the Spark SQL Thrift Server to
> use it and connect to C*. That is the missing piece. Once we solve that
> piece of the puzzle then Tableau should be able to see the tables in C*.
>
>
>
> Hi Pawan,
>
> Tableau + C* is pretty straight forward, especially if you are using
> DSE. Create a new DSN in Tableau using the ODBC driver that comes with 
> DSE.
> Once you connect, Tableau allows to use C* keyspace as schema and column
> families as tables.
>
>
>
> Mohammed
>
>
>
> *From:* pawan kumar [mailto:pkv...@gmail.com]
> *Sent:* Friday, April 3, 2015 7:41 AM
> *To:* Todd Nist
> *Cc:* user@spark.apache.org; Mohammed Guller
> *Subject:* Re: Tableau + Spark SQL Thrift Server +

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Hi Todd,

Thanks for the help. So I was able to get DSE working with Tableau as
per the link provided by Mohammed. Now I am trying to figure out if I could
write Spark SQL queries from Tableau and get data from DSE. My end goal is
to get a web-based tool where I could write SQL queries that pull
data from Cassandra.

With Zeppelin, I was able to build and run it in EC2 but am not sure if the
configuration is right. I am pointing to a Spark master which is a remote
DSE node, and all Spark and Spark SQL dependencies are on the remote node. I
am not sure if I need to install Spark and its dependencies on the web UI
(Zeppelin) node.

I am not sure talking about Zeppelin in this thread is right.

Thanks once again for all the help.

Thanks,
Pawan Venugopal


On Fri, Apr 3, 2015 at 11:48 AM, Todd Nist  wrote:

> @Pawan
>
> Not sure if you have seen this or not, but here is a good example by
> Jonathan Lacefield of Datastax's on hooking up sparksql with DSE, adding
> Tableau is as simple as Mohammed stated with DSE.
> https://github.com/jlacefie/sparksqltest.
>
> HTH,
> Todd
>
> On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist  wrote:
>
>> Hi Mohammed,
>>
>> Not sure if you have tried this or not.  You could try using the below
>> api to start the thriftserver with an existing context.
>>
>>
>> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42
>>
>> The one thing that Michael Ambrust @ databrick recommended was this:
>>
>>> You can start a JDBC server with an existing context.  See my answer
>>> here:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html
>>
>> So something like this based on example from Cheng Lian:
>>
>> *Server*
>>
>> import  org.apache.spark.sql.hive.HiveContext
>> import  org.apache.spark.sql.catalyst.types._
>>
>> val  sparkContext  =  sc
>> import  sparkContext._
>> val  sqlContext  =  new  HiveContext(sparkContext)
>> import  sqlContext._
>> makeRDD((1,"hello") :: (2,"world") 
>> ::Nil).toSchemaRDD.cache().registerTempTable("t")
>> // replace the above with the C* + spark-casandra-connectore to generate 
>> SchemaRDD and registerTempTable
>>
>> import  org.apache.spark.sql.hive.thriftserver._
>> HiveThriftServer2.startWithContext(sqlContext)
>>
>> Then Startup
>>
>> ./bin/beeline -u jdbc:hive2://localhost:1/default
>> 0: jdbc:hive2://localhost:1/default> select * from t;
>>
>>
>> I have not tried this yet from Tableau.   My understanding is that the
>> tempTable is only valid as long as the sqlContext is, so if one terminates
>> the code representing the *Server*, and then restarts the standard
>> thrift server, sbin/start-thriftserver ..., the table won't be available.
>>
>> Another possibility is to perhaps use the tuplejump cash project,
>> https://github.com/tuplejump/cash.
>>
>> HTH.
>>
>> -Todd
>>
>> On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar  wrote:
>>
>>> Thanks mohammed. Will give it a try today. We would also need the
>>> sparksSQL piece as we are migrating our data store from oracle to C* and it
>>> would be easier to maintain all the reports rather recreating each one from
>>> scratch.
>>>
>>> Thanks,
>>> Pawan Venugopal.
>>> On Apr 3, 2015 7:59 AM, "Mohammed Guller" 
>>> wrote:
>>>
  Hi Todd,



 We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
 with C* using the ODBC driver, but now would like to add Spark SQL to the
 mix. I haven’t been able to find any documentation for how to make this
 combination work.



 We are using the Spark-Cassandra-Connector in our applications, but
 haven’t been able to figure out how to get the Spark SQL Thrift Server to
 use it and connect to C*. That is the missing piece. Once we solve that
 piece of the puzzle then Tableau should be able to see the tables in C*.



 Hi Pawan,

 Tableau + C* is pretty straight forward, especially if you are using
 DSE. Create a new DSN in Tableau using the ODBC driver that comes with DSE.
 Once you connect, Tableau allows to use C* keyspace as schema and column
 families as tables.



 Mohammed



 *From:* pawan kumar [mailto:pkv...@gmail.com]
 *Sent:* Friday, April 3, 2015 7:41 AM
 *To:* Todd Nist
 *Cc:* user@spark.apache.org; Mohammed Guller
 *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra



 Hi Todd,

 Thanks for the link. I would be interested in this solution. I am using
 DSE for cassandra. Would you provide me with info on connecting with DSE
 either through Tableau or zeppelin. The goal here is query cassandra
 through spark sql so that I could perform joins and groupby on my queries.
 Are you able to perform spark sql queries with tableau?

 Thanks,
 Pawan Venugopal

 On Apr 3, 2015 5:03 AM, "Todd Nist"  wrote:

>

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
@Pawan

Not sure if you have seen this or not, but here is a good example by
Jonathan Lacefield of Datastax on hooking up Spark SQL with DSE; adding
Tableau is as simple as Mohammed stated with DSE.
https://github.com/jlacefie/sparksqltest.

HTH,
Todd

On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist  wrote:

> Hi Mohammed,
>
> Not sure if you have tried this or not.  You could try using the below api
> to start the thriftserver with an existing context.
>
>
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42
>
> The one thing that Michael Ambrust @ databrick recommended was this:
>
>> You can start a JDBC server with an existing context.  See my answer
>> here:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html
>
> So something like this based on example from Cheng Lian:
>
> *Server*
>
> import  org.apache.spark.sql.hive.HiveContext
> import  org.apache.spark.sql.catalyst.types._
>
> val  sparkContext  =  sc
> import  sparkContext._
> val  sqlContext  =  new  HiveContext(sparkContext)
> import  sqlContext._
> makeRDD((1,"hello") :: (2,"world") 
> ::Nil).toSchemaRDD.cache().registerTempTable("t")
> // replace the above with the C* + spark-casandra-connectore to generate 
> SchemaRDD and registerTempTable
>
> import  org.apache.spark.sql.hive.thriftserver._
> HiveThriftServer2.startWithContext(sqlContext)
>
> Then Startup
>
> ./bin/beeline -u jdbc:hive2://localhost:1/default
> 0: jdbc:hive2://localhost:1/default> select * from t;
>
>
> I have not tried this yet from Tableau.   My understanding is that the
> tempTable is only valid as long as the sqlContext is, so if one terminates
> the code representing the *Server*, and then restarts the standard thrift
> server, sbin/start-thriftserver ..., the table won't be available.
>
> Another possibility is to perhaps use the tuplejump cash project,
> https://github.com/tuplejump/cash.
>
> HTH.
>
> -Todd
>
> On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar  wrote:
>
>> Thanks mohammed. Will give it a try today. We would also need the
>> sparksSQL piece as we are migrating our data store from oracle to C* and it
>> would be easier to maintain all the reports rather recreating each one from
>> scratch.
>>
>> Thanks,
>> Pawan Venugopal.
>> On Apr 3, 2015 7:59 AM, "Mohammed Guller"  wrote:
>>
>>>  Hi Todd,
>>>
>>>
>>>
>>> We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
>>> with C* using the ODBC driver, but now would like to add Spark SQL to the
>>> mix. I haven’t been able to find any documentation for how to make this
>>> combination work.
>>>
>>>
>>>
>>> We are using the Spark-Cassandra-Connector in our applications, but
>>> haven’t been able to figure out how to get the Spark SQL Thrift Server to
>>> use it and connect to C*. That is the missing piece. Once we solve that
>>> piece of the puzzle then Tableau should be able to see the tables in C*.
>>>
>>>
>>>
>>> Hi Pawan,
>>>
>>> Tableau + C* is pretty straight forward, especially if you are using
>>> DSE. Create a new DSN in Tableau using the ODBC driver that comes with DSE.
>>> Once you connect, Tableau allows to use C* keyspace as schema and column
>>> families as tables.
>>>
>>>
>>>
>>> Mohammed
>>>
>>>
>>>
>>> *From:* pawan kumar [mailto:pkv...@gmail.com]
>>> *Sent:* Friday, April 3, 2015 7:41 AM
>>> *To:* Todd Nist
>>> *Cc:* user@spark.apache.org; Mohammed Guller
>>> *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra
>>>
>>>
>>>
>>> Hi Todd,
>>>
>>> Thanks for the link. I would be interested in this solution. I am using
>>> DSE for cassandra. Would you provide me with info on connecting with DSE
>>> either through Tableau or zeppelin. The goal here is query cassandra
>>> through spark sql so that I could perform joins and groupby on my queries.
>>> Are you able to perform spark sql queries with tableau?
>>>
>>> Thanks,
>>> Pawan Venugopal
>>>
>>> On Apr 3, 2015 5:03 AM, "Todd Nist"  wrote:
>>>
>>> What version of Cassandra are you using?  Are you using DSE or the stock
>>> Apache Cassandra version?  I have connected it with DSE, but have not
>>> attempted it with the standard Apache Cassandra version.
>>>
>>>
>>>
>>> FWIW,
>>> http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
>>> provides an ODBC driver for accessing C* from Tableau.  Granted it does not
>>> provide all the goodness of Spark.  Are you attempting to leverage the
>>> spark-cassandra-connector for this?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller 
>>> wrote:
>>>
>>> Hi –
>>>
>>>
>>>
>>> Is anybody using Tableau to analyze data in Cassandra through the Spark
>>> SQL Thrift Server?
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>> Mohammed
>>>
>>>
>>>
>>>
>>>
>>
>


variant record by case classes in shell fails?

2015-04-03 Thread Michael Albert
Greetings!
For me, the code below fails from the shell. However, I can do essentially the
same from compiled code, exporting the jar.
If I use default serialization or kryo with reference tracking, the error
message tells me it can't find the constructor for "A". If I use kryo without
reference tracking, I get a stack overflow.
I'm using Spark 1.2.1 on AWS EMR (hadoop 2.4).
I've also tried putting this code inside an object.
Is this just me?  Am I overlooking something obvious?
Thanks!
-Mike
:paste
sealed class AorB
case class A(i: Int) extends AorB
case class B(i: Int, j: Int) extends AorB

sc.parallelize(0.until(1)).map { _ =>
  val x = A(1)
  x
}.collect()



Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
Hi Mohammed,

Not sure if you have tried this or not.  You could try using the below api
to start the thriftserver with an existing context.

https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42

The one thing that Michael Armbrust @ Databricks recommended was this:

> You can start a JDBC server with an existing context.  See my answer here:
> http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html

So something like this based on example from Cheng Lian:

*Server*

import  org.apache.spark.sql.hive.HiveContext
import  org.apache.spark.sql.catalyst.types._

val  sparkContext  =  sc
import  sparkContext._
val  sqlContext  =  new  HiveContext(sparkContext)
import  sqlContext._
makeRDD((1,"hello") :: (2,"world")
::Nil).toSchemaRDD.cache().registerTempTable("t")
// replace the above with the C* + spark-cassandra-connector to
// generate a SchemaRDD and registerTempTable

import  org.apache.spark.sql.hive.thriftserver._
HiveThriftServer2.startWithContext(sqlContext)

Then Startup

./bin/beeline -u jdbc:hive2://localhost:1/default
0: jdbc:hive2://localhost:1/default> select * from t;


I have not tried this yet from Tableau.   My understanding is that the
tempTable is only valid as long as the sqlContext is, so if one terminates
the code representing the *Server*, and then restarts the standard thrift
server, sbin/start-thriftserver ..., the table won't be available.
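
As a purely illustrative sketch of the "replace the above with the C* + spark-cassandra-connector" comment, assuming the DataStax spark-cassandra-connector is on the classpath and a hypothetical keyspace "ks" with a table "users(id int, name text)", the registration could follow the same SchemaRDD pattern:

import com.datastax.spark.connector._

// Hypothetical table; adjust keyspace/table/columns to the real schema.
case class User(id: Int, name: String)

val users = sc.cassandraTable("ks", "users")
  .map(row => User(row.getInt("id"), row.getString("name")))

// Same implicit SchemaRDD conversion and sqlContext as in the snippet above
// (import sqlContext._).
users.toSchemaRDD.cache().registerTempTable("users")

HiveThriftServer2.startWithContext(sqlContext)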

Another possibility is to perhaps use the tuplejump cash project,
https://github.com/tuplejump/cash.

HTH.

-Todd

On Fri, Apr 3, 2015 at 11:11 AM, pawan kumar  wrote:

> Thanks mohammed. Will give it a try today. We would also need the
> sparksSQL piece as we are migrating our data store from oracle to C* and it
> would be easier to maintain all the reports rather recreating each one from
> scratch.
>
> Thanks,
> Pawan Venugopal.
> On Apr 3, 2015 7:59 AM, "Mohammed Guller"  wrote:
>
>>  Hi Todd,
>>
>>
>>
>> We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
>> with C* using the ODBC driver, but now would like to add Spark SQL to the
>> mix. I haven’t been able to find any documentation for how to make this
>> combination work.
>>
>>
>>
>> We are using the Spark-Cassandra-Connector in our applications, but
>> haven’t been able to figure out how to get the Spark SQL Thrift Server to
>> use it and connect to C*. That is the missing piece. Once we solve that
>> piece of the puzzle then Tableau should be able to see the tables in C*.
>>
>>
>>
>> Hi Pawan,
>>
>> Tableau + C* is pretty straight forward, especially if you are using DSE.
>> Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once
>> you connect, Tableau allows to use C* keyspace as schema and column
>> families as tables.
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* pawan kumar [mailto:pkv...@gmail.com]
>> *Sent:* Friday, April 3, 2015 7:41 AM
>> *To:* Todd Nist
>> *Cc:* user@spark.apache.org; Mohammed Guller
>> *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra
>>
>>
>>
>> Hi Todd,
>>
>> Thanks for the link. I would be interested in this solution. I am using
>> DSE for cassandra. Would you provide me with info on connecting with DSE
>> either through Tableau or zeppelin. The goal here is query cassandra
>> through spark sql so that I could perform joins and groupby on my queries.
>> Are you able to perform spark sql queries with tableau?
>>
>> Thanks,
>> Pawan Venugopal
>>
>> On Apr 3, 2015 5:03 AM, "Todd Nist"  wrote:
>>
>> What version of Cassandra are you using?  Are you using DSE or the stock
>> Apache Cassandra version?  I have connected it with DSE, but have not
>> attempted it with the standard Apache Cassandra version.
>>
>>
>>
>> FWIW,
>> http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
>> provides an ODBC driver for accessing C* from Tableau.  Granted it does not
>> provide all the goodness of Spark.  Are you attempting to leverage the
>> spark-cassandra-connector for this?
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller 
>> wrote:
>>
>> Hi –
>>
>>
>>
>> Is anybody using Tableau to analyze data in Cassandra through the Spark
>> SQL Thrift Server?
>>
>>
>>
>> Thanks!
>>
>>
>>
>> Mohammed
>>
>>
>>
>>
>>
>


Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
Maybe that should be marked as waiting as well. We plan to update the UI
soon, so we will keep that in mind.
On Apr 3, 2015 10:12 AM, "Ted Yu"  wrote:

> Maybe add another stat for batches waiting in the job queue ?
>
> Cheers
>
> On Fri, Apr 3, 2015 at 10:01 AM, Tathagata Das 
> wrote:
>
>> Very good question! This is because the current code is written such that
>> the UI considers a batch as waiting only when it has actually started being
>> processed. That is, batches waiting in the job queue are not considered in the
>> calculation. It is arguable that it may be more intuitive to count those in
>> the waiting as well.
>> On Apr 3, 2015 12:59 AM, "bit1...@163.com"  wrote:
>>
>>>
>>> I copied the following from the spark streaming UI, I don't know why the
>>> Waiting batches is 1, my understanding is that it should be 72.
>>> Following  is my understanding:
>>> 1. Total time is 1 minute 35 seconds = 95 seconds
>>> 2. Batch interval is 1 second, so, 95 batches are generated in 95
>>> seconds.
>>> 3. Processed batches are 23(Correct, because in my processing code, it
>>> does nothing but sleep 4 seconds)
>>> 4. Then the waiting batches should be 95-23=72
>>>
>>>
>>>
>>>- *Started at: * Fri Apr 03 15:17:47 CST 2015
>>>- *Time since start: *1 minute 35 seconds
>>>- *Network receivers: *1
>>>- *Batch interval: *1 second
>>>- *Processed batches: *23
>>>- *Waiting batches: *1
>>>- *Received records: *0
>>>- *Processed records: *0
>>>
>>>
>>> --
>>> bit1...@163.com
>>>
>>
>


Re: Matei Zaharai: Reddit Ask Me Anything

2015-04-03 Thread ben lorica
Happening right now

   
https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Matei-Zaharai-Reddit-Ask-Me-Anything-tp22364p22369.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Ted Yu
Maybe add another stat for batches waiting in the job queue ?

Cheers

On Fri, Apr 3, 2015 at 10:01 AM, Tathagata Das  wrote:

> Very good question! This is because the current code is written such that
> the UI considers a batch as waiting only when it has actually started being
> processed. That is, batches waiting in the job queue are not considered in the
> calculation. It is arguable that it may be more intuitive to count those in
> the waiting as well.
> On Apr 3, 2015 12:59 AM, "bit1...@163.com"  wrote:
>
>>
>> I copied the following from the spark streaming UI, I don't know why the
>> Waiting batches is 1, my understanding is that it should be 72.
>> Following  is my understanding:
>> 1. Total time is 1 minute 35 seconds = 95 seconds
>> 2. Batch interval is 1 second, so, 95 batches are generated in 95 seconds.
>> 3. Processed batches are 23(Correct, because in my processing code, it
>> does nothing but sleep 4 seconds)
>> 4. Then the waiting batches should be 95-23=72
>>
>>
>>
>>- *Started at: * Fri Apr 03 15:17:47 CST 2015
>>- *Time since start: *1 minute 35 seconds
>>- *Network receivers: *1
>>- *Batch interval: *1 second
>>- *Processed batches: *23
>>- *Waiting batches: *1
>>- *Received records: *0
>>- *Processed records: *0
>>
>>
>> --
>> bit1...@163.com
>>
>


Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
Very good question! This is because the current code is written such that
the UI considers a batch as waiting only when it has actually started being
processed. That is, batches waiting in the job queue are not considered in the
calculation. It is arguable that it may be more intuitive to count that in
the waiting as well.
On Apr 3, 2015 12:59 AM, "bit1...@163.com"  wrote:

>
> I copied the following from the spark streaming UI, I don't know why the
> Waiting batches is 1, my understanding is that it should be 72.
> Following  is my understanding:
> 1. Total time is 1 minute 35 seconds = 95 seconds
> 2. Batch interval is 1 second, so, 95 batches are generated in 95 seconds.
> 3. Processed batches are 23(Correct, because in my processing code, it
> does nothing but sleep 4 seconds)
> 4. Then the waiting batches should be 95-23=72
>
>
>
>- *Started at: * Fri Apr 03 15:17:47 CST 2015
>- *Time since start: *1 minute 35 seconds
>- *Network receivers: *1
>- *Batch interval: *1 second
>- *Processed batches: *23
>- *Waiting batches: *1
>- *Received records: *0
>- *Processed records: *0
>
>
> --
> bit1...@163.com
>


MLlib: save models to HDFS?

2015-04-03 Thread S. Zhou
I am new to MLlib so I have a basic question: is it possible to save MLlib
models (particularly CF models) to HDFS and then reload them later? If yes, could
you share some sample code (I could not find it in the MLlib tutorial)? Thanks!
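
One rough sketch of a workaround, assuming a MatrixFactorizationModel from ALS and that its
public constructor is available in your version (newer releases also add save/load helpers
for some models), is to persist the factor RDDs directly and rebuild the model on load:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Sketch only: persist the ALS factor RDDs to HDFS and rebuild the model later.
// Paths and the public-constructor usage are assumptions, not an official API.
def saveModel(model: MatrixFactorizationModel, path: String): Unit = {
  model.userFeatures.saveAsObjectFile(path + "/userFeatures")
  model.productFeatures.saveAsObjectFile(path + "/productFeatures")
}

def loadModel(sc: SparkContext, rank: Int, path: String): MatrixFactorizationModel = {
  val userFeatures = sc.objectFile[(Int, Array[Double])](path + "/userFeatures")
  val productFeatures = sc.objectFile[(Int, Array[Double])](path + "/productFeatures")
  new MatrixFactorizationModel(rank, userFeatures, productFeatures)
}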

RE: Reading a large file (binary) into RDD

2015-04-03 Thread java8964
Hadoop TextInputFormat is a good start.
It is not really that hard. You just need to implement the logic to identify
the record delimiter, and think of a logical way to represent the key/value pair for
your RecordReader.
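
For what it's worth, a minimal sketch of the "whole file as one record" starting point
(class names here are made up, and a real implementation would allow splits and parse
one record per nextKeyValue()):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Reads each (non-splitable) file as a single BytesWritable record; parsing into
// vertices/edges then happens in normal Spark transformations.
class WholeBinaryInputFormat extends FileInputFormat[NullWritable, BytesWritable] {
  override def isSplitable(context: JobContext, file: Path): Boolean = false
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
    : RecordReader[NullWritable, BytesWritable] = new WholeBinaryRecordReader
}

class WholeBinaryRecordReader extends RecordReader[NullWritable, BytesWritable] {
  private var value: BytesWritable = _
  private var done = false

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val fileSplit = split.asInstanceOf[FileSplit]
    val fs = fileSplit.getPath.getFileSystem(context.getConfiguration)
    val in = fs.open(fileSplit.getPath)
    try {
      // Assumes the file fits comfortably in memory on one executor.
      val bytes = new Array[Byte](fileSplit.getLength.toInt)
      in.readFully(bytes)
      value = new BytesWritable(bytes)
    } finally in.close()
  }

  override def nextKeyValue(): Boolean = if (done) false else { done = true; true }
  override def getCurrentKey: NullWritable = NullWritable.get()
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (done) 1.0f else 0.0f
  override def close(): Unit = ()
}

// usage: sc.newAPIHadoopFile("hdfs:///path/graph.bin",
//   classOf[WholeBinaryInputFormat], classOf[NullWritable], classOf[BytesWritable])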
Yong

From: kvi...@vt.edu
Date: Fri, 3 Apr 2015 11:41:13 -0400
Subject: Re: Reading a large file (binary) into RDD
To: deanwamp...@gmail.com
CC: java8...@hotmail.com; user@spark.apache.org

Thanks everyone for the inputs.
I guess I will try out a custom implementation of InputFormat. But I have no 
idea where to start. Are there any code examples of this that might help?
On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler  wrote:
This might be overkill for your needs, but the scodec parser combinator library 
might be useful for creating a parser.
https://github.com/scodec/scodec
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964  wrote:



I think implementing your own InputFormat and using SparkContext.hadoopFile() 
is the best option for your case.
Yong

From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org

The file has a specific structure. I outline it below.
The input file is basically a representation of a graph.

INT
INT (A)
LONG (B)
A INTs (Degrees)
A SHORTINTs (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (note that the INTs/SHORTINTs associated with this are edge attributes)
After reading in the file, I need to create two RDDs (one with vertices and the 
other with edges)
On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman  wrote:
Hm, that will indeed be trickier because this method assumes records are the 
same byte size. Is the file an arbitrary sequence of mixed types, or is there 
structure, e.g. short, long, short, long, etc.? 
If you could post a gist with an example of the kind of file and how it should 
look once read in that would be useful!


-
jeremyfreeman.net
@thefreemanlab



On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan  wrote:
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of 
short and long integers. Is there any other way that could be of use here?
My current method happens to have a large overhead (much more than actual 
computation time). Also, I am short of memory at the driver when it has to read 
the entire file.
On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman  wrote:
If it’s a flat binary file and each record is the same length (in bytes), you 
can use Spark’s binaryRecords method (defined on the SparkContext), which loads 
records from one or more large flat binary files into an RDD. Here’s an example 
in python to show how it works:
# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]
Does that help?
-
jeremyfreeman.net
@thefreemanlab


On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan  wrote:
What are some efficient ways to read a large file into RDDs?
For example, have several executors read a specific/unique portion of the file 
and construct RDDs. Is this possible to do in Spark?
Currently, I am doing a line-by-line read of the file at the driver and 
constructing the RDD.





  



  

Spark Memory Utilities

2015-04-03 Thread Stephen Carman
I noticed Spark has some nice memory tracking estimators in it, but they are
private. We have some custom implementations of RDD and PairRDD to suit our
internal needs, and it would be fantastic if we were able to just leverage the
memory estimates that already exist in Spark.

Is there any chance they can be made public inside the library, or that some
interface to them could be exposed so that child classes can make use of them?
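
One hedged workaround in the meantime (fragile across Spark upgrades, shown only as a
sketch) is to expose the package-private estimator from a helper placed in a matching
package:

// Declared under org.apache.spark.util so the private[spark] SizeEstimator is visible.
package org.apache.spark.util

object SizeEstimatorAccess {
  // Rough deep-size estimate of an object graph, as used internally by Spark.
  def estimate(obj: AnyRef): Long = SizeEstimator.estimate(obj)
}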

Thanks,

Stephen Carman, M.S.
AI Engineer, Coldlight Solutions, LLC
Cell - 267 240 0363
This e-mail is intended solely for the above-mentioned recipient and it may 
contain confidential or privileged information. If you have received it in 
error, please notify us immediately and delete the e-mail. You must not copy, 
distribute, disclose or take any action in reliance on it. In addition, the 
contents of an attachment to this e-mail may contain software viruses which 
could damage your own computer system. While ColdLight Solutions, LLC has taken 
every reasonable precaution to minimize this risk, we cannot accept liability 
for any damage which you sustain as a result of software viruses. You should 
perform your own virus checks before opening the attachment.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Reading a large file (binary) into RDD

2015-04-03 Thread Vijayasarathy Kannan
Thanks everyone for the inputs.

I guess I will try out a custom implementation of InputFormat. But I have
no idea where to start. Are there any code examples of this that might help?

On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler  wrote:

> This might be overkill for your needs, but the scodec parser combinator
> library might be useful for creating a parser.
>
> https://github.com/scodec/scodec
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Thu, Apr 2, 2015 at 6:53 PM, java8964  wrote:
>
>> I think implementing your own InputFormat and using
>> SparkContext.hadoopFile() is the best option for your case.
>>
>> Yong
>>
>> --
>> From: kvi...@vt.edu
>> Date: Thu, 2 Apr 2015 17:31:30 -0400
>> Subject: Re: Reading a large file (binary) into RDD
>> To: freeman.jer...@gmail.com
>> CC: user@spark.apache.org
>>
>>
>> The file has a specific structure. I outline it below.
>>
>> The input file is basically a representation of a graph.
>>
>> INT
>> INT(A)
>> LONG (B)
>> A INTs(Degrees)
>> A SHORTINTs  (Vertex_Attribute)
>> B INTs
>> B INTs
>> B SHORTINTs
>> B SHORTINTs
>>
>> A - number of vertices
>> B - number of edges (note that the INTs/SHORTINTs associated with this
>> are edge attributes)
>>
>> After reading in the file, I need to create two RDDs (one with vertices
>> and the other with edges)
>>
>> On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman 
>> wrote:
>>
>> Hm, that will indeed be trickier because this method assumes records are
>> the same byte size. Is the file an arbitrary sequence of mixed types, or is
>> there structure, e.g. short, long, short, long, etc.?
>>
>> If you could post a gist with an example of the kind of file and how it
>> should look once read in that would be useful!
>>
>> -
>> jeremyfreeman.net
>> @thefreemanlab
>>
>> On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan  wrote:
>>
>> Thanks for the reply. Unfortunately, in my case, the binary file is a mix
>> of short and long integers. Is there any other way that could be of use here?
>>
>> My current method happens to have a large overhead (much more than actual
>> computation time). Also, I am short of memory at the driver when it has to
>> read the entire file.
>>
>> On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman 
>> wrote:
>>
>> If it’s a flat binary file and each record is the same length (in bytes),
>> you can use Spark’s binaryRecords method (defined on the SparkContext),
>> which loads records from one or more large flat binary files into an RDD.
>> Here’s an example in python to show how it works:
>>
>> # write data from an array
>> from numpy import random
>> dat = random.randn(100,5)
>> f = open('test.bin', 'w')
>> f.write(dat)
>> f.close()
>>
>>
>> # load the data back in
>>
>> from numpy import frombuffer
>>
>> nrecords = 5
>> bytesize = 8
>> recordsize = nrecords * bytesize
>> data = sc.binaryRecords('test.bin', recordsize)
>> parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))
>>
>>
>> # these should be equal
>> parsed.first()
>> dat[0,:]
>>
>>
>> Does that help?
>>
>> -
>> jeremyfreeman.net
>> @thefreemanlab
>>
>> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan  wrote:
>>
>> What are some efficient ways to read a large file into RDDs?
>>
>> For example, have several executors read a specific/unique portion of the
>> file and construct RDDs. Is this possible to do in Spark?
>>
>> Currently, I am doing a line-by-line read of the file at the driver and
>> constructing the RDD.
>>
>>
>>
>>
>>
>>
>


Re: Which OS for Spark cluster nodes?

2015-04-03 Thread Charles Feduke
As Akhil says, Ubuntu is a good choice if you're starting from near scratch.

Cloudera CDH virtual machine images[1] include Hadoop, HDFS, Spark, and
other big data tools so you can get a cluster running with very little
effort. Keep in mind Cloudera is a for-profit corporation so they are also
selling a product.

Personally I prefer the EC2 scripts[2] that ship with the downloadable
Spark distribution. It provisions a cluster for you on AWS and you can
easily terminate the cluster when you don't need it. Ganglia (monitoring),
HDFS (ephemeral and EBS backed), Tachyon (caching), and Spark are all
installed automatically. For learning, using a cluster of 4 medium machines
is fairly inexpensive. (I use the EC2 scripts for both an integration and
production environment.)

1.
http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
2. https://spark.apache.org/docs/latest/ec2-scripts.html

On Fri, Apr 3, 2015 at 7:38 AM Akhil Das  wrote:

> There isn't any specific Linux distro, but I would prefer Ubuntu for a
> beginner as it's very easy to apt-get install things on it.
>
> Thanks
> Best Regards
>
> On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias <
> tobias.horsm...@uni-due.de> wrote:
>
>>  Hi,
>> Are there any recommendations for operating systems that one should use
>> for setting up Spark/Hadoop nodes in general?
>> I am not familiar with the differences between the various linux
>> distributions or how well they are (not) suited for cluster set-ups, so I
>> wondered if there is some preferred choices?
>>
>>  Regards,
>>
>>
>


Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Debasish Das
Cool !

You should also consider contributing it back to Spark if you are doing
quantile calculations, for example... there is also a topByKey API added in
master by @coderxiang... see if you can use that API to make the code
clean.
On Apr 3, 2015 5:20 AM, "Aung Htet"  wrote:

> Hi Debasish, Charles,
>
> I solved the problem by using a BPQ like method, based on your
> suggestions. So thanks very much for that!
>
> My approach was
> 1) Count the population of each segment in the RDD by map/reduce so that I
> get the bound number N equivalent to 10% of each segment. This becomes the
> size of the BPQ.
> 2) Associate the bounds N to the corresponding records in the first RDD.
> 3) Reduce the RDD from step 2 by merging the values in every two rows,
> basically creating a sorted list (Indexed Seq)
> 4) If the size of the sorted list is greater than N (the bound) then,
> create a new sorted list by using a priority queue and dequeuing top N
> values.
>
> In the end, I get a record for each segment with N max values for each
> segment.
>
> Regards,
> Aung
>
>
>
>
>
>
>
>
> On Fri, Mar 27, 2015 at 4:27 PM, Debasish Das 
> wrote:
>
>> In that case you can directly use count-min-sketch from algebird... they
>> work fine with Spark aggregateBy, but I have found the Java BPQ from Spark
>> much faster than, say, algebird's Heap data structure...
>>
>> On Thu, Mar 26, 2015 at 10:04 PM, Charles Hayden <
>> charles.hay...@atigeo.com> wrote:
>>
>>>  ​You could also consider using a count-min data structure such as in
>>> https://github.com/laserson/dsq​
>>>
>>> to get approximate quantiles, then use whatever values you want to
>>> filter the original sequence.
>>>  --
>>> *From:* Debasish Das 
>>> *Sent:* Thursday, March 26, 2015 9:45 PM
>>> *To:* Aung Htet
>>> *Cc:* user
>>> *Subject:* Re: How to get a top X percent of a distribution represented
>>> as RDD
>>>
>>>  Idea is to use a heap and get topK elements from every
>>> partition...then use aggregateBy and for combOp do a merge routine from
>>> mergeSort...basically get 100 items from partition 1, 100 items from
>>> partition 2, merge them so that you get sorted 200 items and take 100...for
>>> merge you can use heap as well...Matei had a BPQ inside Spark which we use
>>> all the time...Passing arrays over wire is better than passing full heap
>>> objects and merge routine on array should run faster but needs experiment...
>>>
>>> On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet  wrote:
>>>
 Hi Debasish,

 Thanks for your suggestions. In-memory version is quite useful. I do
 not quite understand how you can use aggregateBy to get 10% top K elements.
 Can you please give an example?

 Thanks,
 Aung

 On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das >>> > wrote:

> You can do it in-memory as well... get 10% topK elements from each
> partition and use merge from any sort algorithm like timsort... basically
> aggregateBy
>
>  Your version uses shuffle but this version is 0 shuffle..assuming
> your data set is cached you will be using in-memory allReduce through
> treeAggregate...
>
>  But this is only good for top 10% or bottom 10%...if you need to do
> it for top 30% then may be the shuffle version will work better...
>
> On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet  wrote:
>
>> Hi all,
>>
>>  I have a distribution represented as an RDD of tuples, in rows of
>> (segment, score)
>> For each segment, I want to discard tuples with top X percent scores.
>> This seems hard to do in Spark RDD.
>>
>>  A naive algorithm would be -
>>
>>  1) Sort RDD by segment & score (descending)
>> 2) Within each segment, number the rows from top to bottom.
>> 3) For each  segment, calculate the cut off index. i.e. 90 for 10%
>> cut off out of a segment with 100 rows.
>> 4) For the entire RDD, filter rows with row num <= cut off index
>>
>> This does not look like a good algorithm. I would really appreciate
>> if someone can suggest a better way to implement this in Spark.
>>
>>  Regards,
>> Aung
>>
>
>

>>>
>>
>


RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Thanks, Mohammed. Will give it a try today. We would also need the Spark SQL
piece, as we are migrating our data store from Oracle to C* and it would be
easier to maintain all the reports rather than recreating each one from scratch.

Thanks,
Pawan Venugopal.
On Apr 3, 2015 7:59 AM, "Mohammed Guller"  wrote:

>  Hi Todd,
>
>
>
> We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly
> with C* using the ODBC driver, but now would like to add Spark SQL to the
> mix. I haven’t been able to find any documentation for how to make this
> combination work.
>
>
>
> We are using the Spark-Cassandra-Connector in our applications, but
> haven’t been able to figure out how to get the Spark SQL Thrift Server to
> use it and connect to C*. That is the missing piece. Once we solve that
> piece of the puzzle then Tableau should be able to see the tables in C*.
>
>
>
> Hi Pawan,
>
> Tableau + C* is pretty straight forward, especially if you are using DSE.
> Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once
> you connect, Tableau allows to use C* keyspace as schema and column
> families as tables.
>
>
>
> Mohammed
>
>
>
> *From:* pawan kumar [mailto:pkv...@gmail.com]
> *Sent:* Friday, April 3, 2015 7:41 AM
> *To:* Todd Nist
> *Cc:* user@spark.apache.org; Mohammed Guller
> *Subject:* Re: Tableau + Spark SQL Thrift Server + Cassandra
>
>
>
> Hi Todd,
>
> Thanks for the link. I would be interested in this solution. I am using
> DSE for cassandra. Would you provide me with info on connecting with DSE
> either through Tableau or zeppelin. The goal here is query cassandra
> through spark sql so that I could perform joins and groupby on my queries.
> Are you able to perform spark sql queries with tableau?
>
> Thanks,
> Pawan Venugopal
>
> On Apr 3, 2015 5:03 AM, "Todd Nist"  wrote:
>
> What version of Cassandra are you using?  Are you using DSE or the stock
> Apache Cassandra version?  I have connected it with DSE, but have not
> attempted it with the standard Apache Cassandra version.
>
>
>
> FWIW,
> http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
> provides an ODBC driver for accessing C* from Tableau.  Granted it does not
> provide all the goodness of Spark.  Are you attempting to leverage the
> spark-cassandra-connector for this?
>
>
>
>
>
>
>
> On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller 
> wrote:
>
> Hi –
>
>
>
> Is anybody using Tableau to analyze data in Cassandra through the Spark
> SQL Thrift Server?
>
>
>
> Thanks!
>
>
>
> Mohammed
>
>
>
>
>


RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Mohammed Guller
Hi Todd,

We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly with C* 
using the ODBC driver, but now would like to add Spark SQL to the mix. I 
haven’t been able to find any documentation for how to make this combination 
work.

We are using the Spark-Cassandra-Connector in our applications, but haven’t 
been able to figure out how to get the Spark SQL Thrift Server to use it and 
connect to C*. That is the missing piece. Once we solve that piece of the 
puzzle then Tableau should be able to see the tables in C*.
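
One sketch that has been suggested for this (untested here; the class, keyspace and table
names are assumptions, and it presumes the spark-cassandra-connector plus
HiveThriftServer2.startWithContext are available in your Spark/connector versions) is to
start the Thrift server from a small job that has already registered the C* tables:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
import com.datastax.spark.connector._

// Sketch: read a C* table through the connector, register it as a temp table,
// then expose the same context over JDBC/ODBC so Tableau can query it.
case class Order(id: String, amount: Double)

object CassandraThriftServer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-thrift")
      .set("spark.cassandra.connection.host", "127.0.0.1") // your C* contact point
    val sc = new SparkContext(conf)
    val hive = new HiveContext(sc)

    val orders = hive.createDataFrame(sc.cassandraTable[Order]("my_keyspace", "orders"))
    orders.registerTempTable("orders")

    // Temp tables on this context become visible to Thrift (JDBC/ODBC) clients.
    HiveThriftServer2.startWithContext(hive)
  }
}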

Hi Pawan,
Tableau + C* is pretty straightforward, especially if you are using DSE.
Create a new DSN in Tableau using the ODBC driver that comes with DSE. Once you
connect, Tableau allows you to use the C* keyspace as a schema and column families as
tables.

Mohammed

From: pawan kumar [mailto:pkv...@gmail.com]
Sent: Friday, April 3, 2015 7:41 AM
To: Todd Nist
Cc: user@spark.apache.org; Mohammed Guller
Subject: Re: Tableau + Spark SQL Thrift Server + Cassandra


Hi Todd,

Thanks for the link. I would be interested in this solution. I am using DSE for 
cassandra. Would you provide me with info on connecting with DSE either through 
Tableau or zeppelin. The goal here is query cassandra through spark sql so that 
I could perform joins and groupby on my queries. Are you able to perform spark 
sql queries with tableau?

Thanks,
Pawan Venugopal
On Apr 3, 2015 5:03 AM, "Todd Nist" 
mailto:tsind...@gmail.com>> wrote:
What version of Cassandra are you using?  Are you using DSE or the stock Apache 
Cassandra version?  I have connected it with DSE, but have not attempted it 
with the standard Apache Cassandra version.

FWIW, 
http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
provides an ODBC driver for accessing C* from Tableau.  Granted it does not
provide all the goodness of Spark.  Are you attempting to leverage the 
spark-cassandra-connector for this?



On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller 
mailto:moham...@glassbeam.com>> wrote:
Hi –

Is anybody using Tableau to analyze data in Cassandra through the Spark SQL 
Thrift Server?

Thanks!

Mohammed




Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread pawan kumar
Hi Todd,

Thanks for the link. I would be interested in this solution. I am using DSE
for Cassandra. Could you provide me with info on connecting with DSE either
through Tableau or Zeppelin? The goal here is to query Cassandra through Spark
SQL so that I can perform joins and group-bys in my queries. Are you able
to perform Spark SQL queries with Tableau?

Thanks,
Pawan Venugopal
On Apr 3, 2015 5:03 AM, "Todd Nist"  wrote:

> What version of Cassandra are you using?  Are you using DSE or the stock
> Apache Cassandra version?  I have connected it with DSE, but have not
> attempted it with the standard Apache Cassandra version.
>
> FWIW,
> http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
> provides an ODBC driver for accessing C* from Tableau.  Granted it does not
> provide all the goodness of Spark.  Are you attempting to leverage the
> spark-cassandra-connector for this?
>
>
>
> On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller 
> wrote:
>
>>  Hi –
>>
>>
>>
>> Is anybody using Tableau to analyze data in Cassandra through the Spark
>> SQL Thrift Server?
>>
>>
>>
>> Thanks!
>>
>>
>>
>> Mohammed
>>
>>
>>
>
>


Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Frank Austin Nothaft
You’ll definitely want to use a Kryo-based serializer for Avro. We have a Kryo 
based serializer that wraps the Avro efficient serializer here.
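
For reference, a rough sketch of the idea (this is not the linked serializer itself; it
just shows wrapping Avro's specific binary encoding in a Kryo Serializer, which you would
then hook up via a KryoRegistrator):

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import java.io.ByteArrayOutputStream
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter, SpecificRecordBase}

// Serializes an Avro SpecificRecord through Avro's binary encoding instead of
// Java serialization (which generated Avro classes do not support).
class AvroSpecificSerializer[T <: SpecificRecordBase](clazz: Class[T]) extends Serializer[T] {
  private val writer = new SpecificDatumWriter[T](clazz)
  private val reader = new SpecificDatumReader[T](clazz)

  override def write(kryo: Kryo, output: Output, record: T): Unit = {
    val bytes = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(bytes, null)
    writer.write(record, encoder)
    encoder.flush()
    val arr = bytes.toByteArray
    output.writeInt(arr.length, true)  // length-prefix so read() knows how much to consume
    output.writeBytes(arr)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[T]): T = {
    val arr = input.readBytes(input.readInt(true))
    val decoder = DecoderFactory.get().binaryDecoder(arr, null)
    reader.read(null.asInstanceOf[T], decoder)
  }
}

// Wired up from a custom KryoRegistrator, e.g.
//   kryo.register(classOf[SpsLevelMetricSum],
//     new AvroSpecificSerializer(classOf[SpsLevelMetricSum]))
// together with spark.serializer=org.apache.spark.serializer.KryoSerializer.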

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Apr 3, 2015, at 5:41 AM, Akhil Das  wrote:

> Because, its throwing up serializable exceptions and kryo is a serializer to 
> serialize your objects.
> 
> Thanks
> Best Regards
> 
> On Fri, Apr 3, 2015 at 5:37 PM, Deepak Jain  wrote:
> I meant that I did not have to use kyro. Why will kyro help fix this issue 
> now ?
> 
> Sent from my iPhone
> 
> On 03-Apr-2015, at 5:36 pm, Deepak Jain  wrote:
> 
>> I was able to write record that extends specificrecord (avro) this class was 
>> not auto generated. Do we need to do something extra for auto generated 
>> classes 
>> 
>> Sent from my iPhone
>> 
>> On 03-Apr-2015, at 5:06 pm, Akhil Das  wrote:
>> 
>>> This thread might give you some insights 
>>> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
>>> 
>>> Thanks
>>> Best Regards
>>> 
>>> On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>>> My Spark Job failed with
>>> 
>>> 
>>> 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed: 
>>> saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
>>> 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception: 
>>> Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not 
>>> serializable result: 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>>> Serialization stack:
>>> - object not serializable (class: 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
>>> {"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": 
>>> null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
>>> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
>>> - object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0, 
>>> "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt": 
>>> null, "currPsLvlId": null}))
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
>>> in stage 2.0 (TID 0) had a not serializable result: 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>>> Serialization stack:
>>> - object not serializable (class: 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
>>> {"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": 
>>> null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
>>> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
>>> - object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0, 
>>> "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt": 
>>> null, "currPsLvlId": null}))
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
>>> at 
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>> at scala.Option.foreach(Option.scala:236)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>>> at 
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>>> at 
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>> 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED, 
>>> exitCode: 15, (reason: User class threw exception: Job aborted due to stage 
>>> failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result: 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>>> Serialization stack:
>>> 
>>> 
>>> 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto 
>>> generated through avro schema using avro-generate-sources maven plugin.
>>> 
>>> 
>>> package com.ebay.ep.poc.spark.reporting.process.model.dw;  
>>> 
>>> @SuppressWarnings("all")
>>> 
>>> @org.apache.avro.specific.AvroGenerated
>>> 
>>> public class SpsLevelMetricSum extends 
>>> org.apache.avro.specific.SpecificRecordBase i

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
Thanks Dean - fun hack :)

On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler  wrote:

> A hack workaround is to use flatMap:
>
> rdd.flatMap{ case (date, array) => for (x <- array) yield (date, x) }
>
> For those of you who don't know Scala, the for comprehension iterates
> through the ArrayBuffer, named "array" and yields new tuples with the date
> and each element. The case expression to the left of the => pattern matches
> on the input tuples.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee  wrote:
>
>> Thanks Michael - that was it!  I was drawing a blank on this one for some
>> reason - much appreciated!
>>
>>
>> On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust 
>> wrote:
>>
>>> A lateral view explode using HiveQL.  I'm hoping to add explode
>>> shorthand directly to the df API in 1.4.
>>>
>>> On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee  wrote:
>>>
 Quick question - the output of a dataframe is in the format of:

 [2015-04, ArrayBuffer(A, B, C, D)]

 and I'd like to return it as:

 2015-04, A
 2015-04, B
 2015-04, C
 2015-04, D

 What's the best way to do this?

 Thanks in advance!



>>>
>


Spark unit test fails

2015-04-03 Thread Manas Kar
Hi experts,
 I am trying to write unit tests for my spark application which fails with
javax.servlet.FilterRegistration error.

I am using CDH5.3.2 Spark and below is my dependencies list.
val spark   = "1.2.0-cdh5.3.2"
val esriGeometryAPI = "1.2"
val csvWriter   = "1.0.0"
val hadoopClient= "2.3.0"
val scalaTest   = "2.2.1"
val jodaTime= "1.6.0"
val scalajHTTP  = "1.0.1"
val avro= "1.7.7"
val scopt   = "3.2.0"
val config  = "1.2.1"
val jobserver   = "0.4.1"
val excludeJBossNetty = ExclusionRule(organization = "org.jboss.netty")
val excludeIONetty = ExclusionRule(organization = "io.netty")
val excludeEclipseJetty = ExclusionRule(organization =
"org.eclipse.jetty")
val excludeMortbayJetty = ExclusionRule(organization =
"org.mortbay.jetty")
val excludeAsm = ExclusionRule(organization = "org.ow2.asm")
val excludeOldAsm = ExclusionRule(organization = "asm")
val excludeCommonsLogging = ExclusionRule(organization =
"commons-logging")
val excludeSLF4J = ExclusionRule(organization = "org.slf4j")
val excludeScalap = ExclusionRule(organization = "org.scala-lang",
artifact = "scalap")
val excludeHadoop = ExclusionRule(organization = "org.apache.hadoop")
val excludeCurator = ExclusionRule(organization = "org.apache.curator")
val excludePowermock = ExclusionRule(organization = "org.powermock")
val excludeFastutil = ExclusionRule(organization = "it.unimi.dsi")
val excludeJruby = ExclusionRule(organization = "org.jruby")
val excludeThrift = ExclusionRule(organization = "org.apache.thrift")
val excludeServletApi = ExclusionRule(organization = "javax.servlet",
artifact = "servlet-api")
val excludeJUnit = ExclusionRule(organization = "junit")

I found the link (
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-SecurityException-when-running-tests-with-Spark-1-0-0-td6747.html#a6749
) talking about the issue and the work around of the same.
But that work around does not get rid of the problem for me.
I am using an SBT build which can't be changed to maven.

What am I missing?
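
For reference, one way such rules are typically applied in the build (module names below
are assumptions, not your exact build):

// build.sbt sketch: the rules above only take effect once they are attached to the
// dependencies that pull in a conflicting javax.servlet, e.g. the Spark and Hadoop artifacts.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % spark % "provided")
    .excludeAll(excludeServletApi, excludeEclipseJetty, excludeMortbayJetty),
  ("org.apache.hadoop" % "hadoop-client" % hadoopClient)
    .excludeAll(excludeServletApi, excludeMortbayJetty)
)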


Stack trace
-
[info] FiltersRDDSpec:
[info] - Spark Filter *** FAILED ***
[info]   java.lang.SecurityException: class
"javax.servlet.FilterRegistration"'s signer information does not match
signer information of other classes in the same package
[info]   at java.lang.ClassLoader.checkCerts(Unknown Source)
[info]   at java.lang.ClassLoader.preDefineClass(Unknown Source)
[info]   at java.lang.ClassLoader.defineClass(Unknown Source)
[info]   at java.security.SecureClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.access$100(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.security.AccessController.doPrivileged(Native Method)
[info]   at java.net.URLClassLoader.findClass(Unknown Source)

Thanks
Manas


Re: Reading a large file (binary) into RDD

2015-04-03 Thread Dean Wampler
This might be overkill for your needs, but the scodec parser combinator
library might be useful for creating a parser.

https://github.com/scodec/scodec
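
A very rough flavour of it (combinator names from scodec.codecs; the exact decode result
type depends on the scodec version, so it is left inferred here):

import scodec.bits.BitVector
import scodec.codecs._

// Assumed layout for illustration: two 32-bit ints followed by a 64-bit long,
// matching the header described earlier in the thread.
val header = int32 ~ int32 ~ int64

def decodeHeader(bytes: Array[Byte]) =
  header.decode(BitVector(bytes))  // on success, yields ((a, b), c) plus the remaining bits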

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964  wrote:

> I think implementing your own InputFormat and using
> SparkContext.hadoopFile() is the best option for your case.
>
> Yong
>
> --
> From: kvi...@vt.edu
> Date: Thu, 2 Apr 2015 17:31:30 -0400
> Subject: Re: Reading a large file (binary) into RDD
> To: freeman.jer...@gmail.com
> CC: user@spark.apache.org
>
>
> The file has a specific structure. I outline it below.
>
> The input file is basically a representation of a graph.
>
> INT
> INT(A)
> LONG (B)
> A INTs(Degrees)
> A SHORTINTs  (Vertex_Attribute)
> B INTs
> B INTs
> B SHORTINTs
> B SHORTINTs
>
> A - number of vertices
> B - number of edges (note that the INTs/SHORTINTs associated with this are
> edge attributes)
>
> After reading in the file, I need to create two RDDs (one with vertices
> and the other with edges)
>
> On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman 
> wrote:
>
> Hm, that will indeed be trickier because this method assumes records are
> the same byte size. Is the file an arbitrary sequence of mixed types, or is
> there structure, e.g. short, long, short, long, etc.?
>
> If you could post a gist with an example of the kind of file and how it
> should look once read in that would be useful!
>
> -
> jeremyfreeman.net
> @thefreemanlab
>
> On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan  wrote:
>
> Thanks for the reply. Unfortunately, in my case, the binary file is a mix
> of short and long integers. Is there any other way that could be of use here?
>
> My current method happens to have a large overhead (much more than actual
> computation time). Also, I am short of memory at the driver when it has to
> read the entire file.
>
> On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman 
> wrote:
>
> If it’s a flat binary file and each record is the same length (in bytes),
> you can use Spark’s binaryRecords method (defined on the SparkContext),
> which loads records from one or more large flat binary files into an RDD.
> Here’s an example in python to show how it works:
>
> # write data from an array
> from numpy import random
> dat = random.randn(100,5)
> f = open('test.bin', 'w')
> f.write(dat)
> f.close()
>
>
> # load the data back in
>
> from numpy import frombuffer
>
> nrecords = 5
> bytesize = 8
> recordsize = nrecords * bytesize
> data = sc.binaryRecords('test.bin', recordsize)
> parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))
>
>
> # these should be equal
> parsed.first()
> dat[0,:]
>
>
> Does that help?
>
> -
> jeremyfreeman.net
> @thefreemanlab
>
> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan  wrote:
>
> What are some efficient ways to read a large file into RDDs?
>
> For example, have several executors read a specific/unique portion of the
> file and construct RDDs. Is this possible to do in Spark?
>
> Currently, I am doing a line-by-line read of the file at the driver and
> constructing the RDD.
>
>
>
>
>
>


Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Dean Wampler
A hack workaround is to use flatMap:

rdd.flatMap{ case (date, array) => for (x <- array) yield (date, x) }

For those of you who don't know Scala, the for comprehension iterates
through the ArrayBuffer, named "array" and yields new tuples with the date
and each element. The case expression to the left of the => pattern matches
on the input tuples.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee  wrote:

> Thanks Michael - that was it!  I was drawing a blank on this one for some
> reason - much appreciated!
>
>
> On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust 
> wrote:
>
>> A lateral view explode using HiveQL.  I'm hoping to add explode
>> shorthand directly to the df API in 1.4.
>>
>> On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee  wrote:
>>
>>> Quick question - the output of a dataframe is in the format of:
>>>
>>> [2015-04, ArrayBuffer(A, B, C, D)]
>>>
>>> and I'd like to return it as:
>>>
>>> 2015-04, A
>>> 2015-04, B
>>> 2015-04, C
>>> 2015-04, D
>>>
>>> What's the best way to do this?
>>>
>>> Thanks in advance!
>>>
>>>
>>>
>>


Re: Spark 1.3 UDF ClassNotFoundException

2015-04-03 Thread Markus Ganter
My apologies. I was running this locally, and the JAR I was building
using IntelliJ had some issues.
This was not related to UDFs. All works fine now.
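
For anyone hitting this later: in 1.3 the DataFrame-style udf needs no registration, while
the SQL-string route goes through sqlContext.udf.register. A small sketch (df, sqlContext
and the "scores" temp table are assumed, as in the quoted sample):

import org.apache.spark.sql.functions.udf

// DataFrame API: no registration needed.
val predict = udf((score: Double) => score > 0.5)
df.select(predict(df("score")))

// SQL-string API: register by name first; assumes df.registerTempTable("scores") was called.
sqlContext.udf.register("predict", (score: Double) => score > 0.5)
sqlContext.sql("SELECT predict(score) FROM scores")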

On Thu, Apr 2, 2015 at 2:58 PM, Ted Yu  wrote:
> Can you show more code in CreateMasterData ?
>
> How do you run your code ?
>
> Thanks
>
> On Thu, Apr 2, 2015 at 11:06 AM, ganterm  wrote:
>>
>> Hello,
>>
>> I started to use the dataframe API in Spark 1.3 with Scala.
>> I am trying to implement a UDF and am following the sample here:
>>
>> https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.UserDefinedFunction
>> meaning
>> val predict = udf((score: Double) => if (score > 0.5) true else false)
>> df.select( predict(df("score")) )
>> All compiles just fine but when I run it, I get a ClassNotFoundException
>> (see more details below)
>> I am sure that I load the data correctly and that I have a field called
>> "score" with the correct data type.
>> Do I need to do anything else like registering the function?
>>
>> Thanks!
>> Markus
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due
>> to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure:
>> Lost task 0.3 in stage 6.0 (TID 11, BillSmithPC):
>> java.lang.ClassNotFoundException: test.CreateMasterData$$anonfun$1
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:270)
>> at
>>
>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
>> ...
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-UDF-ClassNotFoundException-tp22361.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Akhil Das
Because it's throwing up serialization exceptions, and Kryo is a serializer
to serialize your objects.
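
A minimal sketch of switching to Kryo and registering the generated class (class name taken
from the stack trace below; whether Kryo's default handling is enough for your Avro type may
still need checking):

import org.apache.spark.SparkConf

// Enable Kryo and register the Avro-generated class so Spark does not fall back
// to Java serialization for it.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))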

Thanks
Best Regards

On Fri, Apr 3, 2015 at 5:37 PM, Deepak Jain  wrote:

> I meant that I did not have to use kyro. Why will kyro help fix this issue
> now ?
>
> Sent from my iPhone
>
> On 03-Apr-2015, at 5:36 pm, Deepak Jain  wrote:
>
> I was able to write record that extends specificrecord (avro) this class
> was not auto generated. Do we need to do something extra for auto generated
> classes
>
> Sent from my iPhone
>
> On 03-Apr-2015, at 5:06 pm, Akhil Das  wrote:
>
> This thread might give you some insights
> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
>
> Thanks
> Best Regards
>
> On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>
>> My Spark Job failed with
>>
>>
>> 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed:
>> saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
>> 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw
>> exception: Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0)
>> had a not serializable result:
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>> Serialization stack:
>> - object not serializable (class:
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
>> {"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt":
>> null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
>> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
>> - object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0,
>> "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt":
>> null, "currPsLvlId": null}))
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 0.0 in stage 2.0 (TID 0) had a not serializable result:
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>> Serialization stack:
>> - object not serializable (class:
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
>> {"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt":
>> null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
>> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
>> - object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0,
>> "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt":
>> null, "currPsLvlId": null}))
>> at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>> at scala.Option.foreach(Option.scala:236)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED,
>> exitCode: 15, (reason: User class threw exception: Job aborted due to stage
>> failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result:
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>> Serialization stack:
>>
>> 
>>
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is
>> auto generated through avro schema using avro-generate-sources maven plugin.
>>
>>
>> package com.ebay.ep.poc.spark.reporting.process.model.dw;
>>
>> @SuppressWarnings("all")
>>
>> @org.apache.avro.specific.AvroGenerated
>>
>> public class SpsLevelMetricSum extends
>> org.apache.avro.specific.SpecificRecordBase implements
>> org.apache.avro.specific.SpecificRecord {
>> ...
>> ...
>> }
>>
>> Can anyone suggest how to fix this ?
>>
>>
>>
>> --
>> Deepak
>>
>>
>


Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Aung Htet
Hi Debasish, Charles,

I solved the problem by using a BPQ like method, based on your suggestions.
So thanks very much for that!

My approach was
1) Count the population of each segment in the RDD by map/reduce so that I
get the bound number N equivalent to 10% of each segment. This becomes the
size of the BPQ.
2) Associate the bounds N to the corresponding records in the first RDD.
3) Reduce the RDD from step 2 by merging the values in every two rows,
basically creating a sorted list (Indexed Seq)
4) If the size of the sorted list is greater than N (the bound) then,
create a new sorted list by using a priority queue and dequeuing top N
values.

In the end, I get a record for each segment with N max values for each
segment.
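
A rough sketch of those steps, assuming an RDD[(String, Double)] of (segment, score) and a
10% cut-off (names are illustrative; a real bounded priority queue would avoid re-sorting
on every merge):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def topTenPercentPerSegment(scores: RDD[(String, Double)]): RDD[(String, List[Double])] = {
  // Step 1: per-segment counts, turned into the per-segment bound N (~10%).
  val bounds: Map[String, Int] = scores
    .mapValues(_ => 1L)
    .reduceByKey(_ + _)
    .mapValues(c => math.max(1, (c / 10).toInt))
    .collectAsMap()
    .toMap
  val bcBounds = scores.sparkContext.broadcast(bounds)

  scores
    // Step 2: attach the bound N to each record.
    .map { case (segment, score) => (segment, (bcBounds.value(segment), List(score))) }
    // Steps 3-4: merge pairwise, keeping only the top N scores per segment.
    .reduceByKey { (a, b) =>
      val (n, left) = a
      val (_, right) = b
      (n, (left ++ right).sorted(Ordering[Double].reverse).take(n))
    }
    .mapValues { case (_, top) => top }
}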

Regards,
Aung








On Fri, Mar 27, 2015 at 4:27 PM, Debasish Das 
wrote:

> In that case you can directly use count-min-sketch from algebird... they
> work fine with Spark aggregateBy, but I have found the Java BPQ from Spark
> much faster than, say, algebird's Heap data structure...
>
> On Thu, Mar 26, 2015 at 10:04 PM, Charles Hayden <
> charles.hay...@atigeo.com> wrote:
>
>>  ​You could also consider using a count-min data structure such as in
>> https://github.com/laserson/dsq​
>>
>> to get approximate quantiles, then use whatever values you want to filter
>> the original sequence.
>>  --
>> *From:* Debasish Das 
>> *Sent:* Thursday, March 26, 2015 9:45 PM
>> *To:* Aung Htet
>> *Cc:* user
>> *Subject:* Re: How to get a top X percent of a distribution represented
>> as RDD
>>
>>  Idea is to use a heap and get topK elements from every partition...then
>> use aggregateBy and for combOp do a merge routine from
>> mergeSort...basically get 100 items from partition 1, 100 items from
>> partition 2, merge them so that you get sorted 200 items and take 100...for
>> merge you can use heap as well...Matei had a BPQ inside Spark which we use
>> all the time...Passing arrays over wire is better than passing full heap
>> objects and merge routine on array should run faster but needs experiment...
>>
>> On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet  wrote:
>>
>>> Hi Debasish,
>>>
>>> Thanks for your suggestions. In-memory version is quite useful. I do not
>>> quite understand how you can use aggregateBy to get 10% top K elements. Can
>>> you please give an example?
>>>
>>> Thanks,
>>> Aung
>>>
>>> On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das 
>>> wrote:
>>>
 You can do it in-memory as well... get 10% topK elements from each
 partition and use merge from any sort algorithm like timsort... basically
 aggregateBy

  Your version uses shuffle but this version is 0 shuffle..assuming
 your data set is cached you will be using in-memory allReduce through
 treeAggregate...

  But this is only good for top 10% or bottom 10%...if you need to do
 it for top 30% then may be the shuffle version will work better...

 On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet  wrote:

> Hi all,
>
>  I have a distribution represented as an RDD of tuples, in rows of
> (segment, score)
> For each segment, I want to discard tuples with top X percent scores.
> This seems hard to do in Spark RDD.
>
>  A naive algorithm would be -
>
>  1) Sort RDD by segment & score (descending)
> 2) Within each segment, number the rows from top to bottom.
> 3) For each  segment, calculate the cut off index. i.e. 90 for 10% cut
> off out of a segment with 100 rows.
> 4) For the entire RDD, filter rows with row num <= cut off index
>
> This does not look like a good algorithm. I would really appreciate if
> someone can suggest a better way to implement this in Spark.
>
>  Regards,
> Aung
>


>>>
>>
>


Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Deepak Jain
I meant that I did not have to use Kryo. Why will Kryo help fix this issue now?

Sent from my iPhone

> On 03-Apr-2015, at 5:36 pm, Deepak Jain  wrote:
> 
> I was able to write record that extends specificrecord (avro) this class was 
> not auto generated. Do we need to do something extra for auto generated 
> classes 
> 
> Sent from my iPhone
> 
>> On 03-Apr-2015, at 5:06 pm, Akhil Das  wrote:
>> 
>> This thread might give you some insights 
>> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
>> 
>> Thanks
>> Best Regards
>> 
>>> On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>>> My Spark Job failed with
>>> 
>>> 
>>> 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed: 
>>> saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
>>> 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception: 
>>> Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not 
>>> serializable result: 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>>> Serialization stack:
>>> - object not serializable (class: 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
>>> {"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": 
>>> null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
>>> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
>>> - object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0, 
>>> "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt": 
>>> null, "currPsLvlId": null}))
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
>>> in stage 2.0 (TID 0) had a not serializable result: 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>>> Serialization stack:
>>> - object not serializable (class: 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
>>> {"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": 
>>> null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
>>> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
>>> - object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0, 
>>> "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt": 
>>> null, "currPsLvlId": null}))
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
>>> at 
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>> at scala.Option.foreach(Option.scala:236)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>>> at 
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>>> at 
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>> 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED, 
>>> exitCode: 15, (reason: User class threw exception: Job aborted due to stage 
>>> failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result: 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>>> Serialization stack:
>>> 
>>> 
>>> 
>>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto 
>>> generated through avro schema using avro-generate-sources maven plugin.
>>> 
>>> 
>>> package com.ebay.ep.poc.spark.reporting.process.model.dw;  
>>> 
>>> @SuppressWarnings("all")
>>> 
>>> @org.apache.avro.specific.AvroGenerated
>>> 
>>> public class SpsLevelMetricSum extends 
>>> org.apache.avro.specific.SpecificRecordBase implements 
>>> org.apache.avro.specific.SpecificRecord {
>>> 
>>> ...
>>> ...
>>> }
>>> 
>>> Can anyone suggest how to fix this ?
>>> 
>>> 
>>> 
>>> -- 
>>> Deepak
>> 


Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Deepak Jain
I was able to write a record that extends SpecificRecord (Avro); that class was
not auto-generated. Do we need to do something extra for auto-generated classes?

Sent from my iPhone

> On 03-Apr-2015, at 5:06 pm, Akhil Das  wrote:
> 
> This thread might give you some insights 
> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
> 
> Thanks
> Best Regards
> 
>> On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>> My Spark Job failed with
>> 
>> 
>> 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed: 
>> saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
>> 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception: 
>> Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not 
>> serializable result: 
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>> Serialization stack:
>>  - object not serializable (class: 
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
>> {"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": 
>> null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
>>  - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
>>  - object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0, 
>> "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt": 
>> null, "currPsLvlId": null}))
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
>> in stage 2.0 (TID 0) had a not serializable result: 
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>> Serialization stack:
>>  - object not serializable (class: 
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value: 
>> {"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": 
>> null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
>>  - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
>>  - object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0, 
>> "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt": 
>> null, "currPsLvlId": null}))
>>  at 
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
>>  at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>>  at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
>>  at 
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>  at 
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
>>  at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>  at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>  at scala.Option.foreach(Option.scala:236)
>>  at 
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>>  at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>>  at 
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED, 
>> exitCode: 15, (reason: User class threw exception: Job aborted due to stage 
>> failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result: 
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
>> Serialization stack:
>> 
>> 
>> 
>> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto 
>> generated through avro schema using avro-generate-sources maven plugin.
>> 
>> 
>> package com.ebay.ep.poc.spark.reporting.process.model.dw;  
>> 
>> @SuppressWarnings("all")
>> 
>> @org.apache.avro.specific.AvroGenerated
>> 
>> public class SpsLevelMetricSum extends 
>> org.apache.avro.specific.SpecificRecordBase implements 
>> org.apache.avro.specific.SpecificRecord {
>> 
>> ...
>> ...
>> }
>> 
>> Can anyone suggest how to fix this ?
>> 
>> 
>> 
>> -- 
>> Deepak
> 


Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
What version of Cassandra are you using?  Are you using DSE or the stock
Apache Cassandra version?  I have connected it with DSE, but have not
attempted it with the standard Apache Cassandra version.

FWIW,
http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise,
provides an ODBC driver for accessing C* from Tableau.  Granted it does not
provide all the goodness of Spark.  Are you attempting to leverage the
spark-cassandra-connector for this?



On Thu, Apr 2, 2015 at 10:20 PM, Mohammed Guller 
wrote:

>  Hi –
>
>
>
> Is anybody using Tableau to analyze data in Cassandra through the Spark
> SQL Thrift Server?
>
>
>
> Thanks!
>
>
>
> Mohammed
>
>
>


Parquet timestamp support for Hive?

2015-04-03 Thread Rex Xiong
Hi,

I got this error when creating a hive table from parquet file:
DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.UnsupportedOperationException: Parquet does not support
timestamp. See HIVE-6384

I checked HIVE-6384; it's fixed in 0.14.
The Hive in the Spark build is a customized version, 0.13.1a
(GroupId: org.spark-project.hive). Is it possible to get the source code
for it and apply the patch from HIVE-6384?
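
Until that is available, one workaround is to avoid the TIMESTAMP type in the Parquet-backed table and store epoch millis as BIGINT, converting at query time. A sketch against a HiveContext (table and column names are made up):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)

// Keep the time column as BIGINT (epoch millis) so the DDL succeeds on Hive 0.13.x
hc.sql("CREATE TABLE IF NOT EXISTS events (id STRING, event_ts BIGINT) STORED AS PARQUET")

// Convert back to a readable timestamp at query time
hc.sql("SELECT id, from_unixtime(CAST(event_ts / 1000 AS BIGINT)) AS event_time FROM events LIMIT 10")
  .collect().foreach(println)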

Thanks


Re: Which OS for Spark cluster nodes?

2015-04-03 Thread Akhil Das
There isn't any specific Linux distro, but I would prefer Ubuntu for a
beginner, as it's very easy to apt-get install packages on it.

Thanks
Best Regards

On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias  wrote:

>  Hi,
> Are there any recommendations for operating systems that one should use
> for setting up Spark/Hadoop nodes in general?
> I am not familiar with the differences between the various Linux
> distributions or how well they are (not) suited for cluster set-ups, so I
> wondered if there are any preferred choices?
>
>  Regards,
>
>


Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Akhil Das
This thread might give you some insights
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
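
In the meantime, the two usual fixes are either to switch the result serializer to Kryo and register the Avro-generated class, or to map the record into plain Scala values before any action ships results back to the driver. A rough sketch only (the commented RDD name and getters are illustrative, not taken from your job):

import org.apache.spark.{SparkConf, SparkContext}

// Option 1: serialize task results with Kryo and register the generated class.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
val sc = new SparkContext(conf)

// Option 2: keep the default serializer, but reduce the Avro record to plain values
// before anything leaves the executors (getter names assumed from the fields shown below).
// val plain = avroPairs.map { case (k, rec) => (k, (rec.getUserId, rec.getSpsPrgrmId)) }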

Thanks
Best Regards

On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> My Spark Job failed with
>
>
> 15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed:
> saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
> 15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw
> exception: Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0)
> had a not serializable result:
> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
> Serialization stack:
> - object not serializable (class:
> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
> {"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt":
> null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0,
> "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt":
> null, "currPsLvlId": null}))
> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0.0 in stage 2.0 (TID 0) had a not serializable result:
> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
> Serialization stack:
> - object not serializable (class:
> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
> {"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt":
> null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0,
> "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt":
> null, "currPsLvlId": null}))
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 15, (reason: User class threw exception: Job aborted due to stage
> failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result:
> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
> Serialization stack:
>
> 
>
> com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto
> generated through an Avro schema using the avro-generate-sources maven plugin.
>
>
> package com.ebay.ep.poc.spark.reporting.process.model.dw;
>
> @SuppressWarnings("all")
>
> @org.apache.avro.specific.AvroGenerated
>
> public class SpsLevelMetricSum extends
> org.apache.avro.specific.SpecificRecordBase implements
> org.apache.avro.specific.SpecificRecord {
> ...
> ...
> }
>
> Can anyone suggest how to fix this ?
>
>
>
> --
> Deepak
>
>


Which OS for Spark cluster nodes?

2015-04-03 Thread Horsmann, Tobias
Hi,
Are there any recommendations for operating systems that one should use for 
setting up Spark/Hadoop nodes in general?
I am not familiar with the differences between the various Linux distributions 
or how well they are (not) suited for cluster set-ups, so I wondered if there 
are any preferred choices?

Regards,



Re: Reply: maven compile error

2015-04-03 Thread Ted Yu
Can you include -X in your maven command and pastebin the output?

Cheers



> On Apr 3, 2015, at 3:58 AM, myelinji  wrote:
> 
> Thank you for your reply. When I'm using Maven to compile the whole project, 
> the errors are as follows:
> 
> [INFO] Spark Project Parent POM .. SUCCESS [4.136s]
> [INFO] Spark Project Networking .. SUCCESS [7.405s]
> [INFO] Spark Project Shuffle Streaming Service ... SUCCESS [5.071s]
> [INFO] Spark Project Core  SUCCESS [3:08.445s]
> [INFO] Spark Project Bagel ... SUCCESS [21.613s]
> [INFO] Spark Project GraphX .. SUCCESS [58.915s]
> [INFO] Spark Project Streaming ... SUCCESS [1:26.202s]
> [INFO] Spark Project Catalyst  FAILURE [1.537s]
> [INFO] Spark Project SQL . SKIPPED
> [INFO] Spark Project ML Library .. SKIPPED
> [INFO] Spark Project Tools ... SKIPPED
> [INFO] Spark Project Hive  SKIPPED
> [INFO] Spark Project REPL  SKIPPED
> [INFO] Spark Project Assembly  SKIPPED
> [INFO] Spark Project External Twitter  SKIPPED
> [INFO] Spark Project External Flume Sink . SKIPPED
> [INFO] Spark Project External Flume .. SKIPPED
> [INFO] Spark Project External MQTT ... SKIPPED
> [INFO] Spark Project External ZeroMQ . SKIPPED
> [INFO] Spark Project External Kafka .. SKIPPED
> [INFO] Spark Project Examples  SKIPPED
> 
> It seems like there is something wrong with the catalyst project. Why can't I 
> compile this project?
> 
> 
> --
> From: Sean Owen
> Sent: Friday, April 3, 2015 17:48
> To: myelinji
> Cc: spark user group
> Subject: Re: maven compile error
> 
> If you're asking about a compile error, you should include the command
> you used to compile.
> 
> I am able to compile branch 1.2 successfully with "mvn -DskipTests
> clean package".
> 
> This error is actually an error from scalac, not a compile error from
> the code. It sort of sounds like it has not been able to download
> scala dependencies. Check or maybe recreate your environment.
> 
> On Fri, Apr 3, 2015 at 3:19 AM, myelinji  wrote:
> > Hi, all:
> > Just now I checked out spark-1.2 on GitHub and wanted to build it with Maven;
> > however, I encountered an error during compiling:
> >
> > [INFO]
> > 
> > [ERROR] Failed to execute goal
> > net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on
> > project spark-catalyst_2.10: wrap:
> > scala.reflect.internal.MissingRequirementError: object scala.runtime in
> > compiler mirror not found. -> [Help 1]
> > org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> > goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile
> > (scala-compile-first) on project spark-catalyst_2.10: wrap:
> > scala.reflect.internal.MissingRequirementError: object scala.runtime in
> > compiler mirror not found.
> > at
> > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
> > at
> > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> > at
> > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> > at
> > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
> > at
> > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
> > at
> > org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
> > at
> > org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
> > at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
> > at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
> > at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
> > at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
> > at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > at
> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:606)
> > at
> > org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
> > at
> > org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
> > at
> > org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
> > at org.codehaus.plexus.classworlds.launcher.Lau

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Hi Deepujain,

I did include the jar file, I believe it is hive-exec.jar, through the
--jars option:

./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars
/opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

Results in the same error.  I'm going to do the rebuild in a few minutes.

Thanks for the assistance.

-Todd



On Fri, Apr 3, 2015 at 6:30 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> I think you need to include the jar file through --jars option that
> contains the hive definition (code) of UDF json_tuple. That should solve
> your problem.
>
> On Fri, Apr 3, 2015 at 3:57 PM, Todd Nist  wrote:
>
>> I placed it there.  It was downloaded from MySql site.
>>
>> On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
>> wrote:
>>
>>> Akhil
>>> you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
>>> . how come you got this lib into spark/lib folder.
>>> 1) did you place it there ?
>>> 2) What is download location ?
>>>
>>>
>>> On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist  wrote:
>>>
 Started the spark shell with the one jar from hive suggested:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

 Results in the same error:

 scala> sql( | """SELECT path, name, value, v1.peValue, v1.peName   
   |  FROM metric_table |lateral view 
 json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
 peValue | """)
 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(pathElements, 'name', 'value') v1 as peName, 
 peValue
 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
 res2: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan 
 ==
 java.lang.ClassNotFoundException: json_tuple

 I will try the rebuild.  Thanks again for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das 
 wrote:

> Can you try building Spark
> 
> with hive support? Before that try to run the following:
>
> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
> cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-
> 5.1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar
>
> Thanks
> Best Regards
>
> On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist  wrote:
>
>> Hi Akhil,
>>
>> This is for version 1.2.1.  Well the other thread that you reference
>> was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  
>> I
>> did not build Spark but used the version from the Spark download site for
>> 1.2.1 Pre Built for Hadoop 2.4 or Later.
>>
>> Since I get the error in both 1.2.1 and 1.3.0,
>>
>> 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
>> thread "main" java.lang.ClassNotFoundException: json_tuple at
>> java.net.URLClassLoader$1.run(
>>
>> It looks like I just don't have the jar.  Even including all jars in
>> the $HIVE/lib directory did not seem to work.  Though when looking in
>> $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
>> see that hive-exec.jar contains
>> the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
>> you know if there is another Jar that is required or should it work just 
>> by
>> including all jars from $HIVE/lib?
>>
>> I can build it locally, but did not think that was required based on
>> the version I downloaded; is that not the case?
>>
>> Thanks for the assistance.
>>
>> -Todd
>>
>>
>> On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das > > wrote:
>>
>>> How did you build spark? which version of spark are you having?
>>> Doesn't this thread already explains it?
>>> https://www.mail-archive.com/user@spark.apache.org/msg25505.html
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist 
>>> wrote:
>>>
 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and
 "jackson" or "json serde" jars in the $HIVE/lib directory.  This is 
 hive
 0.13.1 and spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the

Reply: maven compile error

2015-04-03 Thread myelinji
Thank you for your reply. When I'm using Maven to compile the whole project, 
the errors are as follows:

[INFO] Spark Project Parent POM .. SUCCESS [4.136s]
[INFO] Spark Project Networking .. SUCCESS [7.405s]
[INFO] Spark Project Shuffle Streaming Service ... SUCCESS [5.071s]
[INFO] Spark Project Core  SUCCESS [3:08.445s]
[INFO] Spark Project Bagel ... SUCCESS [21.613s]
[INFO] Spark Project GraphX .. SUCCESS [58.915s]
[INFO] Spark Project Streaming ... SUCCESS [1:26.202s]
[INFO] Spark Project Catalyst  FAILURE [1.537s]
[INFO] Spark Project SQL . SKIPPED
[INFO] Spark Project ML Library .. SKIPPED
[INFO] Spark Project Tools ... SKIPPED
[INFO] Spark Project Hive  SKIPPED
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Flume Sink . SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project Examples  SKIPPED
It seems like there is something wrong with the catalyst project. Why can't I 
compile this project?

--
From: Sean Owen
Sent: Friday, April 3, 2015 17:48
To: myelinji
Cc: spark user group
Subject: Re: maven compile error
If you're asking about a compile error, you should include the command
you used to compile.

I am able to compile branch 1.2 successfully with "mvn -DskipTests
clean package".

This error is actually an error from scalac, not a compile error from
the code. It sort of sounds like it has not been able to download
scala dependencies. Check or maybe recreate your environment.

On Fri, Apr 3, 2015 at 3:19 AM, myelinji  wrote:
> Hi, all:
> Just now I checked out spark-1.2 on GitHub and wanted to build it with Maven;
> however, I encountered an error during compiling:
>
> [INFO]
> 
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on
> project spark-catalyst_2.10: wrap:
> scala.reflect.internal.MissingRequirementError: object scala.runtime in
> compiler mirror not found. -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile
> (scala-compile-first) on project spark-catalyst_2.10: wrap:
> scala.reflect.internal.MissingRequirementError: object scala.runtime in
> compiler mirror not found.
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
> at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
> Caused by: org.apache.maven.plugin.MojoExecutionException: wrap:
> scala.reflect.internal.MissingRequirementError: object scala.runtime in
> compiler mirror not found.
> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:490)
> at
> org.apache.maven.plugin.DefaultBuildPluginMan

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread ๏̯͡๏
I think you need to include, through the --jars option, the jar file that
contains the Hive definition (code) of the UDF json_tuple. That should solve
your problem.

On Fri, Apr 3, 2015 at 3:57 PM, Todd Nist  wrote:

> I placed it there.  It was downloaded from MySql site.
>
> On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>
>> Akhil
>> you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
>> how come you got this lib into spark/lib folder.
>> 1) did you place it there ?
>> 2) What is download location ?
>>
>>
>> On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist  wrote:
>>
>>> Started the spark shell with the one jar from hive suggested:
>>>
>>> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
>>> --driver-class-path 
>>> /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
>>> /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar
>>>
>>> Results in the same error:
>>>
>>> scala> sql( | """SELECT path, name, value, v1.peValue, v1.peName
>>>  |  FROM metric_table |lateral view 
>>> json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
>>> peValue | """)
>>> 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
>>> value, v1.peValue, v1.peName FROM metric_table   lateral 
>>> view json_tuple(pathElements, 'name', 'value') v1 as peName, 
>>> peValue
>>> 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
>>> res2: org.apache.spark.sql.SchemaRDD =
>>> SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan 
>>> ==
>>> java.lang.ClassNotFoundException: json_tuple
>>>
>>> I will try the rebuild.  Thanks again for the assistance.
>>>
>>> -Todd
>>>
>>>
>>> On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das 
>>> wrote:
>>>
 Can you try building Spark
 
 with hive support? Before that try to run the following:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
 cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5
 .1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar

 Thanks
 Best Regards

 On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist  wrote:

> Hi Akhil,
>
> This is for version 1.2.1.  Well the other thread that you reference
> was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I
> did not build Spark but used the version from the Spark download site for
> 1.2.1 Pre Built for Hadoop 2.4 or Later.
>
> Since I get the error in both 1.2.1 and 1.3.0,
>
> 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
> thread "main" java.lang.ClassNotFoundException: json_tuple at
> java.net.URLClassLoader$1.run(
>
> It looks like I just don't have the jar.  Even including all jars in
> the $HIVE/lib directory did not seem to work.  Though when looking in
> $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
> see that hive-exec.jar contains
> the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
> you know if there is another Jar that is required or should it work just 
> by
> including all jars from $HIVE/lib?
>
> I can build it locally, but did not think that was required based on
> the version I downloaded; is that not the case?
>
> Thanks for the assistance.
>
> -Todd
>
>
> On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das 
> wrote:
>
>> How did you build spark? which version of spark are you having?
>> Doesn't this thread already explains it?
>> https://www.mail-archive.com/user@spark.apache.org/msg25505.html
>>
>> Thanks
>> Best Regards
>>
>> On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist 
>> wrote:
>>
>>> Hi Akhil,
>>>
>>> Tried your suggestion to no avail.  I actually to not see and
>>> "jackson" or "json serde" jars in the $HIVE/lib directory.  This is hive
>>> 0.13.1 and spark 1.2.1
>>>
>>> Here is what I did:
>>>
>>> I have added the lib folder to the –jars option when starting the
>>> spark-shell,
>>> but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
>>> directory.
>>>
>>> I start the spark-shell as follows:
>>>
>>> ./bin/spark-shell --master spark://radtech.io:7077 
>>> --total-executor-cores 2 --driver-class-path 
>>> /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
>>>
>>> and like this
>>>
>>> ./bin/spark-shell --master spark://radtech.io:7077 
>>> --total-executor-cores 2 --driver-class-path 
>>> /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
>>> /opt/hive/0.13.1/lib/*
>>>
>>> I’m just doing this in the spark-shell now:
>>>
>>> import org.apach

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
I placed it there.  It was downloaded from MySql site.

On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Akhil
> you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
> how come you got this lib into spark/lib folder.
> 1) did you place it there ?
> 2) What is download location ?
>
>
> On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist  wrote:
>
>> Started the spark shell with the one jar from hive suggested:
>>
>> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
>> --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
>> --jars /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar
>>
>> Results in the same error:
>>
>> scala> sql( | """SELECT path, name, value, v1.peValue, v1.peName 
>> |  FROM metric_table |lateral view 
>> json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
>> peValue | """)
>> 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
>> value, v1.peValue, v1.peName FROM metric_table   lateral 
>> view json_tuple(pathElements, 'name', 'value') v1 as peName, 
>> peValue
>> 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
>> res2: org.apache.spark.sql.SchemaRDD =
>> SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan ==
>> java.lang.ClassNotFoundException: json_tuple
>>
>> I will try the rebuild.  Thanks again for the assistance.
>>
>> -Todd
>>
>>
>> On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das 
>> wrote:
>>
>>> Can you try building Spark
>>> 
>>> with hive support? Before that try to run the following:
>>>
>>> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
>>> cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.
>>> 1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist  wrote:
>>>
 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread "main" java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das 
 wrote:

> How did you build spark? which version of spark are you having?
> Doesn't this thread already explains it?
> https://www.mail-archive.com/user@spark.apache.org/msg25505.html
>
> Thanks
> Best Regards
>
> On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist  wrote:
>
>> Hi Akhil,
>>
>> Tried your suggestion to no avail.  I actually to not see and
>> "jackson" or "json serde" jars in the $HIVE/lib directory.  This is hive
>> 0.13.1 and spark 1.2.1
>>
>> Here is what I did:
>>
>> I have added the lib folder to the –jars option when starting the
>> spark-shell,
>> but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
>> directory.
>>
>> I start the spark-shell as follows:
>>
>> ./bin/spark-shell --master spark://radtech.io:7077 
>> --total-executor-cores 2 --driver-class-path 
>> /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
>>
>> and like this
>>
>> ./bin/spark-shell --master spark://radtech.io:7077 
>> --total-executor-cores 2 --driver-class-path 
>> /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
>> /opt/hive/0.13.1/lib/*
>>
>> I’m just doing this in the spark-shell now:
>>
>> import org.apache.spark.sql.hive._val sqlContext = new 
>> HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
>> pathElements: String, name: String, value: String)val mt = new 
>> MetricTable("""path": "/DC1/HOST1/""",
>> """pathElements": [{"node": "DataCenter","value": "DC1"},{"node": 
>> "host","value": "HOST1"}]""",
>> 

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Akhil Das
Copy pasted his command in the same thread.

Thanks
Best Regards

On Fri, Apr 3, 2015 at 3:55 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Akhil
> you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
> how come you got this lib into spark/lib folder.
> 1) did you place it there ?
> 2) What is download location ?
>
>
> On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist  wrote:
>
>> Started the spark shell with the one jar from hive suggested:
>>
>> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
>> --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
>> --jars /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar
>>
>> Results in the same error:
>>
>> scala> sql( | """SELECT path, name, value, v1.peValue, v1.peName 
>> |  FROM metric_table |lateral view 
>> json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
>> peValue | """)
>> 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
>> value, v1.peValue, v1.peName FROM metric_table   lateral 
>> view json_tuple(pathElements, 'name', 'value') v1 as peName, 
>> peValue
>> 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
>> res2: org.apache.spark.sql.SchemaRDD =
>> SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan ==
>> java.lang.ClassNotFoundException: json_tuple
>>
>> I will try the rebuild.  Thanks again for the assistance.
>>
>> -Todd
>>
>>
>> On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das 
>> wrote:
>>
>>> Can you try building Spark
>>> 
>>> with hive support? Before that try to run the following:
>>>
>>> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-
>>> cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.
>>> 1.34-bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist  wrote:
>>>
 Hi Akhil,

 This is for version 1.2.1.  Well the other thread that you reference
 was me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I
 did not build Spark but used the version from the Spark download site for
 1.2.1 Pre Built for Hadoop 2.4 or Later.

 Since I get the error in both 1.2.1 and 1.3.0,

 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in
 thread "main" java.lang.ClassNotFoundException: json_tuple at
 java.net.URLClassLoader$1.run(

 It looks like I just don't have the jar.  Even including all jars in
 the $HIVE/lib directory did not seem to work.  Though when looking in
 $HIVE/lib for 0.13.1, I do not see any json serde or jackson files.  I do
 see that hive-exec.jar contains
 the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
 you know if there is another Jar that is required or should it work just by
 including all jars from $HIVE/lib?

 I can build it locally, but did not think that was required based on
 the version I downloaded; is that not the case?

 Thanks for the assistance.

 -Todd


 On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das 
 wrote:

> How did you build spark? which version of spark are you having?
> Doesn't this thread already explains it?
> https://www.mail-archive.com/user@spark.apache.org/msg25505.html
>
> Thanks
> Best Regards
>
> On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist  wrote:
>
>> Hi Akhil,
>>
>> Tried your suggestion to no avail.  I actually to not see and
>> "jackson" or "json serde" jars in the $HIVE/lib directory.  This is hive
>> 0.13.1 and spark 1.2.1
>>
>> Here is what I did:
>>
>> I have added the lib folder to the –jars option when starting the
>> spark-shell,
>> but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
>> directory.
>>
>> I start the spark-shell as follows:
>>
>> ./bin/spark-shell --master spark://radtech.io:7077 
>> --total-executor-cores 2 --driver-class-path 
>> /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
>>
>> and like this
>>
>> ./bin/spark-shell --master spark://radtech.io:7077 
>> --total-executor-cores 2 --driver-class-path 
>> /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
>> /opt/hive/0.13.1/lib/*
>>
>> I’m just doing this in the spark-shell now:
>>
>> import org.apache.spark.sql.hive._val sqlContext = new 
>> HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
>> pathElements: String, name: String, value: String)val mt = new 
>> MetricTable("""path": "/DC1/HOST1/""",
>> """pathElements": [{"node": "DataCenter","value": "DC1"},{"node": 
>> "host","value": "HOST1"}]""

Spark Job Failed - Class not serializable

2015-04-03 Thread ๏̯͡๏
My Spark Job failed with


15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed:
saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception:
Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 0) had a not
serializable result:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
Serialization stack:
- object not serializable (class:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
{"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt":
null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0,
"spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt":
null, "currPsLvlId": null}))
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0
in stage 2.0 (TID 0) had a not serializable result:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
Serialization stack:
- object not serializable (class:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum, value:
{"userId": 0, "spsPrgrmId": 0, "spsSlrLevelCd": 0, "spsSlrLevelSumStartDt":
null, "spsSlrLevelSumEndDt": null, "currPsLvlId": null})
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (0,{"userId": 0, "spsPrgrmId": 0,
"spsSlrLevelCd": 0, "spsSlrLevelSumStartDt": null, "spsSlrLevelSumEndDt":
null, "currPsLvlId": null}))
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/04/03 03:15:36 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception: Job aborted due to stage
failure: Task 0.0 in stage 2.0 (TID 0) had a not serializable result:
com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum
Serialization stack:



com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum is auto
generated through an Avro schema using the avro-generate-sources maven plugin.


package com.ebay.ep.poc.spark.reporting.process.model.dw;

@SuppressWarnings("all")

@org.apache.avro.specific.AvroGenerated

public class SpsLevelMetricSum extends
org.apache.avro.specific.SpecificRecordBase implements
org.apache.avro.specific.SpecificRecord {
...
...
}

Can anyone suggest how to fix this ?



-- 
Deepak


Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread ๏̯͡๏
Akhil,
you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar.
How did this lib get into the spark/lib folder?
1) Did you place it there?
2) What is the download location?


On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist  wrote:

> Started the spark shell with the one jar from hive suggested:
>
> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
> --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
> --jars /opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar
>
> Results in the same error:
>
> scala> sql( | """SELECT path, name, value, v1.peValue, v1.peName 
> |  FROM metric_table |lateral view 
> json_tuple(pathElements, 'name', 'value') v1 |  as peName, 
> peValue | """)
> 15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, 
> value, v1.peValue, v1.peName FROM metric_table   lateral view 
> json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
> 15/04/03 06:01:31 INFO ParseDriver: Parse Completed
> res2: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan ==
> java.lang.ClassNotFoundException: json_tuple
>
> I will try the rebuild.  Thanks again for the assistance.
>
> -Todd
>
>
> On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das 
> wrote:
>
>> Can you try building Spark
>> 
>> with hive support? Before that try to run the following:
>>
>> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores
>> 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-
>> bin.jar --jars /opt/hive/0.13.1/lib/hive-exec.jar
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist  wrote:
>>
>>> Hi Akhil,
>>>
>>> This is for version 1.2.1.  Well the other thread that you reference was
>>> me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I did
>>> not build Spark but used the version from the Spark download site for 1.2.1
>>> Pre Built for Hadoop 2.4 or Later.
>>>
>>> Since I get the error in both 1.2.1 and 1.3.0,
>>>
>>> 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in thread
>>> "main" java.lang.ClassNotFoundException: json_tuple at
>>> java.net.URLClassLoader$1.run(
>>>
>>> It looks like I just don't have the jar.  Even including all jars in the
>>> $HIVE/lib directory did not seem to work.  Though when looking in $HIVE/lib
>>> for 0.13.1, I do not see any json serde or jackson files.  I do see that
>>> hive-exec.jar contains
>>> the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
>>> you know if there is another Jar that is required or should it work just by
>>> including all jars from $HIVE/lib?
>>>
>>> I can build it locally, but did not think that was required based on the
>>> version I downloaded; is that not the case?
>>>
>>> Thanks for the assistance.
>>>
>>> -Todd
>>>
>>>
>>> On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das 
>>> wrote:
>>>
 How did you build spark? which version of spark are you having? Doesn't
 this thread already explains it?
 https://www.mail-archive.com/user@spark.apache.org/msg25505.html

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist  wrote:

> Hi Akhil,
>
> Tried your suggestion to no avail.  I actually to not see and
> "jackson" or "json serde" jars in the $HIVE/lib directory.  This is hive
> 0.13.1 and spark 1.2.1
>
> Here is what I did:
>
> I have added the lib folder to the –jars option when starting the
> spark-shell,
> but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
> directory.
>
> I start the spark-shell as follows:
>
> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
> 2 --driver-class-path 
> /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
>
> and like this
>
> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
> 2 --driver-class-path 
> /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
> /opt/hive/0.13.1/lib/*
>
> I’m just doing this in the spark-shell now:
>
> import org.apache.spark.sql.hive._val sqlContext = new 
> HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
> pathElements: String, name: String, value: String)val mt = new 
> MetricTable("""path": "/DC1/HOST1/""",
> """pathElements": [{"node": "DataCenter","value": "DC1"},{"node": 
> "host","value": "HOST1"}]""",
> """name": "Memory Usage (%)""",
> """value": 29.590943279257175""")val rdd1 = sc.makeRDD(List(mt))
> rdd1.printSchema()
> rdd1.registerTempTable("metric_table")
> sql(
> """SELECT path, name, value, v1.peValue, v1.pe

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Started the spark shell with the one jar from hive suggested:

./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars
/opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar

Results in the same error:

scala> sql(
     |   """SELECT path, name, value, v1.peValue, v1.peName
     |        FROM metric_table
     |      lateral view json_tuple(pathElements, 'name', 'value') v1
     |        as peName, peValue
     |   """)
15/04/03 06:01:30 INFO ParseDriver: Parsing command: SELECT path, name, value, v1.peValue, v1.peName FROM metric_table lateral view json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
15/04/03 06:01:31 INFO ParseDriver: Parse Completed
res2: org.apache.spark.sql.SchemaRDD =
SchemaRDD[5] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
java.lang.ClassNotFoundException: json_tuple

I will try the rebuild.  Thanks again for the assistance.

-Todd


On Fri, Apr 3, 2015 at 5:34 AM, Akhil Das 
wrote:

> Can you try building Spark
> 
> with hive support? Before that try to run the following:
>
> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores
> 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin
> .jar --jars /opt/hive/0.13.1/lib/hive-exec.jar
>
> Thanks
> Best Regards
>
> On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist  wrote:
>
>> Hi Akhil,
>>
>> This is for version 1.2.1.  Well the other thread that you reference was
>> me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I did
>> not build Spark but used the version from the Spark download site for 1.2.1
>> Pre Built for Hadoop 2.4 or Later.
>>
>> Since I get the error in both 1.2.1 and 1.3.0,
>>
>> 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in thread
>> "main" java.lang.ClassNotFoundException: json_tuple at
>> java.net.URLClassLoader$1.run(
>>
>> It looks like I just don't have the jar.  Even including all jars in the
>> $HIVE/lib directory did not seem to work.  Though when looking in $HIVE/lib
>> for 0.13.1, I do not see any json serde or jackson files.  I do see that
>> hive-exec.jar contains
>> the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
>> you know if there is another Jar that is required or should it work just by
>> including all jars from $HIVE/lib?
>>
>> I can build it locally, but did not think that was required based on the
>> version I downloaded; is that not the case?
>>
>> Thanks for the assistance.
>>
>> -Todd
>>
>>
>> On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das 
>> wrote:
>>
>>> How did you build spark? which version of spark are you having? Doesn't
>>> this thread already explains it?
>>> https://www.mail-archive.com/user@spark.apache.org/msg25505.html
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist  wrote:
>>>
 Hi Akhil,

 Tried your suggestion to no avail.  I actually to not see and "jackson"
 or "json serde" jars in the $HIVE/lib directory.  This is hive 0.13.1 and
 spark 1.2.1

 Here is what I did:

 I have added the lib folder to the –jars option when starting the
 spark-shell,
 but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
 directory.

 I start the spark-shell as follows:

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

 and like this

 ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 
 2 --driver-class-path 
 /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
 /opt/hive/0.13.1/lib/*

 I’m just doing this in the spark-shell now:

 import org.apache.spark.sql.hive._val sqlContext = new 
 HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
 pathElements: String, name: String, value: String)val mt = new 
 MetricTable("""path": "/DC1/HOST1/""",
 """pathElements": [{"node": "DataCenter","value": "DC1"},{"node": 
 "host","value": "HOST1"}]""",
 """name": "Memory Usage (%)""",
 """value": 29.590943279257175""")val rdd1 = sc.makeRDD(List(mt))
 rdd1.printSchema()
 rdd1.registerTempTable("metric_table")
 sql(
 """SELECT path, name, value, v1.peValue, v1.peName
  FROM metric_table
lateral view json_tuple(pathElements, 'name', 'value') v1
  as peName, peValue
 """)
 .collect.foreach(println(_))

 It results in the same error:

 15/04/02 12:33:59 INFO ParseDriver: Parsing command: SELECT path, name, 
 value, v1.peValue, v1.peName FROM metric_table   lateral 
 view json_tuple(p

Re: Data locality across jobs

2015-04-03 Thread Ajay Srivastava
You can read the same partition from every hour's output, union these RDDs, and then 
repartition them as a single partition. This will be done for all partitions one by 
one. It may not necessarily improve performance; that will depend on the size of the 
spills in the job when all the data was processed together.
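
In code, the co-partitioning part of that looks roughly like the sketch below (paths, key/value types and the partition count are made up; note the daily job still shuffles once, because Spark cannot assume that files read back from HDFS preserve the hourly partitioning):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._                 // pair-RDD functions on Spark 1.2.x

val partitioner = new HashPartitioner(128)             // both jobs must agree on this value

// Hourly job: partition by key before writing
val hourlyPairs = sc.parallelize(Seq(("key1", "v1"), ("key2", "v2")))
hourlyPairs.partitionBy(partitioner).saveAsSequenceFile("/data/output/hour=00")

// Daily job: read every hour back and re-apply the same partitioner
val allHours = sc.union((0 until 24).map(h =>
  sc.sequenceFile[String, String](f"/data/output/hour=$h%02d")))
val coPartitioned = allHours.partitionBy(partitioner)  // one shuffle, then keys are co-located
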
Regards,
Ajay
 


 On Friday, April 3, 2015 2:01 AM, Sandy Ryza  
wrote:
   

 This isn't currently a capability that Spark has, though it has definitely 
been discussed: https://issues.apache.org/jira/browse/SPARK-1061.  The primary 
obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that 
each file corresponds to a single split, so the records corresponding to a 
particular partition at the end of the first job can end up split across 
multiple partitions in the second job.
-Sandy
On Wed, Apr 1, 2015 at 9:09 PM, kjsingh  wrote:

Hi,

We are running an hourly job using Spark 1.2 on Yarn. It saves an RDD of
Tuple2. At the end of day, a daily job is launched, which works on the
outputs of the hourly jobs.

For data locality and speed, we wish that when the daily job launches, it
finds all instances of a given key at a single executor rather than fetching
it from others during shuffle.

Is it possible to maintain key partitioning across jobs? We can control
partitioning in one job. But how do we send keys to the executors of same
node manager across jobs? And while saving data to HDFS, are the blocks
allocated to the same data node machine as the executor for a partition?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Data-locality-across-jobs-tp22351.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





  

Re: maven compile error

2015-04-03 Thread Sean Owen
If you're asking about a compile error, you should include the command
you used to compile.

I am able to compile branch 1.2 successfully with "mvn -DskipTests
clean package".

This error is actually an error from scalac, not a compile error from
the code. It sort of sounds like it has not been able to download
scala dependencies. Check or maybe recreate your environment.

On Fri, Apr 3, 2015 at 3:19 AM, myelinji  wrote:
> Hi, all:
> Just now I checked out spark-1.2 on GitHub and wanted to build it with Maven;
> however, I encountered an error during compiling:
>
> [INFO]
> 
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on
> project spark-catalyst_2.10: wrap:
> scala.reflect.internal.MissingRequirementError: object scala.runtime in
> compiler mirror not found. -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile
> (scala-compile-first) on project spark-catalyst_2.10: wrap:
> scala.reflect.internal.MissingRequirementError: object scala.runtime in
> compiler mirror not found.
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
> at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
> Caused by: org.apache.maven.plugin.MojoExecutionException: wrap:
> scala.reflect.internal.MissingRequirementError: object scala.runtime in
> compiler mirror not found.
> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:490)
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
> ... 19 more
> Caused by: scala.reflect.internal.MissingRequirementError: object
> scala.runtime in compiler mirror not found.
> at
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
> at
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
> at
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> at
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> at
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> at scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:172)
> at
> scala.reflect.internal.Mirrors$RootsBase.getRequiredPackage(Mirrors.scala:175)
> at
> scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:183)
> at
> scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:183)
> at
> scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:184)
> at
> scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:184)
> at
> scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1024)
> at
> scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1023)
> at
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1153)
> at
> scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152)
> at
> scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Akhil Das
Can you try building Spark

with hive support? Before that try to run the following:

./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2
--driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
--jars /opt/hive/0.13.1/lib/hive-exec.jar
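
If rebuilding isn't convenient, the lateral view can also be side-stepped by parsing the JSON in plain Scala and flattening it with flatMap. A rough, self-contained sketch (the sample rows below are made up, not your data):

import scala.util.parsing.json.JSON

// Hypothetical (path, jsonArray) rows standing in for the pathElements column
val sample = sc.parallelize(Seq(
  ("/DC1/HOST1/", """[{"node": "DataCenter", "value": "DC1"}, {"node": "host", "value": "HOST1"}]""")))

val exploded = sample.flatMap { case (path, json) =>
  JSON.parseFull(json) match {
    case Some(elems) =>
      // parseFull returns this kind of JSON array as List[Map[String, Any]]
      elems.asInstanceOf[List[Map[String, Any]]].map { fields =>
        (path, fields.getOrElse("node", ""), fields.getOrElse("value", ""))
      }
    case None => Nil
  }
}
exploded.collect().foreach(println)   // (/DC1/HOST1/,DataCenter,DC1), (/DC1/HOST1/,host,HOST1)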

Thanks
Best Regards

On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist  wrote:

> Hi Akhil,
>
> This is for version 1.2.1.  Well the other thread that you reference was
> me attempting it in 1.3.0 to see if the issue was related to 1.2.1.  I did
> not build Spark but used the version from the Spark download site for 1.2.1
> Pre Built for Hadoop 2.4 or Later.
>
> Since I get the error in both 1.2.1 and 1.3.0,
>
> 15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in thread
> "main" java.lang.ClassNotFoundException: json_tuple at
> java.net.URLClassLoader$1.run(
>
> It looks like I just don't have the jar.  Even including all jars in the
> $HIVE/lib directory did not seem to work.  Though when looking in $HIVE/lib
> for 0.13.1, I do not see any json serde or jackson files.  I do see that
> hive-exec.jar contains
> the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
> you know if there is another Jar that is required or should it work just by
> including all jars from $HIVE/lib?
>
> I can build it locally, but did not think that was required based on the
> version I downloaded; is that not the case?
>
> Thanks for the assistance.
>
> -Todd
>
>
> On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das 
> wrote:
>
>> How did you build spark? which version of spark are you having? Doesn't
>> this thread already explains it?
>> https://www.mail-archive.com/user@spark.apache.org/msg25505.html
>>
>> Thanks
>> Best Regards
>>
>> On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist  wrote:
>>
>>> Hi Akhil,
>>>
>>> Tried your suggestion to no avail.  I actually to not see and "jackson"
>>> or "json serde" jars in the $HIVE/lib directory.  This is hive 0.13.1 and
>>> spark 1.2.1
>>>
>>> Here is what I did:
>>>
>>> I have added the lib folder to the –jars option when starting the
>>> spark-shell,
>>> but the job fails. The hive-site.xml is in the $SPARK_HOME/conf
>>> directory.
>>>
>>> I start the spark-shell as follows:
>>>
>>> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
>>> --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
>>>
>>> and like this
>>>
>>> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
>>> --driver-class-path 
>>> /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars 
>>> /opt/hive/0.13.1/lib/*
>>>
>>> I’m just doing this in the spark-shell now:
>>>
>>> import org.apache.spark.sql.hive._val sqlContext = new 
>>> HiveContext(sc)import sqlContext._case class MetricTable(path: String, 
>>> pathElements: String, name: String, value: String)val mt = new 
>>> MetricTable("""path": "/DC1/HOST1/""",
>>> """pathElements": [{"node": "DataCenter","value": "DC1"},{"node": 
>>> "host","value": "HOST1"}]""",
>>> """name": "Memory Usage (%)""",
>>> """value": 29.590943279257175""")val rdd1 = sc.makeRDD(List(mt))
>>> rdd1.printSchema()
>>> rdd1.registerTempTable("metric_table")
>>> sql(
>>> """SELECT path, name, value, v1.peValue, v1.peName
>>>  FROM metric_table
>>>lateral view json_tuple(pathElements, 'name', 'value') v1
>>>  as peName, peValue
>>> """)
>>> .collect.foreach(println(_))
>>>
>>> It results in the same error:
>>>
>>> 15/04/02 12:33:59 INFO ParseDriver: Parsing command: SELECT path, name, 
>>> value, v1.peValue, v1.peName FROM metric_table   lateral 
>>> view json_tuple(pathElements, 'name', 'value') v1 as peName, 
>>> peValue
>>> 15/04/02 12:34:00 INFO ParseDriver: Parse Completed
>>> res2: org.apache.spark.sql.SchemaRDD =
>>> SchemaRDD[5] at RDD at SchemaRDD.scala:108== Query Plan  Physical Plan 
>>> ==
>>> java.lang.ClassNotFoundException: json_tuple
>>>
>>> Any other suggestions or am I doing something else wrong here?
>>>
>>> -Todd
>>>
>>>
>>>
>>> On Thu, Apr 2, 2015 at 2:00 AM, Akhil Das 
>>> wrote:
>>>
 Try adding all the jars in your $HIVE/lib directory. If you want the
 specific jar, you could look fr jackson or json serde in it.

 Thanks
 Best Regards

 On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist  wrote:

> I have a feeling I’m missing a Jar that provides the support, or this
> may be related to
> https://issues.apache.org/jira/browse/SPARK-5792. If it is a Jar
> where would I find that ? I would have thought in the $HIVE/lib folder, 
> but
> not sure which jar contains it.
>
> Error:
>
> Create Metric Temporary Table for querying
> 15/04/01 14:41:44 INFO HiveMetaStore: 0: Opening raw store with implemenation 
> class:org.apa

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Hi Akhil,

This is for version 1.2.1.  The other thread that you reference was me
attempting it in 1.3.0 to see if the issue was specific to 1.2.1.  I did not
build Spark; I used the 1.2.1 version from the Spark download site, pre-built
for Hadoop 2.4 and later.

Since I get the error in both 1.2.1 and 1.3.0,

15/04/01 14:41:49 INFO ParseDriver: Parse Completed Exception in thread
"main" java.lang.ClassNotFoundException: json_tuple at
java.net.URLClassLoader$1.run(

It looks like I just don't have the jar.  Even including all jars in the
$HIVE/lib directory did not seem to work, though when looking in $HIVE/lib
for 0.13.1, I do not see any json serde or jackson files.  I do see that
hive-exec.jar contains
the org/apache/hadoop/hive/ql/udf/generic/GenericUDTFJSONTuple class.  Do
you know if there is another Jar that is required or should it work just by
including all jars from $HIVE/lib?

I can build it locally, but did not think that was required based on the
version I downloaded; is that not the case?
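
One more thing I may try, though it is just a guess at this point: --jars
expects a comma-separated list of jars, so a shell glob like
/opt/hive/0.13.1/lib/* may not hand Spark everything I expect. And since
GenericUDTFJSONTuple lives in hive-exec.jar, another option would be to make
that one jar visible to the HiveContext session and register the function by
hand. A rough sketch of what I have in mind, untested against 1.2.1, with the
jar name/path assumed:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Make the jar carrying the Hive UDTF available to the session (path is an assumption).
hiveContext.sql("ADD JAR /opt/hive/0.13.1/lib/hive-exec-0.13.1.jar")
// Register json_tuple explicitly against the class noted above.
hiveContext.sql(
  "CREATE TEMPORARY FUNCTION json_tuple AS " +
  "'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple'")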

Thanks for the assistance.

-Todd


On Fri, Apr 3, 2015 at 2:06 AM, Akhil Das 
wrote:

> How did you build Spark? Which version of Spark are you using? Doesn't
> this thread already explain it?
> https://www.mail-archive.com/user@spark.apache.org/msg25505.html
>
> Thanks
> Best Regards
>
> On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist  wrote:
>
>> Hi Akhil,
>>
>> Tried your suggestion to no avail.  I actually do not see any "jackson"
>> or "json serde" jars in the $HIVE/lib directory.  This is Hive 0.13.1 and
>> Spark 1.2.1.
>>
>> Here is what I did:
>>
>> I have added the lib folder to the --jars option when starting the
>> spark-shell,
>> but the job fails. The hive-site.xml is in the $SPARK_HOME/conf directory.
>>
>> I start the spark-shell as follows:
>>
>> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
>> --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
>>
>> and like this
>>
>> ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 
>> --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar 
>> --jars /opt/hive/0.13.1/lib/*
>>
>> I’m just doing this in the spark-shell now:
>>
>> import org.apache.spark.sql.hive._
>> val sqlContext = new HiveContext(sc)
>> import sqlContext._
>> case class MetricTable(path: String, pathElements: String, name: String, value: String)
>> val mt = new MetricTable("""path": "/DC1/HOST1/""",
>> """pathElements": [{"node": "DataCenter","value": "DC1"},{"node": "host","value": "HOST1"}]""",
>> """name": "Memory Usage (%)""",
>> """value": 29.590943279257175""")
>> val rdd1 = sc.makeRDD(List(mt))
>> rdd1.printSchema()
>> rdd1.registerTempTable("metric_table")
>> sql(
>> """SELECT path, name, value, v1.peValue, v1.peName
>>  FROM metric_table
>>lateral view json_tuple(pathElements, 'name', 'value') v1
>>  as peName, peValue
>> """)
>> .collect.foreach(println(_))
>>
>> It results in the same error:
>>
>> 15/04/02 12:33:59 INFO ParseDriver: Parsing command: SELECT path, name, 
>> value, v1.peValue, v1.peName FROM metric_table   lateral 
>> view json_tuple(pathElements, 'name', 'value') v1 as peName, 
>> peValue
>> 15/04/02 12:34:00 INFO ParseDriver: Parse Completed
>> res2: org.apache.spark.sql.SchemaRDD =
>> SchemaRDD[5] at RDD at SchemaRDD.scala:108
>> == Query Plan ==
>> == Physical Plan ==
>> java.lang.ClassNotFoundException: json_tuple
>>
>> Any other suggestions or am I doing something else wrong here?
>>
>> -Todd
>>
>>
>>
>> On Thu, Apr 2, 2015 at 2:00 AM, Akhil Das 
>> wrote:
>>
>>> Try adding all the jars in your $HIVE/lib directory. If you want the
>>> specific jar, you could look for jackson or json serde in it.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist  wrote:
>>>
 I have a feeling I’m missing a Jar that provides the support, or this
 may be related to https://issues.apache.org/jira/browse/SPARK-5792.
 If it is a Jar, where would I find it? I would have thought in the
 $HIVE/lib folder, but I'm not sure which jar contains it.

 Error:

 Create Metric Temporary Table for querying
 15/04/01 14:41:44 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/04/01 14:41:44 INFO ObjectStore: ObjectStore, initialize called
 15/04/01 14:41:45 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
 15/04/01 14:41:45 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
 15/04/01 14:41:45 INFO BlockManager: Removing broadcast 0
 15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0
 15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0 of size 1272 dropped from memory (free 278018571)
 15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0_piece0
 15/04/01 
>>

Re: Cannot run the example in the Spark 1.3.0 following the document

2015-04-03 Thread Sean Owen
(That one was already fixed last week, and so should be updated when
the site updates for 1.3.1.)

On Fri, Apr 3, 2015 at 4:59 AM, Michael Armbrust  wrote:
> Looks like a typo, try:
>
> df.select(df("name"), df("age") + 1)
>
> Or
>
> df.select("name", "age")
>
> PRs to fix docs are always appreciated :)
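
For anyone following along, a self-contained version of those two select forms;
the Person case class and the toy rows here are made up purely for
illustration, and this assumes the spark-shell's SparkContext `sc`:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class Person(name: String, age: Int)
val df = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25))).toDF()

// Column-expression form: lets you compute df("age") + 1.
df.select(df("name"), df("age") + 1).show()
// Plain column-name form: just projects existing columns.
df.select("name", "age").show()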

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-03 Thread Jaonary Rabarisoa
Good! Thank you.
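
In the meantime, a possible workaround suggested by the tests below themselves
is to stay on the case-class + toDF path (which saved fine) rather than going
through a Row RDD and createDataFrame. A rough sketch, assuming the same
LabeledPoint data, the spark-shell's `sc`, and a made-up output path:

import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val data   = Seq(LabeledPoint(1.0, Vectors.zeros(10)))
val dataDF = sc.parallelize(data).toDF()   // case-class path, not Row + createDataFrame
dataDF.save("test_workaround.parquet")     // the toDF-built frame saved OK in the test below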

On Thu, Apr 2, 2015 at 9:05 AM, Xiangrui Meng  wrote:

> I reproduced the bug on master and submitted a patch for it:
> https://github.com/apache/spark/pull/5329. It may get into Spark
> 1.3.1. Thanks for reporting the bug! -Xiangrui
>
> On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa 
> wrote:
> > Hmm, I got the same error with the master. Here is another test example
> that
> > fails. Here, I explicitly create
> > a Row RDD which corresponds to the use case I am in :
> >
> > object TestDataFrame {
> >
> >   def main(args: Array[String]): Unit = {
> >
> > val conf = new
> > SparkConf().setAppName("TestDataFrame").setMaster("local[4]")
> > val sc = new SparkContext(conf)
> > val sqlContext = new SQLContext(sc)
> >
> > import sqlContext.implicits._
> >
> > val data = Seq(LabeledPoint(1, Vectors.zeros(10)))
> > val dataDF = sc.parallelize(data).toDF
> >
> > dataDF.printSchema()
> > dataDF.save("test1.parquet") // OK
> >
> > val dataRow = data.map {case LabeledPoint(l: Double, f:
> > mllib.linalg.Vector)=>
> >   Row(l,f)
> > }
> >
> > val dataRowRDD = sc.parallelize(dataRow)
> > val dataDF2 = sqlContext.createDataFrame(dataRowRDD, dataDF.schema)
> >
> > dataDF2.printSchema()
> >
> > dataDF2.saveAsParquetFile("test3.parquet") // FAIL !!!
> >   }
> > }
> >
> >
> > On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng 
> wrote:
> >>
> >> I cannot reproduce this error on master, but I'm not aware of any
> >> recent bug fixes that are related. Could you build and try the current
> >> master? -Xiangrui
> >>
> >> On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa 
> >> wrote:
> >> > Hi all,
> >> >
> >> > DataFrame with an user defined type (here mllib.Vector) created with
> >> > sqlContex.createDataFrame can't be saved to parquet file and raise
> >> > ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot
> be
> >> > cast
> >> > to org.apache.spark.sql.Row error.
> >> >
> >> > Here is an example of code to reproduce this error :
> >> >
> >> > object TestDataFrame {
> >> >
> >> >   def main(args: Array[String]): Unit = {
> >> > //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
> >> > val conf = new
> >> > SparkConf().setAppName("RankingEval").setMaster("local[8]")
> >> >   .set("spark.executor.memory", "6g")
> >> >
> >> > val sc = new SparkContext(conf)
> >> > val sqlContext = new SQLContext(sc)
> >> >
> >> > import sqlContext.implicits._
> >> >
> >> > val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
> >> > val dataDF = data.toDF
> >> >
> >> > dataDF.save("test1.parquet")
> >> >
> >> > val dataDF2 = sqlContext.createDataFrame(dataDF.rdd,
> dataDF.schema)
> >> >
> >> > dataDF2.save("test2.parquet")
> >> >   }
> >> > }
> >> >
> >> >
> >> > Is this related to https://issues.apache.org/jira/browse/SPARK-5532
> and
> >> > how
> >> > can it be solved ?
> >> >
> >> >
> >> > Cheers,
> >> >
> >> >
> >> > Jao
> >
> >
>


About Waiting batches on the spark streaming UI

2015-04-03 Thread bit1...@163.com

I copied the following from the Spark Streaming UI. I don't know why the 
Waiting batches value is 1; my understanding is that it should be 72.
Here is my reasoning: 
1. Total time is 1 minute 35 seconds = 95 seconds.
2. Batch interval is 1 second, so 95 batches are generated in 95 seconds.
3. Processed batches are 23 (correct, because my processing code does 
nothing but sleep 4 seconds).
4. So the waiting batches should be 95 - 23 = 72.


Started at: Fri Apr 03 15:17:47 CST 2015 
Time since start: 1 minute 35 seconds 
Network receivers: 1 
Batch interval: 1 second 
Processed batches: 23 
Waiting batches: 1 
Received records: 0 
Processed records: 0   



bit1...@163.com


Re: Spark Application Stages and DAG

2015-04-03 Thread Tathagata Das
What he meant is to look it up in the Spark UI, specifically the Stage
tab, to see what is taking so long. And yes, a code snippet helps us debug.
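
If you want something from code as well, rdd.toDebugString prints the RDD
lineage, and the shuffle boundaries in that output correspond roughly to the
stage boundaries you see in the UI. A tiny sketch, with a made-up input path
and names:

// Placeholder path and names, just to show the call.
val words  = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// Prints the lineage; indentation changes mark shuffle (stage) boundaries.
println(counts.toDebugString)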

TD

On Fri, Apr 3, 2015 at 12:47 AM, Akhil Das 
wrote:

> You need to open the Stage's page which is taking time, and see how long it's
> spending on GC etc. It would also be good to post that Stage and its
> previous transformation's code snippet to help us understand it better.
>
> Thanks
> Best Regards
>
> On Fri, Apr 3, 2015 at 1:05 PM, Vijay Innamuri 
> wrote:
>
>>
>> When I run the Spark application (streaming) in local mode I can see
>> the execution progress as below:
>>
>> [Stage
>> 0:>
>> (1817 + 1) / 3125]
>> 
>> [Stage
>> 2:===>
>> (740 + 1) / 3125]
>>
>> One of the stages is taking a long time to execute.
>>
>> How do I find the transformations/actions associated with a particular
>> stage?
>> Is there any way to find the execution DAG of a Spark application?
>>
>> Regards
>> Vijay
>>
>
>


Re: Spark Application Stages and DAG

2015-04-03 Thread Akhil Das
You need to open the Stage's page which is taking time, and see how long it's
spending on GC etc. It would also be good to post that Stage and its
previous transformation's code snippet to help us understand it better.

Thanks
Best Regards

On Fri, Apr 3, 2015 at 1:05 PM, Vijay Innamuri 
wrote:

>
> When I run the Spark application (streaming) in local mode I can see the
> execution progress as below:
>
> [Stage
> 0:>
> (1817 + 1) / 3125]
> 
> [Stage
> 2:===>
> (740 + 1) / 3125]
>
> One of the stages is taking a long time to execute.
>
> How do I find the transformations/actions associated with a particular
> stage?
> Is there any way to find the execution DAG of a Spark application?
>
> Regards
> Vijay
>


Spark Application Stages and DAG

2015-04-03 Thread Vijay Innamuri
When I run the Spark application (streaming) in local mode I can see the
execution progress as below:

[Stage
0:>
(1817 + 1) / 3125]

[Stage
2:===>
(740 + 1) / 3125]

One of the stages is taking a long time to execute.

How do I find the transformations/actions associated with a particular stage?
Is there any way to find the execution DAG of a Spark application?

Regards
Vijay