Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Mike Trienis
<id>test</id> <goals> <goal>test</goal> </goals> </execution> </executions> </plugin> </plugins> On Tue, Aug 25, 2015 at 2:10 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hello

How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Mike Trienis
Hello, I am using sbt and created a unit test where I create a `HiveContext`, execute some query and then return. Each time I run the unit test, the JVM increases its memory usage until I get the error: Internal error when running tests: java.lang.OutOfMemoryError: PermGen space Exception
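A commonly suggested mitigation for this kind of PermGen exhaustion is to run the tests in a forked JVM with a larger permanent generation, so that classes loaded for each `HiveContext` do not accumulate in the long-lived sbt process. The `build.sbt` fragment below is a minimal sketch, assuming sbt 0.13 syntax and a Java 7 or earlier JVM (the `-XX:MaxPermSize` flag is ignored on Java 8 and later). The memory sizes are illustrative assumptions, not values taken from the thread.

```scala
// build.sbt (sketch; the sizes below are illustrative assumptions)

// Run tests in a separate JVM instead of inside the sbt process,
// so memory used by the tests is released when that JVM exits.
fork in Test := true

// Options passed only to the forked test JVM.
javaOptions in Test ++= Seq(
  "-Xmx2g",               // heap size for the forked JVM
  "-XX:MaxPermSize=256m"  // PermGen limit (Java 7 and earlier)
)
```

Note that `javaOptions` only takes effect when `fork` is enabled; without forking, the tests inherit the memory settings of the sbt launcher itself.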

Spark SQL window functions (RowsBetween)

2015-08-20 Thread Mike Trienis
Hi All, I would like some clarification regarding window functions for Apache Spark 1.4.0 - https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html In particular, the rowsBetween * {{{ * val w = Window.partitionBy(name).orderBy(id) * df.select(

Optimal way to implement a small lookup table for identifiers in an RDD

2015-08-10 Thread Mike Trienis
Hi All, I have an RDD of case class objects. scala> case class Entity( | value: String, | identifier: String | ) defined class Entity scala> Entity("hello", "id1") res25: Entity = Entity(hello,id1) During a map operation, I'd like to return a new RDD that contains all of the

Re: Data frames select and where clause dependency

2015-07-20 Thread Mike Trienis
[mailto:rhbutani.sp...@gmail.com] *Sent:* Monday, July 20, 2015 5:37 PM *To:* Mohammed Guller *Cc:* Michael Armbrust; Mike Trienis; user@spark.apache.org *Subject:* Re: Data frames select and where clause dependency Yes via: org.apache.spark.sql.catalyst.optimizer.ColumnPruning See

Data frames select and where clause dependency

2015-07-17 Thread Mike Trienis
I'd like to understand why the where field must exist in the select clause. For example, the following select statement works fine - df.select(field1, filter_field).filter(df(filter_field) === value).show() However, the next one fails with the error in operator !Filter (filter_field#60 =

Aggregating metrics using Cassandra and Spark streaming

2015-06-24 Thread Mike Trienis
Hello, I'd like to understand how other people have been aggregating metrics using Spark Streaming and a Cassandra database. Currently I have designed some data models that will store the rolled-up metrics. There are two models that I am considering: CREATE TABLE rollup_using_counters (

Re: Managing spark processes via supervisord

2015-06-05 Thread Mike Trienis
since they are usually foreground processes. With the master it's a bit more complicated: ./sbin/start-master.sh goes into the background, which is not good for supervisor, but anyway I think it's doable (going to set it up too in a few days) On 3 June 2015 at 21:46, Mike Trienis mike.trie...@orcsol.com wrote

Managing spark processes via supervisord

2015-06-03 Thread Mike Trienis
Hi All, I am curious to know if anyone has successfully deployed a Spark cluster using supervisord? - http://supervisord.org/ Currently I am using the cluster launch scripts, which are working great; however, every time I reboot my VM or development environment I need to re-launch the

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-23 Thread Mike Trienis
core, an executor is simply a JVM instance and as such it can be granted any number of cores and RAM. So check how many cores you have per executor. Sent from Samsung Mobile Original message From: Mike Trienis Date:2015/05/22 21:51 (GMT+00:00) To: user@spark.apache.org

Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
Hi All, I have a cluster of four nodes (three workers and one master, with one core each) which consumes data from Kinesis at 15 second intervals using two streams (i.e. receivers). The job simply grabs the latest batch and pushes it to MongoDB. I believe that the problem is that all tasks are

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
I guess each receiver occupies an executor. So there was only one executor available for processing the job. On Fri, May 22, 2015 at 1:24 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi All, I have a cluster of four nodes (three workers and one master, with one core each) which consumes data

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
when you do this? I saw a lot of lease not owned by this Kinesis Client type of errors, from what I remember. lemme know! -Chris On May 8, 2015, at 4:36 PM, Mike Trienis mike.trie...@orcsol.com wrote: - [Kinesis stream name]: The Kinesis stream that this streaming application

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
. If you see errors, you may need to manually delete the DynamoDB table.* On Fri, May 8, 2015 at 2:06 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi All, I am submitting the assembled fat jar file by the command: bin/spark-submit --jars /spark-streaming-kinesis-asl_2.10-1.3.0.jar

Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
Hi All, I am submitting the assembled fat jar file by the command: bin/spark-submit --jars /spark-streaming-kinesis-asl_2.10-1.3.0.jar --class com.xxx.Consumer -0.1-SNAPSHOT.jar It reads the data file from Kinesis using the stream name defined in a configuration file. It turns out that it

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Mike Trienis
with no success :( Would be curious to know if you got it working. Vadim On Apr 13, 2015, at 9:36 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi All, I am having trouble building a fat jar file through sbt-assembly. [warn] Merging 'META-INF/NOTICE.txt' with strategy 'rename' [warn

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Mike Trienis
a similar situation. I hope that gives some ideas for resolving your issue. Regards, Rich On Tue, Apr 14, 2015 at 1:14 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi Vadim, After removing "provided" from "org.apache.spark" %% "spark-streaming-kinesis-asl" I ended up with a huge number

sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Mike Trienis
Hi All, I am having trouble building a fat jar file through sbt-assembly. [warn] Merging 'META-INF/NOTICE.txt' with strategy 'rename' [warn] Merging 'META-INF/NOTICE' with strategy 'rename' [warn] Merging 'META-INF/LICENSE.txt' with strategy 'rename' [warn] Merging 'META-INF/LICENSE' with

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-13 Thread Mike Trienis
got it working. Vadim On Apr 13, 2015, at 9:36 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi All, I am having trouble building a fat jar file through sbt-assembly. [warn] Merging 'META-INF/NOTICE.txt' with strategy 'rename' [warn] Merging 'META-INF/NOTICE' with strategy 'rename

Re: Cannot run unit test.

2015-04-08 Thread Mike Trienis
It's because your tests are running in parallel and you can only have one context running at a time.
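If the tests are run with sbt (an assumption; the reply does not name the build tool), one way to act on this is to stop sbt from running test suites concurrently. A minimal sketch, assuming sbt 0.13 syntax in `build.sbt`:

```scala
// build.sbt (sketch)

// Run test suites one at a time so that only one SparkContext
// is alive in the JVM at any moment.
parallelExecution in Test := false
```

Each suite still needs to stop its own context when it finishes; this setting only prevents two suites from holding a context at the same time.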

Re: Spark Streaming S3 Performance Implications

2015-04-01 Thread Mike Trienis
_ From: Mike Trienis mike.trie...@orcsol.com Sent: Wednesday, March 18, 2015 2:45 PM Subject: Spark Streaming S3 Performance Implications To: user@spark.apache.org Hi All, I am pushing data from a Kinesis stream to S3 using Spark Streaming and noticed that during

Spark Streaming S3 Performance Implications

2015-03-18 Thread Mike Trienis
Hi All, I am pushing data from a Kinesis stream to S3 using Spark Streaming and noticed that during testing (i.e. master=local[2]) the batches (1 second intervals) were falling behind the incoming data stream at about 5-10 events / second. It seems that the rdd.saveAsTextFile(s3n://...) is taking

Re: Writing to S3 and retrieving folder names

2015-03-05 Thread Mike Trienis
Please ignore my question; you can simply specify the root directory and it looks like Redshift takes care of the rest. copy mobile from 's3://BUCKET_NAME/' credentials json 's3://BUCKET_NAME/jsonpaths.json' On Thu, Mar 5, 2015 at 3:33 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi

Writing to S3 and retrieving folder names

2015-03-05 Thread Mike Trienis
Hi All, I am receiving data from AWS Kinesis using Spark Streaming and am writing the data collected in the dstream to S3 using the output function: dstreamData.saveAsTextFiles(s3n://XXX:XXX@/) After running the application for several seconds, I end up with a sequence of directories in S3 that

Pushing data from AWS Kinesis - Spark Streaming - AWS Redshift

2015-03-01 Thread Mike Trienis
Hi All, I am looking at integrating a data stream from AWS Kinesis to AWS Redshift and since I am already ingesting the data through Spark Streaming, it seems convenient to also push that data to AWS Redshift at the same time. I have taken a look at the AWS Kinesis connector although I am not

Integrating Spark Streaming with Reactive Mongo

2015-02-26 Thread Mike Trienis
Hi All, I have Spark Streaming set up to write data to a replicated MongoDB database and would like to understand if there would be any issues using the Reactive Mongo library to write directly to MongoDB? My stack is Apache Spark sitting on top of Cassandra for the datastore, so my thinking

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Mike Trienis
. February 2015 10:03 To: Paolo Platter paolo.plat...@agilelab.it Cc: Mike Trienis mike.trie...@orcsol.com, user@spark.apache.org user@spark.apache.org Subject: Re: Datastore HDFS vs Cassandra One additional comment I would make is that you should be careful with updates in Cassandra; it does