Re: Manipulating RDDs within a DStream

2014-10-30 Thread lalit1303
Hi, Since the Cassandra object is not serializable, you can't open the connection at the driver level and access the object inside foreachRDD (i.e. at the worker level). You have to open the connection inside foreachRDD only, perform the operation, and then close the connection. For example: wordCounts.fore
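A minimal sketch of the pattern described above, using a hypothetical createCassandraConnection() helper rather than the actual connector API (names are illustrative):

    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Open the connection on the worker so nothing non-serializable
        // is captured in the driver-side closure.
        val connection = createCassandraConnection()  // hypothetical helper
        records.foreach(record => connection.insert(record))
        connection.close()
      }
    }

Opening one connection per partition (rather than per record) keeps the connection overhead manageable.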

Re: Submiting Spark application through code

2014-10-30 Thread Sonal Goyal
What do your worker logs say? Best Regards, Sonal Nube Technologies On Fri, Oct 31, 2014 at 11:44 AM, sivarani wrote: > I tried running it but it didn't work > > public static final SparkConf batchConf= new SparkConf(); > String mast

Re: Scaladoc

2014-10-30 Thread Kamal Banga
In IntelliJ, Tools > Generate Scaladoc. Kamal On Fri, Oct 31, 2014 at 5:35 AM, Alessandro Baretta wrote: > How do I build the scaladoc html files from the spark source distribution? > > Alex Bareta >

Re: Submiting Spark application through code

2014-10-30 Thread sivarani
I tried running it but it didn't work public static final SparkConf batchConf= new SparkConf(); String master = "spark://sivarani:7077"; String spark_home ="/home/sivarani/spark-1.0.2-bin-hadoop2/"; String jar = "/home/sivarani/build/Test.jar"; public static final JavaSparkContext batchSparkContext = n

Re: Using a Database to persist and load data from

2014-10-30 Thread Yanbo Liang
AFAIK, you can read data from a DB with JdbcRDD, but there is no interface for writing to a DB. JdbcRDD has some restrictions, such as the SQL must have a "where" clause. For writing to a DB, you can use mapPartitions or foreachPartition to implement it. You can refer to this example: http://stackoverflow.com/questions/
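A rough sketch of the write path with foreachPartition and plain JDBC; the URL, table, and column names are placeholders, and myRdd is assumed to be an RDD of string key/value pairs:

    import java.sql.DriverManager

    myRdd.foreachPartition { rows =>
      // One connection per partition, opened on the worker side
      val conn = DriverManager.getConnection("jdbc:vertica://host:5433/db", "user", "pass")
      val stmt = conn.prepareStatement("INSERT INTO mytable (k, v) VALUES (?, ?)")
      rows.foreach { case (k, v) =>
        stmt.setString(1, k)
        stmt.setString(2, v)
        stmt.executeUpdate()
      }
      stmt.close()
      conn.close()
    }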

Re: NonSerializable Exception in foreachRDD

2014-10-30 Thread Tobias Pfeiffer
Harold, just mentioning it in case you run into it: If you are in a separate thread, there are apparently stricter limits to what you can and cannot serialize: val someVal future { // be very careful with defining RDD operations using someVal here val myLocalVal = someVal // use myLocalVal
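A sketch of the workaround being described, with illustrative names and assuming an existing rdd:

    import scala.concurrent._
    import ExecutionContext.Implicits.global

    val someVal = 42
    future {
      // Copy the outer value into a local val before using it in an RDD
      // closure from another thread; referencing someVal directly can
      // trip serialization in surprising ways.
      val myLocalVal = someVal
      rdd.map(x => x + myLocalVal).count()
    }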

Re: use additional ebs volumes for hdfs storage with spark-ec2

2014-10-30 Thread Daniel Mahler
Thanks Akhil. I tried changing /root/ephemeral-hdfs/conf/hdfs-site.xml to have dfs.data.dir /vol,/vol0,/vol1,/vol2,/vol3,/vol4,/vol5,/vol6,/vol7,/mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data and then running /root/ephemeral-hdfs/bin/stop-all.sh copy-dir /root/ephemeral-hdfs/conf

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-30 Thread Jianshi Huang
Hi Preshant, Chester, Mohammed, I switched to Spark's Akka and now it works well. Thanks for the help! (Need to exclude Akka from Spray dependencies, or specify it as provided) Jianshi On Thu, Oct 30, 2014 at 3:17 AM, Mohammed Guller wrote: > I am not sure about that. > > > > Can you try a

Re: Doing RDD."count" in parallel, or at least parallelize it as much as possible?

2014-10-30 Thread Sonal Goyal
Hey Sameer, Wouldn't local[x] run count in parallel in each of the x threads? Best Regards, Sonal Nube Technologies On Thu, Oct 30, 2014 at 11:42 PM, Sameer Farooqui wrote: > Hi Shahab, > > Are you running Spark in Local, Standal

Spark Streaming Issue not running 24/7

2014-10-30 Thread sivarani
The problem is simple: I want to stream data 24/7, do some calculations, and save the result in a csv/json file so that I could use it for visualization using dc.js/d3.js. I opted for Spark Streaming on a YARN cluster with Kafka and tried running it 24/7, using GroupByKey and updateStateByKey to have

SizeEstimator in Spark 1.1 and high load/object allocation when reading in data

2014-10-30 Thread Erik Freed
Hi All, We have recently moved to Spark 1.1 from 0.9 for an application handling a fair number of very large datasets partitioned across multiple nodes. About half of each of these large datasets is stored in off heap byte arrays and about half in the standard Java heap. While these datasets are

Re: Use RDD like a Iterator

2014-10-30 Thread Zhan Zhang
RDD.toLocalIterator returns the partitions one by one, but with all elements in each partition materialized, so it is not lazily calculated within a partition. Given the design of Spark, it is very hard to maintain the state of an iterator across runJob calls. def toLocalIterator: Iterator[T] = { def collectPartition(p: Int): Array[T]
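A small usage sketch; only one partition's worth of data needs to fit on the driver at any time:

    val rdd = sc.parallelize(1 to 1000000, 100)
    // Partitions are fetched to the driver one at a time, but each fetched
    // partition is fully materialized before iteration over it begins.
    rdd.toLocalIterator.foreach { x =>
      // process x
    }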

Re: SparkSQL + Hive Cached Table Exception

2014-10-30 Thread Michael Armbrust
Hmmm, this looks like a bug. Can you file a JIRA? On Thu, Oct 30, 2014 at 4:04 PM, Jean-Pascal Billaud wrote: > Hi, > > While testing SparkSQL on top of our Hive metastore, I am getting > some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD > table. > > Basically, I have a t

Re: [scala-user] Why aggregate is inconsistent?

2014-10-30 Thread Xuefeng Wu
My other question is why Spark does not provide foldLeft: *def foldLeft[U](zeroValue: U)(op: (U, T) => U): U*, but only aggregate. The *def fold(zeroValue: T)(op: (T, T) => T): T* in Spark is not deterministic either. On Thu, Oct 30, 2014 at 3:50 PM, Jason Zaugg wrote: > On Thu, Oct 30, 2014 at 5:39 PM
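A small illustration of the determinism point: fold's result depends on partitioning unless the op is associative and commutative (a sketch; the value of the first result varies with partitioning):

    val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0), 2)

    // Subtraction is neither associative nor commutative, so the result
    // depends on how elements are grouped into partitions and on the order
    // in which the per-partition results are merged.
    val a = nums.fold(0.0)(_ - _)

    // Addition with 0.0 as its identity is associative and commutative,
    // so the result is deterministic.
    val b = nums.fold(0.0)(_ + _)   // always 10.0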

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
Yeah, I think you should file this as a bug. The problem is that JARs need to also be added into the Scala compiler and REPL class loader, and we probably don't do this for the ones in this driver config property. Matei > On Oct 30, 2014, at 6:07 PM, Shay Seng wrote: > > -- jars does indeed w

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Shay Seng
--jars does indeed work but this causes the "jars" to also get shipped to the workers -- which I don't want to do for efficiency reasons. I think you are saying that setting "spark.driver.extraClassPath" in spark-defaults.conf ought to have the same behavior as providing "--driver-class-path" to

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
Try using --jars instead of the driver-only options; they should work with spark-shell too but they may be less tested. Unfortunately, you do have to specify each JAR separately; you can maybe use a shell script to list a directory and get a big list, or set up a project that builds all of the

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Thanks Jimmy. I will have a try. Thanks very much for your guys' help. Best, Peng On Thu, Oct 30, 2014 at 8:19 PM, Jimmy wrote: > sampleRDD. cache() > > Sent from my iPhone > > On Oct 30, 2014, at 5:01 PM, peng xia wrote: > > Hi Xiangrui, > > Can you give me some code example about caching, as

Confused about class paths in spark 1.1.0

2014-10-30 Thread Shay Seng
Hi, I've been trying to move up from spark 0.9.2 to 1.1.0. I'm getting a little confused with the setup for a few different use cases, grateful for any pointers... (1) spark-shell + with jars that are only required by the driver (1a) I added "spark.driver.extraClassPath /mypath/to.jar" to my spa

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Jimmy
sampleRDD.cache() Sent from my iPhone > On Oct 30, 2014, at 5:01 PM, peng xia wrote: > > Hi Xiangrui, > > Can you give me some code example about caching, as I am new to Spark. > > Thanks, > Best, > Peng > >> On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng wrote: >> Then caching should sol

Re: SparkContext UI

2014-10-30 Thread Sameer Farooqui
Hi Stuart, You're close! Just add a () after the cache, like: data.cache() ...and then run the .count() action on it and you should be good to see it in the Storage UI! - Sameer On Thu, Oct 30, 2014 at 4:50 PM, Stuart Horsman wrote: > Sorry too quick to pull the trigger on my original email

Scaladoc

2014-10-30 Thread Alessandro Baretta
How do I build the scaladoc html files from the spark source distribution? Alex Bareta

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Hi Xiangrui, Can you give me some code example about caching, as I am new to Spark. Thanks, Best, Peng On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng wrote: > Then caching should solve the problem. Otherwise, it is just loading > and parsing data from disk for each iteration. -Xiangrui > > On

Re: SparkContext UI

2014-10-30 Thread Stuart Horsman
Sorry, too quick to pull the trigger on my original email. I should have added that I tried using persist() and cache() but no joy. I'm doing this: data = sc.textFile("somedata") data.cache data.count() but I still can't see anything in the Storage tab? On 31 October 2014 10:42, Sameer Farooq

Re: SparkContext UI

2014-10-30 Thread Sameer Farooqui
Hey Stuart, The RDD won't show up under the Storage tab in the UI until it's been cached. Basically Spark doesn't know what the RDD will look like until it's cached, b/c up until then the RDD is just on disk (external to Spark). If you launch some transformations + an action on an RDD that is pure
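Putting the whole flow together, a minimal sketch (the file path is illustrative):

    val data = sc.textFile("somefile")
    data.cache()   // note the parentheses: this only marks the RDD for caching
    data.count()   // an action is needed to actually compute and cache it
    // The RDD should now show up under the Storage tab at localhost:4040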

SparkContext UI

2014-10-30 Thread Stuart Horsman
Hi All, When I load an RDD with: data = sc.textFile("somefile") I don't see the resulting RDD in the SparkContext GUI on localhost:4040 under /storage. Is there something special I need to do to allow me to view this? I tried both the Scala and Python shells but got the same result. Thanks Stuart

Re: akka connection refused bug, fix?

2014-10-30 Thread freedafeng
followed this http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Akka-Error-while-running-Spark-Jobs/td-p/18602 but the problem was not fixed.. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/akka-connection-refused-bug-fix-tp17764p17774.html

SparkSQL + Hive Cached Table Exception

2014-10-30 Thread Jean-Pascal Billaud
Hi, While testing SparkSQL on top of our Hive metastore, I am getting some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD table. Basically, I have a table "mtable" partitioned by some "date" field in hive and below is the scala code I am running in spark-shell: val sqlContex

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Xiangrui Meng
Then caching should solve the problem. Otherwise, it is just loading and parsing data from disk for each iteration. -Xiangrui On Thu, Oct 30, 2014 at 11:44 AM, peng xia wrote: > Thanks for all your help. > I think I didn't cache the data. My previous cluster was expired and I don't > have a chanc

RE: how idf is calculated

2014-10-30 Thread Ashic Mahtab
Hi Andrejs, The calculations are a bit different from what I've come across in Mining Massive Datasets (2nd Ed., Ullman et al., Cambridge Press), available here: http://www.mmds.org/ Their calculation of IDF is as follows: IDF_i = log2(N / n_i), where N is the number of documents and n_i is the number o
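For comparison, MLlib's IDF (as of Spark 1.1, quoted here from memory, so worth double-checking against the docs) adds smoothing and uses the natural log:

    IDF(t) = log((m + 1) / (d(t) + 1))

where m is the total number of documents and d(t) is the number of documents containing term t. The +1 smoothing and the different log base alone are enough to make the numbers differ from the MMDS formula above.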

how idf is calculated

2014-10-30 Thread Andrejs Abele
Hi, I'm writing a paper and I need to calculate tf-idf. With your help I managed to get the results I needed, but the problem is that I need to be able to explain how each number was obtained. So I tried to understand how IDF was calculated, and the numbers I get don't correspond to those I should get.

Re: does updateStateByKey accept a state that is a tuple?

2014-10-30 Thread spr
I think I understand how to deal with this, though I don't have all the code working yet. The point is that the V of (K, V) can itself be a tuple. So the updateFunc prototype looks something like val updateDhcpState = (newValues: Seq[Tuple1[(Int, Time, Time)]], state: Option[Tuple1[(Int, Tim
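A minimal sketch of a tuple-valued state, using Long timestamps instead of the Time type above to keep it self-contained; keyedDStream is assumed to be a DStream[(String, (Int, Long, Long))]:

    // State per key: (count, firstSeen, lastSeen)
    val updateFunc = (newValues: Seq[(Int, Long, Long)], state: Option[(Int, Long, Long)]) => {
      val (c0, min0, max0) = state.getOrElse((0, Long.MaxValue, Long.MinValue))
      val count = c0 + newValues.map(_._1).sum
      val first = (min0 +: newValues.map(_._2)).min
      val last  = (max0 +: newValues.map(_._3)).max
      Some((count, first, last))
    }
    val stateDStream = keyedDStream.updateStateByKey(updateFunc)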

Re: Best way to partition RDD

2014-10-30 Thread shahab
Thanks Helena, very useful comment. But is 'spark.cassandra.input.split.size' only effective in a cluster, not on a single node? best, /Shahab On Thu, Oct 30, 2014 at 6:26 PM, Helena Edelson wrote: > Shahab, > > Regardless, WRT cassandra and spark when using the spark cassandra > connector, 'spark

Re: Creating a SchemaRDD from RDD of thrift classes

2014-10-30 Thread Michael Armbrust
That should be possible, although I'm not super familiar with thrift. You'll probably need access to the generated metadata. If you find yourself reading a lot of thrift data you might consider w

Creating a SchemaRDD from RDD of thrift classes

2014-10-30 Thread ankits
I have one job with spark that creates some RDDs of type X and persists them in memory. The type X is an auto generated Thrift java class (not a case class though). Now in another job, I want to convert the RDD to a SchemaRDD using sqlContext.applySchema(). Can I derive a schema from the thrift def

Registering custom metrics

2014-10-30 Thread Gerard Maas
Hi, I've been exploring the metrics exposed by Spark and I'm wondering whether there's a way to register job-specific metrics that could be exposed through the existing metrics system. Would there be an example somewhere? BTW, documentation about how the metrics work could be improved. I found

akka connection refused bug, fix?

2014-10-30 Thread freedafeng
Hi, I saw the same issue as this thread, http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-akka-connection-refused-td9864.html Anyone has a fix for this bug? Please?! The log info in my worker node is like, 14/10/30 20:15:18 INFO Worker: Asked to kill executor app-20141030201514-0

Re: k-mean - result interpretation

2014-10-30 Thread Sean Owen
Yes that is exactly it. The values are not comparable since normalization is also shrinking all distances. Squared error is not an absolute metric. I haven't thought about this much but maybe you are looking for something like the silhouette coefficient? On Oct 30, 2014 5:35 PM, "mgCl2" wrote: >

Re: Do Spark executors restrict native heap vs JVM heap?

2014-10-30 Thread Sean Owen
No, but the JVM also does not allocate memory for native code on the heap. I don't think the heap has any bearing on whether your native code can allocate more memory, except that of course the heap is also taking memory. On Oct 30, 2014 6:43 PM, "Paul Wais" wrote: > Dear Spark List, > > I have a Sp

Re: Out of memory with Spark Streaming

2014-10-30 Thread Chris Fregly
Curious about why you're only seeing 50 records max per batch. How many receivers are you running? What is the rate at which you're putting data onto the stream? Per the default AWS Kinesis configuration, the producer can do 1000 PUTs per second with a max of 50 KB per PUT and a max of 1 MB per second per

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell
Algebird 0.8.0 has 2.11 support if you want to run in a 2.11 env. On Thu, Oct 30, 2014 at 10:08 AM, Buntu Dev wrote: > Thanks.. I was using Scala 2.11.1 and was able to > use algebird-core_2.10-0.1.11.jar with spark-shell. > > On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell > wrote: > >> Whats t

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Thanks for all your help. I think I didn't cache the data. My previous cluster has expired and I don't have a chance to check the load balancing or the app manager. Below is my code. There are 18 features for each record and I am using the Scala API. import org.apache.spark.SparkConf import org.apache.s

Do Spark executors restrict native heap vs JVM heap?

2014-10-30 Thread Paul Wais
Dear Spark List, I have a Spark app that runs native code inside map functions. I've noticed that the native code sometimes sets errno to ENOMEM indicating a lack of available memory. However, I've verified that the /JVM/ has plenty of heap space available-- Runtime.getRuntime().freeMemory() sho

does updateStateByKey accept a state that is a tuple?

2014-10-30 Thread spr
I'm trying to implement a Spark Streaming program to calculate the number of instances of a given key encountered and the minimum and maximum times at which it was encountered. updateStateByKey seems to be just the thing, but when I define the "state" to be a tuple, I get compile errors I'm not fi

Re: stage failure: java.lang.IllegalStateException: unread block data

2014-10-30 Thread freedafeng
The worker side has error message as this, 14/10/30 18:29:00 INFO Worker: Asked to launch executor app-20141030182900-0006/0 for testspark_v1 14/10/30 18:29:01 INFO ExecutorRunner: Launch command: "java" "-cp" "::/root/spark-1.1.0/conf:/root/spark-1.1.0/assembly/target/scala-2.10/spark-assembly-1.

Re: Ambiguous references to id: what does it mean?

2014-10-30 Thread Terry Siu
Found this as I am having the same issue. I have exactly the same usage as shown in Michael's join example. I tried executing a SQL statement against the joined data set with two columns that have the same name, and tried to disambiguate the column name with the table alias, but I would still get an

Re: Doing RDD."count" in parallel, or at least parallelize it as much as possible?

2014-10-30 Thread Sameer Farooqui
By the way, in case you haven't done so, do try to .cache() the RDD before running a .count() on it as that could make a big speed improvement. On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui wrote: > Hi Shahab, > > Are you running Spark in Local, Standalone, YARN or Mesos mode? > > If you'r

Re: Doing RDD."count" in parallel, or at least parallelize it as much as possible?

2014-10-30 Thread Sameer Farooqui
Hi Shahab, Are you running Spark in Local, Standalone, YARN or Mesos mode? If you're running in Standalone/YARN/Mesos, then the .count() action is indeed automatically parallelized across multiple Executors. When you run a .count() on an RDD, it is actually distributing tasks to different execut

stage failure: java.lang.IllegalStateException: unread block data

2014-10-30 Thread freedafeng
Hi, Got this error when running Spark 1.1.0 to read HBase 0.98.1 through simple Python code in an EC2 cluster. The same program runs correctly in local mode, so this error only happens when running in a real cluster. Here's what I got, 14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.1 in

Re: Returned type of Broadcast variable is byte array

2014-10-30 Thread Stephen Boesch
The byte array turns out to be a serialized ObjectOutputStream that contains a Tuple2[ParallelCollectionRDD,Function2]. What then should be done differently in the broadcast code (which follows the structure of an example taken from mllib)? assert(crows.isInstanceOf[Array[MVector]]) val bcRows =

k-mean - result interpretation

2014-10-30 Thread mgCl2
Hello everyone, I'm trying to use MLlib's K-means algorithm. I tried it on raw data; here is an example of a line contained in my input data set: 82.9817 3281.4495 with those parameters: *numClusters*=4 *numIterations*=20 results: *WSSSE = 6.375371241589461E9* Then I normalized my data: 0.022190

Re: MLLib: libsvm - default value initialization

2014-10-30 Thread Xiangrui Meng
You can remove 0.5 from all non-zeros. -Xiangrui On Wed, Oct 29, 2014 at 9:20 PM, Sameer Tilak wrote: > Hi All, > I have my sparse data in libsvm format. > > val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, > "mllib/data/sample_libsvm_data.txt") > > I am running Linear regression. Let

Re: Best way to partition RDD

2014-10-30 Thread Helena Edelson
Shahab, Regardless, WRT cassandra and spark when using the spark cassandra connector, 'spark.cassandra.input.split.size' passed into the SparkConf configures the approx number of Cassandra partitions in a Spark partition (default 10). No repartitioning should be necessary with what you have
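A minimal sketch of setting it; the app name, host, and split size are placeholder values:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-example")                       // illustrative
      .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder
      .set("spark.cassandra.input.split.size", "10000")      // illustrative value
    val sc = new SparkContext(conf)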

Doing RDD."count" in parallel, or at least parallelize it as much as possible?

2014-10-30 Thread shahab
Hi, I noticed that the "count" (of an RDD) in many of my queries is the most time-consuming one, as it runs in the "driver" process rather than being done by parallel worker nodes. Is there any way to perform "count" in parallel, or at least parallelize it as much as possible? best, /Shahab

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Xiangrui Meng
Did you cache the data and check the load balancing? How many features? Which API are you using, Scala, Java, or Python? -Xiangrui On Thu, Oct 30, 2014 at 9:13 AM, Jimmy wrote: > Watch the app manager it should tell you what's running and taking awhile... > My guess it's a "distinct" function on

Re: Best way to partition RDD

2014-10-30 Thread shahab
Hi Helena, Well... I am just running a toy example. I have one Cassandra node co-located with the Spark master and one of the Spark workers, all on one machine. I have another node which runs the second Spark worker. /Shahab, On Thu, Oct 30, 2014 at 6:12 PM, Helena Edelson wrote: > Hi Shahab, > -

Re: Best way to partition RDD

2014-10-30 Thread Helena Edelson
Hi Shahab, -How many spark/cassandra nodes are in your cluster? -What is your deploy topology for spark and cassandra clusters? Are they co-located? - Helena @helenaedelson On Oct 30, 2014, at 12:16 PM, shahab wrote: > Hi. > > I am running an application in the Spark which first loads data f

Re: Algebird using spark-shell

2014-10-30 Thread Buntu Dev
Thanks.. I was using Scala 2.11.1 and was able to use algebird-core_2.10-0.1.11.jar with spark-shell. On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell wrote: > Whats the error with the 2.10 version of algebird? > > On Thu, Oct 30, 2014 at 12:49 AM, thadude wrote: > >> I've tried: >> >> . /bin/spa

Manipulating RDDs within a DStream

2014-10-30 Thread Harold Nguyen
Hi all, I'd like to be able to modify values in a DStream, and then send it off to an external source like Cassandra, but I keep getting Serialization errors and am not sure how to use the correct design pattern. I was wondering if you could help me. I'd like to be able to do the following: wor

Re: Manipulating RDDs within a DStream

2014-10-30 Thread Harold Nguyen
Hi, Sorry, there's a typo there: val arr = rdd.toArray Harold On Thu, Oct 30, 2014 at 9:58 AM, Harold Nguyen wrote: > Hi all, > > I'd like to be able to modify values in a DStream, and then send it off to > an external source like Cassandra, but I keep getting Serialization errors > and am n

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Vladimir Rodionov
"GC overhead limit exceeded" is usually a sign of either inadequate heap size (not the case here) or the application producing garbage (temp objects) faster than the garbage collector can collect them - GC consumes most CPU cycles. 17G of Java heap is quite large for many applications and is above "safe and recomme

Re: Spark + Tableau

2014-10-30 Thread Bojan Kostic
I'm connecting to it remotely with Tableau/beeline. On Thu Oct 30 16:51:13 2014 GMT+0100, Denny Lee [via Apache Spark User List] wrote: > > > When you are starting the thrift server service - are you connecting to it > locally or is this on a remote server when you use beeline and/or Tableau? >

Best way to partition RDD

2014-10-30 Thread shahab
Hi. I am running an application in Spark which first loads data from Cassandra and then performs some map/reduce jobs. val srdd = sqlContext.sql("select * from mydb.mytable " ) I noticed that the "srdd" only has one partition, no matter how big the data loaded from Cassandra is. So I perfo
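A sketch of forcing more partitions after the load; the partition count is an arbitrary example, and whether this helps depends on where the time is actually spent:

    val srdd = sqlContext.sql("select * from mydb.mytable")
    // Redistribute the rows so later stages can run more tasks in parallel;
    // note that repartition triggers a shuffle.
    val repartitioned = srdd.repartition(64)
    println(repartitioned.partitions.length)   // should now report 64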

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Jimmy
Watch the app manager; it should tell you what's running and taking a while... My guess is it's a "distinct" function on the data. J Sent from my iPhone > On Oct 30, 2014, at 8:22 AM, peng xia wrote: > > Hi, > > > > Previous we have applied SVM algorithm in MLlib to 5 million records (600 > mb

Re: Spark + Tableau

2014-10-30 Thread Denny Lee
When you are starting the thrift server service - are you connecting to it locally or is this on a remote server when you use beeline and/or Tableau? On Thu, Oct 30, 2014 at 8:00 AM, Bojan Kostic wrote: > I use beta driver SQL ODBC from Databricks. > > > > -- > View this message in context: > ht

RE: problem with start-slaves.sh

2014-10-30 Thread Pagliari, Roberto
I also didn’t realize I was trying to bring up the 2ndNameNode as a slave.. that might be an issue as well.. Thanks, From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Thursday, October 30, 2014 11:27 AM To: Pagliari, Roberto Cc: user@spark.apache.org Subject: Re: problem with start-sla

Re: problem with start-slaves.sh

2014-10-30 Thread Yana Kadiyska
Roberto, I don't think shark is an issue -- I have shark server running on a node that also acts as a worker. What you can do is turn off shark server and just run start-all to start your Spark cluster. Then you can try bin/spark-shell --master and see if you can successfully run some "hello world" s

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell
Whats the error with the 2.10 version of algebird? On Thu, Oct 30, 2014 at 12:49 AM, thadude wrote: > I've tried: > > . /bin/spark-shell --jars algebird-core_2.10-0.8.1.jar > > scala> import com.twitter.algebird._ > import com.twitter.algebird._ > > scala> import HyperLogLog._ > import HyperLog

issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Hi, Previously we applied the SVM algorithm in MLlib to 5 million records (600 MB); it takes more than 25 minutes to finish. The Spark version we are using is 1.0 and we were running this program on a 4-node cluster. Each node has 4 CPU cores and 11 GB RAM. The 5 million records only have two d
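A sketch of the caching suggestion made elsewhere in this thread, with a toy two-feature dataset standing in for the real input and sc an existing SparkContext:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 0.5)),
      LabeledPoint(0.0, Vectors.dense(0.2, 1.3))
    ))
    training.cache()   // avoid re-reading/re-parsing the input on every SGD iteration
    val model = SVMWithSGD.train(training, 100)   // 100 iterations, illustrative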

Using a Database to persist and load data from

2014-10-30 Thread Asaf Lahav
Hi Ladies and Gents, I would like to know what options I have if I want to leverage Spark code I have already written to use a DB (Vertica) as its store/datasource. The data is of a tabular nature, so any relational DB can essentially be used. Do I need to develop a context? If yes, ho

Re: Spark + Tableau

2014-10-30 Thread Bojan Kostic
I use beta driver SQL ODBC from Databricks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p17727.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: Spark + Tableau

2014-10-30 Thread Jimmy
What ODBC driver are you using? We recently got the Hortonworks JODBC drivers working on a Windows box but were having issues with Mac Sent from my iPhone > On Oct 30, 2014, at 4:23 AM, Bojan Kostic wrote: > > I'm testing beta driver from Databricks for Tableua. > And unfortunately i enco

Returned type of Broadcast variable is byte array

2014-10-30 Thread Stephen Boesch
As a template for creating a broadcast variable, the following code snippet within mllib was used: val bcIdf = dataset.context.broadcast(idf) dataset.mapPartitions { iter => val thisIdf = bcIdf.value The new code follows that model: import org.apache.spark.mllib.linalg.{Vector =>

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Ilya Ganelin
The split is something like 30 million into 2 million partitions. The reason that it becomes tractable is that after I perform the Cartesian on the split data and operate on it I don't keep the full results - I actually only keep a tiny fraction of that generated dataset - making the overall dataset

Re: Getting vector values

2014-10-30 Thread Sean Owen
Call toArray on the Vector and print that, or toBreeze. On Oct 30, 2014 12:38 PM, "Andrejs Abele" wrote: > Hi, > > I'm new to Mllib and spark. I'm trying to use tf-idf and use those values > for term ranking. > I'm getting tf values in vector format, but how can get the values of > vector? > >
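A tiny sketch of the toArray suggestion, using a dense vector; the same call works on sparse vectors:

    import org.apache.spark.mllib.linalg.Vectors

    val v = Vectors.dense(0.5, 0.0, 1.5)
    println(v.toArray.mkString(", "))   // prints: 0.5, 0.0, 1.5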

Getting vector values

2014-10-30 Thread Andrejs Abele
Hi, I'm new to Mllib and spark. I'm trying to use tf-idf and use those values for term ranking. I'm getting tf values in vector format, but how can get the values of vector? val sc = new SparkContext(conf) val documents: RDD[Seq[String]] = sc.textFile("/home/andrejs/Datasets/dbpedia/test.t

Spark + Tableau

2014-10-30 Thread Bojan Kostic
I'm testing the beta driver from Databricks for Tableau. And unfortunately I encountered some issues. While the beeline connection works without problems, Tableau can't connect to the Spark thrift server. Error from the driver (Tableau): Unable to connect to the ODBC Data Source. Check that the necessary drivers are

Re: Spark Debugging

2014-10-30 Thread Akhil Das
Thanks Best Regards On Thu, Oct 30, 2014 at 1:43 PM, Akhil Das wrote: > Awesome. > > Thanks > Best Regards > > On Thu, Oct 30, 2014 at 1:30 PM, Naveen Kumar Pokala < > npok...@spcapitaliq.com> wrote: > >> Thanks Akhil, It is working. >> >> >> >> Regards, >> >> Naveen >> >> >> >> *From:* Akhil Da

sharing RDDs between PySpark and Scala

2014-10-30 Thread rok
I'm processing some data using PySpark and I'd like to save the RDDs to disk (they are (k,v) RDDs of strings and SparseVector types) and read them in using Scala to run them through some other analysis. Is this possible? Thanks, Rok -- View this message in context: http://apache-spark-user-l

Re: BUG: when running as "extends App", closures don't capture variables

2014-10-30 Thread Sean Owen
Very coincidentally I ran into something equally puzzling yesterday where something was bizarrely null when it can't have been in a Spark program that extends App. I also changed to use main() and it works fine. So definitely some issue here. If nobody makes a JIRA before I get home I'll do it. On
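A minimal sketch of the pattern being described; the Spark calls are commented out so the snippet stands alone, and the names are illustrative:

    // Fields of an object that extends App are initialized via DelayedInit,
    // so a closure shipped to executors may observe them as null/default values.
    object MyAppBroken extends App {
      val factor = 2
      // sc.parallelize(1 to 10).map(_ * factor).collect()   // factor may be 0 here
    }

    // Defining an explicit main() avoids the issue.
    object MyAppOk {
      def main(args: Array[String]): Unit = {
        val factor = 2
        // sc.parallelize(1 to 10).map(_ * factor).collect()
      }
    }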

NonSerializable Exception in foreachRDD

2014-10-30 Thread Harold Nguyen
Hi all, In Spark Streaming, when I do "foreachRDD" on my DStreams, I get a NonSerializable exception when I try to do something like: DStream.foreachRDD( rdd => { var sc.parallelize(Seq(("test", "blah"))) }) Is there any way around that ? Thanks, Harold

Re: Algebird using spark-shell

2014-10-30 Thread thadude
I've tried: . /bin/spark-shell --jars algebird-core_2.10-0.8.1.jar scala> import com.twitter.algebird._ import com.twitter.algebird._ scala> import HyperLogLog._ import HyperLogLog._ scala> import com.twitter.algebird.HyperLogLogMonoid import com.twitter.algebird.HyperLogLogMonoid scala> val

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Sean Owen
Can you be more specific about numbers? I am not sure that splitting helps so much in the end, in that it has the same effect as executing a smaller number at a time of the large number of tasks that the full cartesian join would generate. The full join is probably intractable no matter what in thi

Re: Java api overhead?

2014-10-30 Thread Sean Owen
Lots of tuples get used in Spark itself and its Scala-based implementation anyway. This is the type of (a,b) in Scala. You can use the profiler to see where these are being allocated, but my guess is it is not just in translating to/from Java. You can also call directly into Scala implementations dire

Embedding Spark Masters+Zk, Workers, SparkContext, App in single JVM, clustered (sorta for symmetric deployment)

2014-10-30 Thread Aditya Varun Chadha
Hi, Is it possible to start a Spark standalone master inside my own JVM? What I would like to do is the following, in my own main (object MyApp extends App): * Start ZooKeeper in embedded (and clustered) mode * Start a Spark master in the same JVM, referring to the above ZooKeeper quorum for the HA