Can't access remote Hive table from spark

2015-01-25 Thread guxiaobo1982
Hi, I built and started a single node standalone Spark 1.2.0 cluster along with a single node Hive 0.14.0 instance installed by Ambari 1.17.0. On the Spark and Hive node I can create and query tables inside Hive, and on remote machines I can submit the SparkPi example to the Spark master. But I

where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Larry Liu
I would like to persist an RDD to an HDFS or NFS mount. How do I change the location?

Shuffle to HDFS

2015-01-25 Thread Larry Liu
How can I change the shuffle output location to HDFS or NFS?

Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-25 Thread Denis Mikhalkin
Hi Nicholas, thanks for your reply. I checked spark-redshift - it's only for the UNLOAD data files stored on Hadoop, not for online result sets from the DB. Do you know of any example of a custom RDD that fetches the data on the fly (not reading from HDFS)? Thanks. Denis From: Nicholas Chammas

graph.inDegrees including zero values

2015-01-25 Thread scharissis
Hi, If a vertex has no in-degree then Spark's GraphOps 'inDegrees' does not return it at all. Instead, it would be very useful to have that vertex returned with an in-degree of zero. What's the best way to achieve this using the GraphX API? For example, given a graph with nodes A,B

Re: where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Charles Feduke
I think you want to instead use `.saveAsSequenceFile` to save an RDD to someplace like HDFS or NFS if you are attempting to interoperate with another system, such as Hadoop. `.persist` is for keeping the contents of an RDD around so future uses of that particular RDD don't need to recalculate its c
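
A minimal sketch of the distinction Charles draws, assuming a pair RDD and a placeholder HDFS path: saveAsSequenceFile writes the data out for other systems to read, while persist only caches the RDD for reuse within the same application.

    import org.apache.spark.SparkContext._        // implicit conversions for saveAsSequenceFile
    import org.apache.spark.storage.StorageLevel

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

    // Write the RDD out so another system (e.g. a Hadoop job) can read it.
    pairs.saveAsSequenceFile("hdfs://namenode:8020/tmp/pairs-seq")   // hypothetical path

    // Keep the RDD around only for later actions within this application.
    pairs.persist(StorageLevel.DISK_ONLY)
    pairs.count()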

Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-25 Thread Charles Feduke
I'm facing a similar problem except my data is already pre-sharded in PostgreSQL. I'm going to attempt to solve it like this: - Submit the shard names (database names) across the Spark cluster as a text file and partition it so workers get 0 or more - hopefully 1 - shard name. In this case you co

Re: what is the roadmap for Spark SQL dialect in the coming releases?

2015-01-25 Thread Niranda Perera
Thanks Michael. A clarification: so the HQL dialect provided by HiveContext, does it use the Catalyst optimizer? I thought HiveContext was only related to Hive integration in Spark! Would be grateful if you could clarify this. Cheers On Sun, Jan 25, 2015 at 1:23 AM, Michael Armbrust wrote: > I gener

foreachActive functionality

2015-01-25 Thread kundan kumar
Can someone help me understand the usage of the "foreachActive" function introduced for Vectors? I am trying to understand its usage in the MultivariateOnlineSummarizer class for summary statistics. sample.foreachActive { (index, value) => if (value != 0.0) { if (currMax(index) < v

SVD in pyspark ?

2015-01-25 Thread Andreas Rhode
Is the distributed SVD functionality exposed to Python yet? It seems it's only available to Scala or Java, unless I am missing something; I'm looking for a pyspark equivalent to org.apache.spark.mllib.linalg.SingularValueDecomposition. In case it's not there yet, is there a way to make a wrapper to call
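
For reference, a minimal sketch of the Scala API the poster refers to, using RowMatrix.computeSVD; the sample matrix here is made up, and the point is only to show what the Scala side exposes.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 9.0)))

    val mat = new RowMatrix(rows)
    // k = 2 singular values/vectors; computeU = true also materializes U
    val svd = mat.computeSVD(2, computeU = true)
    println(svd.s)   // singular values (local vector)
    println(svd.V)   // right singular vectors (local matrix)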

Re: SVD in pyspark ?

2015-01-25 Thread Chip Senkbeil
Hi Andreas, With regard to the notebook interface, you can use the Spark Kernel ( https://github.com/ibm-et/spark-kernel) as the backend for an IPython 3.0 notebook. The kernel is designed to be the foundation for interactive applications connecting to Apache Spark and uses the IPython 5.0 messag

Re: foreachActive functionality

2015-01-25 Thread Reza Zadeh
The idea is to unify the code path for dense and sparse vector operations, which makes the codebase easier to maintain. By handling (index, value) tuples, you can let the foreachActive method take care of checking if the vector is sparse or dense, and running a foreach over the values. On Sun, Jan
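
A rough sketch of the pattern Reza describes, written as a standalone helper since foreachActive itself may not be publicly callable in this release; the caller only supplies an (index, value) function and the dense/sparse dispatch happens in one place.

    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

    // Hypothetical helper mirroring the foreachActive idea.
    def forActive(v: Vector)(f: (Int, Double) => Unit): Unit = v match {
      case dv: DenseVector =>
        var i = 0
        while (i < dv.size) { f(i, dv.values(i)); i += 1 }
      case sv: SparseVector =>
        var k = 0
        while (k < sv.indices.length) { f(sv.indices(k), sv.values(k)); k += 1 }
    }

    // Same callback works for dense and sparse vectors, as in MultivariateOnlineSummarizer.
    val currMax = Array.fill(3)(Double.MinValue)
    forActive(Vectors.sparse(3, Array(0, 2), Array(5.0, -1.0))) { (index, value) =>
      if (value != 0.0 && currMax(index) < value) currMax(index) = value
    }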

Re: graph.inDegrees including zero values

2015-01-25 Thread Ankur Dave
You can do this using leftJoin, as collectNeighbors [1] does: graph.vertices.leftJoin(graph.inDegrees) { (vid, attr, inDegOpt) => inDegOpt.getOrElse(0) } [1] https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala#L145 Ankur On Sun, Jan 25, 2
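
A slightly fuller sketch of Ankur's suggestion, assuming a small hand-built graph; vertices with no incoming edges get 0 instead of being dropped.

    import org.apache.spark.graphx._

    // A -> B, A -> C; A itself has no incoming edges.
    val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1)))
    val graph = Graph(vertices, edges)

    val inDegWithZeros = graph.vertices.leftJoin(graph.inDegrees) {
      (vid, attr, inDegOpt) => inDegOpt.getOrElse(0)
    }
    inDegWithZeros.collect().foreach(println)  // A has in-degree 0, B and C have 1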

Re: Results never return to driver | Spark Custom Reader

2015-01-25 Thread Harihar Nahak
Hi Yana, As per my custom split code, only three splits are submitted to the system, so three executors are sufficient for that. But it ran 8 executors. The first three executors' logs show the exact output I want (I put some sysout calls in the code to debug it), but the next five have some other an

Re: what is the roadmap for Spark SQL dialect in the coming releases?

2015-01-25 Thread Michael Armbrust
Yeah, the HiveContext is just a SQLContext that is extended with HQL, access to a metastore, Hive UDFs and Hive SerDes. The query execution however is identical to a SQLContext. On Sun, Jan 25, 2015 at 7:24 AM, Niranda Perera wrote: > Thanks Michael. > > A clarification. So the HQL dialect prov
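
A minimal sketch of what that means in practice, assuming a reachable Hive metastore and the standard example table src; the query is parsed as HiveQL but planned and optimized by Catalyst like any other SQLContext query.

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // HiveQL parsing + metastore lookup, then the same Catalyst execution path.
    val rows = hiveContext.sql("SELECT key, value FROM src LIMIT 10")
    rows.collect().foreach(println)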

RE: Can't access remote Hive table from spark

2015-01-25 Thread Skanda Prasad
This happened to me as well, putting hive-site.xml inside conf doesn't seem to work. Instead I added /etc/hive/conf to SPARK_CLASSPATH and it worked. You can try this approach. -Skanda -Original Message- From: "guxiaobo1982" Sent: 25-01-2015 13:50 To: "user@spark.apache.org" Subjec
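
For anyone trying the same workaround, a sketch of what that could look like in conf/spark-env.sh; the path is whatever your Hive client configuration directory is.

    # conf/spark-env.sh -- make the Hive client config (hive-site.xml) visible to Spark
    export SPARK_CLASSPATH=/etc/hive/conf:$SPARK_CLASSPATH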

key already cancelled error

2015-01-25 Thread ilaxes
Hi everyone, I'm writing a program that updates a Cassandra table. I've written a first version where I update the table row by row from an RDD through a map. Now I want to build a batch of updates using the same kind of syntax as in this thread: https://groups.google.com/forum/#!msg/spark-users/LUb

Re: Spark webUI - application details page

2015-01-25 Thread ilaxes
Hi, I have a similar problem. I want to see the detailed logs of Completed Applications, so I've set in my program: set("spark.eventLog.enabled","true"). set("spark.eventLog.dir","file:/tmp/spark-events") but when I click on the application in the web UI, I get a page with the message: Application

Re: Eclipse on spark

2015-01-25 Thread Harihar Nahak
Download the pre-built binary for Windows, add all the required jars to your project's Eclipse classpath, and go ahead with Eclipse. Make sure you have the same Java version. On 25 January 2015 at 07:33, riginos [via Apache Spark User List] < ml-node+s1001560n21350...@n3.nabble.com> wrote: > How to com

Pairwise Processing of a List

2015-01-25 Thread Steve Nunez
Spark Experts, I've got a list of points: List[(Float, Float)] that represent (x,y) coordinate pairs and need to sum the distances. It's easy enough to compute the distance: case class Point(x: Float, y: Float) { def distance(other: Point): Float = sqrt(pow(x - other.x, 2) + pow(y - other

Re: Pairwise Processing of a List

2015-01-25 Thread Tobias Pfeiffer
Hi, On Mon, Jan 26, 2015 at 9:32 AM, Steve Nunez wrote: > I’ve got a list of points: List[(Float, Float)]) that represent (x,y) > coordinate pairs and need to sum the distance. It’s easy enough to compute > the distance: > Are you saying you want all combinations (N^2) of distances? That shoul

Re: Pairwise Processing of a List

2015-01-25 Thread Joseph Lust
So you’ve got a point A and you want the sum of distances between it and all other points? Or am I misunderstanding you? // target point, can be Broadcast global sent to all workers val tarPt = (10,20) val pts = Seq((2,2),(3,3),(2,3),(10,2)) val rdd= sc.parallelize(pts) rdd.map( pt => Math.sqrt(

Re: Pairwise Processing of a List

2015-01-25 Thread Steve Nunez
Not combinations, linear distances, e.g., given List[ (x1,y1), (x2,y2), (x3,y3) ], compute the sum of the distance between (x1,y1) and (x2,y2) and the distance between (x2,y2) and (x3,y3). Imagine that the list of coordinate points comes from a GPS and describes a trip. - Steve From: Joseph Lust mailto:jl...@mc10inc.

Re: Pairwise Processing of a List

2015-01-25 Thread Sean Owen
If this is really about just Scala Lists, then a simple answer (using tuples of doubles) is: val points: List[(Double,Double)] = ... val distances = for (p1 <- points; p2 <- points) yield { val dx = p1._1 - p2._1 val dy = p1._2 - p2._2 math.sqrt(dx*dx + dy*dy) } distances.sum / 2 It's "/ 2"

Re: Serializability: for vs. while loops

2015-01-25 Thread Tobias Pfeiffer
Aaron, On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson wrote: > Scala for-loops are implemented as closures using anonymous inner classes > which are instantiated once and invoked many times. This means, though, > that the code inside the loop is actually sitting inside a class, which > confuses

Re: Pairwise Processing of a List

2015-01-25 Thread Sean Owen
(PS the Scala code I posted is a poor way to do it -- it would materialize the entire cartesian product in memory. You can use .iterator or .view to fix that.) Ah, so you want sum of distances between successive points. val points: List[(Double,Double)] = ... points.sliding(2).map { case List(p1,
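
The snippet above is cut off; a complete sketch of the sliding-window idea on a plain Scala List, with made-up sample points (this is one way to finish it, not necessarily the exact code from the original mail).

    val points: List[(Double, Double)] = List((0.0, 0.0), (3.0, 4.0), (3.0, 8.0))

    val totalDistance = points.sliding(2).map {
      case List(p1, p2) =>
        val dx = p1._1 - p2._1
        val dy = p1._2 - p2._2
        math.sqrt(dx * dx + dy * dy)
    }.sum
    // 5.0 + 4.0 = 9.0 for the sample trip above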

Re: Pairwise Processing of a List

2015-01-25 Thread Tobias Pfeiffer
Sean, On Mon, Jan 26, 2015 at 10:28 AM, Sean Owen wrote: > Note that RDDs don't really guarantee anything about ordering though, > so this only makes sense if you've already sorted some upstream RDD by > a timestamp or sequence number. > Speaking of order, is there some reading on guarantees an

Re: spark streaming with checkpoint

2015-01-25 Thread Tobias Pfeiffer
Hi, On Tue, Jan 20, 2015 at 8:16 PM, balu.naren wrote: > I am a beginner to spark streaming. So have a basic doubt regarding > checkpoints. My use case is to calculate the no of unique users by day. I > am using reduce by key and window for this. Where my window duration is 24 > hours and slide

Re: [SQL] Conflicts in inferred Json Schemas

2015-01-25 Thread Tobias Pfeiffer
Hi, On Thu, Jan 22, 2015 at 2:26 AM, Corey Nolet wrote: > Let's say I have 2 formats for json objects in the same file > schema1 = { "location": "12345 My Lane" } > schema2 = { "location":{"houseAddres":"1234 My Lane"} } > > From my tests, it looks like the current inferSchema() function will en

Re: Spark webUI - application details page

2015-01-25 Thread Joseph Lust
Perhaps you need to set this in your spark-defaults.conf so that it's already set when your slave/worker processes start. -Joe On 1/25/15, 6:50 PM, "ilaxes" wrote: >Hi, > >I've a similar problem. I want to see the detailed logs of Completed >Applications so I've set in my program : >set("spar
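
A sketch of the corresponding entries in conf/spark-defaults.conf, using the same values as the thread; the event log directory must exist and be readable by the UI/history process.

    # conf/spark-defaults.conf
    spark.eventLog.enabled   true
    spark.eventLog.dir       file:/tmp/spark-events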

Re: foreachActive functionality

2015-01-25 Thread DB Tsai
PS, we were using Breeze's activeIterator originally, as you can see in the old code, but we found there was overhead there, so we implemented our own version, which runs 4x faster. See https://github.com/apache/spark/pull/3288 for details. Sincerely, DB Tsai

Re: Spark 1.2 – How to change Default (Random) port ….

2015-01-25 Thread Shailesh Birari
Can anyone please let me know? I don't want to open all ports on the network, so I am interested in the property by which I can configure this new port. Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-How-to-change-Default-Random-port-tp21306p2

Re: Spark 1.2 – How to change Default (Random) port ….

2015-01-25 Thread Aaron Davidson
This was a regression caused by Netty Block Transfer Service. The fix for this just barely missed the 1.2 release, and you can see the associated JIRA here: https://issues.apache.org/jira/browse/SPARK-4837 Current master has the fix, and the Spark 1.2.1 release will have it included. If you don't
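
Once on a build with the fix, the listening ports are controlled by ordinary Spark properties; a hedged sketch of the kind of entries involved (the exact set depends on your deployment, and the port numbers here are arbitrary examples).

    # conf/spark-defaults.conf -- pin the otherwise random listening ports
    spark.driver.port         7001
    spark.blockManager.port   7003
    spark.fileserver.port     7005
    spark.broadcast.port      7007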

RE: Shuffle to HDFS

2015-01-25 Thread Shao, Saisai
Hi Larry, I don't think Spark's current shuffle can support HDFS as a shuffle output. Anyway, is there any specific reason to spill shuffle data to HDFS or NFS? This will severely increase the shuffle time. Thanks Jerry From: Larry Liu [mailto:larryli...@gmail.com] Sent: Sunday, January 25, 20

Re: spark streaming with checkpoint

2015-01-25 Thread Balakrishnan Narendran
Yeah, use streaming to gather the incoming logs and write them to a log file, then run a Spark job every 5 minutes to process the counts. Got it. Thanks a lot. On 07:07, Mon, 26 Jan 2015 Tobias Pfeiffer wrote: > Hi, > > On Tue, Jan 20, 2015 at 8:16 PM, balu.naren wrote: > >> I am a beginner to spark

Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-25 Thread Charles Feduke
I've got my solution working: https://gist.github.com/cfeduke/3bca88ed793ddf20ea6d I couldn't actually perform the steps I outlined in the previous message in this thread because I would ultimately be trying to serialize a SparkContext to the workers to use during the generation of 1..n JdbcRDD
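
For readers without access to the gist, a minimal self-contained JdbcRDD sketch; the driver class, connection URL, table, and bounds below are placeholders, and the query must contain the two '?' bind parameters that JdbcRDD substitutes with partition bounds.

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val jdbcRdd = new JdbcRDD(
      sc,
      () => {
        Class.forName("org.postgresql.Driver")                    // hypothetical driver
        DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
      },
      "SELECT id, name FROM events WHERE id >= ? AND id <= ?",    // hypothetical query
      1L, 1000000L,   // lower/upper bound of the partition column
      4,              // number of partitions
      (rs: ResultSet) => (rs.getLong("id"), rs.getString("name"))
    )
    jdbcRdd.count()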

Lost task - connection closed

2015-01-25 Thread octavian.ganea
Hi, I am running a program that executes map-reduce jobs in a loop. The first time the loop runs, everything is ok. After that, it starts giving the following error, first it gives it for one task, then for more tasks and eventually the entire program fails: 15/01/26 01:41:25 WARN TaskSetManager:

Re: Eclipse on spark

2015-01-25 Thread Jörn Franke
I recommend using a build tool within Eclipse, such as Gradle or Maven. On 24 Jan 2015 19:34, "riginos" wrote: > How to compile a Spark project in Scala IDE for Eclipse? I got many scala > scripts and i no longer want to load them from scala-shell what can i do? > > > > -- > View this message

Re: Shuffle to HDFS

2015-01-25 Thread Larry Liu
Hi Jerry, Thanks for your reply. The reason I have this question is that in Hadoop, mapper intermediate output (shuffle) will be stored in HDFS. I think the default location for Spark is /tmp. Larry On Sun, Jan 25, 2015 at 9:44 PM, Shao, Saisai wrote: > Hi Larry, > > > > I don’t think

Re: Lost task - connection closed

2015-01-25 Thread Aaron Davidson
Please take a look at the executor logs (on both sides of the IOException) to see if there are other exceptions (e.g., OOM) which precede this one. Generally, the connections should not fail spontaneously. On Sun, Jan 25, 2015 at 10:35 PM, octavian.ganea wrote: > Hi, > > I am running a program t

Re: where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Larry Liu
Hi Charles, Thanks for your reply. Is it possible to persist an RDD to HDFS? What is the default location to persist an RDD with storage level DISK_ONLY? On Sun, Jan 25, 2015 at 6:26 AM, Charles Feduke wrote: > I think you want to instead use `.saveAsSequenceFile` to save an RDD to > someplace like H

RE: Shuffle to HDFS

2015-01-25 Thread Shao, Saisai
Hey Larry, I don't think Hadoop will put shuffle output in HDFS; instead its behavior is the same as Spark's: mapper output (shuffle) data is stored on local disks. You might have misunderstood something ☺. Thanks Jerry From: Larry Liu [mailto:larryli...@gmail.com] Sent: Monday, January 26, 201

RE: where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Shao, Saisai
No, the current RDD persistence mechanism does not support putting data on HDFS. The directory is controlled by spark.local.dirs. Instead you can use checkpoint() to save the RDD on HDFS. Thanks Jerry From: Larry Liu [mailto:larryli...@gmail.com] Sent: Monday, January 26, 2015 3:08 PM To: Charles Feduke Cc: u...@s
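
A minimal sketch of the checkpoint() alternative Jerry mentions; the HDFS URI is a placeholder, and note that checkpointing recomputes the RDD unless it is also cached.

    sc.setCheckpointDir("hdfs://namenode:8020/user/spark/checkpoints")  // hypothetical path

    val rdd = sc.parallelize(1 to 1000000).map(_ * 2).cache()
    rdd.checkpoint()    // materialized to HDFS when the next action runs
    rdd.count()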

No AMI for Spark 1.2 using ec2 scripts

2015-01-25 Thread hajons
Hi, When I try to launch a standalone cluster on EC2 using the scripts in the ec2 directory for Spark 1.2, I get the following error: Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm It seems there is not yet any AMI available on EC2. Any ideas when the

Announcement: Generalized K-Means Clustering on Spark

2015-01-25 Thread derrickburns
This project generalizes the Spark MLlib K-Means clusterer to support clustering of dense or sparse, low- or high-dimensional data using distance functions defined by Bregman divergences. https://github.com/derrickburns/generalized-kmeans-clustering -- View this message in context: http://apach