Hi,
I built and started a single node standalone Spark 1.2.0 cluster along with a
single node Hive 0.14.0 instance installed by Ambari 1.17.0. On the Spark and
Hive node I can create and query tables inside Hive, and on remote machines I
can submit the SparkPi example to the Spark master. But I
I would like to persist an RDD to HDFS or an NFS mount. How do I change the
location?
How do I change the shuffle output to HDFS or NFS?
Hi Nicholas,
Thanks for your reply. I checked spark-redshift - it only handles UNLOAD data
files stored on Hadoop, not live result sets from the database.
Do you know of any example of a custom RDD which fetches the data on the fly
(not reading from HDFS)?
Thanks.
Denis
From: Nicholas Chammas
Hi,
If a vertex has no incoming edges, then Spark's GraphOps `inDegrees` does not
return it at all. Instead, it would be very useful to me to have that
vertex returned with an in-degree of zero.
What's the best way to achieve this using the GraphX API?
For example, given a graph with nodes A,B
I think you want to instead use `.saveAsSequenceFile` to save an RDD to
someplace like HDFS or NFS if you are attempting to interoperate with
another system, such as Hadoop. `.persist` is for keeping the contents of
an RDD around so future uses of that particular RDD don't need to
recalculate its c
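A rough sketch of the distinction (paths and names here are illustrative, not from the original thread; on Spark 1.2 the SparkContext._ import supplies the Writable conversions):
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

val pairs = sc.textFile("hdfs:///data/input.txt").map(line => (line.length.toLong, line))

// To interoperate with another system (Hadoop, etc.): write the data out.
pairs.saveAsSequenceFile("hdfs:///data/lengths-seq")

// To reuse within this Spark application only: keep the computed RDD on local disk.
pairs.persist(StorageLevel.DISK_ONLY)
pairs.count()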
I'm facing a similar problem except my data is already pre-sharded in
PostgreSQL.
I'm going to attempt to solve it like this:
- Submit the shard names (database names) across the Spark cluster as a
text file and partition it so workers get 0 or more - hopefully 1 - shard
name. In this case you co
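A rough sketch of that plan (all hostnames, paths, credentials and queries below are hypothetical):
import java.sql.DriverManager

// One shard (database) name per line; ask for enough partitions that each
// worker ideally ends up with a single shard name.
val shardNames = sc.textFile("hdfs:///config/shards.txt", 8)

val rows = shardNames.mapPartitions { shards =>
  shards.flatMap { shard =>
    val conn = DriverManager.getConnection(
      s"jdbc:postgresql://db-host/$shard", "user", "password")
    try {
      val rs = conn.createStatement().executeQuery("SELECT id, payload FROM events")
      val buf = scala.collection.mutable.ArrayBuffer[(Long, String)]()
      while (rs.next()) buf += ((rs.getLong("id"), rs.getString("payload")))
      buf
    } finally conn.close()
  }
}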
Thanks Michael.
A clarification: does the HQL dialect provided by HiveContext use the
Catalyst optimizer? I thought HiveContext was only related to Hive
integration in Spark!
I would be grateful if you could clarify this.
cheers
On Sun, Jan 25, 2015 at 1:23 AM, Michael Armbrust
wrote:
> I gener
Can someone help me understand the usage of the "foreachActive" function
introduced for Vectors?
I am trying to understand its usage in the MultivariateOnlineSummarizer class
for summary statistics.
sample.foreachActive { (index, value) =>
  if (value != 0.0) {
    if (currMax(index) < value) {
Is the distributed SVD functionality exposed to Python yet?
It seems it's only available in Scala or Java, unless I am missing something.
I'm looking for a PySpark equivalent to
org.apache.spark.mllib.linalg.SingularValueDecomposition
In case it's not there yet, is there a way to make a wrapper to call
Hi Andreas,
With regard to the notebook interface, you can use the Spark Kernel (
https://github.com/ibm-et/spark-kernel) as the backend for an IPython 3.0
notebook. The kernel is designed to be the foundation for interactive
applications connecting to Apache Spark and uses the IPython 5.0 messag
The idea is to unify the code path for dense and sparse vector operations,
which makes the codebase easier to maintain. By handling (index, value)
tuples, you can let the foreachActive method take care of checking if the
vector is sparse or dense, and running a foreach over the values.
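As a rough illustration of the pattern (note that foreachActive is package-private to Spark in some releases, so treat this as a sketch rather than guaranteed public API):
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val dense  = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

def dump(v: Vector): Unit =
  v.foreachActive { (index, value) =>
    // For the sparse vector only the stored entries are visited;
    // for the dense vector every entry is visited.
    println(s"$index -> $value")
  }

dump(dense)
dump(sparse)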
On Sun, Jan
You can do this using leftJoin, as collectNeighbors [1] does:
graph.vertices.leftJoin(graph.inDegrees) {
  (vid, attr, inDegOpt) => inDegOpt.getOrElse(0)
}
[1]
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala#L145
Ankur
On Sun, Jan 25, 2
Hi Yana,
As per my custom split code, only three splits are submitted to the system, so
three executors should be sufficient. But it ran 8 executors. The first
three executors' logs show the exact output I want (I put some sysout calls
in the code to debug it), but the next five have some other an
Yeah, the HiveContext is just a SQLContext that is extended with HQL,
access to a metastore, Hive UDFs and Hive SerDes. The query execution,
however, is identical to a SQLContext.
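A minimal sketch of what that looks like in practice (Spark 1.2-era API; assumes a Hive table named src already exists in the metastore):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hql-example"))
val hiveContext = new HiveContext(sc)

// Parsed as HQL, but planned and executed by the same Catalyst / Spark SQL
// machinery that a plain SQLContext uses.
val counts = hiveContext.sql("SELECT key, count(*) FROM src GROUP BY key")
counts.collect().foreach(println)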
On Sun, Jan 25, 2015 at 7:24 AM, Niranda Perera
wrote:
> Thanks Michael.
>
> A clarification. So the HQL dialect prov
This happened to me as well; putting hive-site.xml inside conf doesn't seem to
work. Instead I added /etc/hive/conf to SPARK_CLASSPATH and it worked. You can
try this approach.
-Skanda
-Original Message-
From: "guxiaobo1982"
Sent: 25-01-2015 13:50
To: "user@spark.apache.org"
Subjec
Hi everyone,
I'm writing a program that updates a Cassandra table.
I've written a first version where I update the table row by row from an RDD
through a map.
Now I want to build a batch of updates using the same kind of syntax as in
this thread:
https://groups.google.com/forum/#!msg/spark-users/LUb
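Roughly, the kind of thing I'm after is per-partition batching with the DataStax Java driver; a hedged sketch (keyspace, table, host and the shape of the RDD are all made up for illustration):
import com.datastax.driver.core.{BatchStatement, Cluster}

// rdd: RDD[(String, String)] of (id, value) pairs to write back
rdd.foreachPartition { rows =>
  val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
  val session = cluster.connect("my_keyspace")
  val update  = session.prepare("UPDATE my_table SET value = ? WHERE id = ?")
  // In practice you would cap the batch size (e.g. flush every few hundred rows).
  val batch = new BatchStatement()
  rows.foreach { case (id, value) => batch.add(update.bind(value, id)) }
  session.execute(batch)
  cluster.close()
}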
Hi,
I have a similar problem. I want to see the detailed logs of Completed
Applications, so I've set in my program:
set("spark.eventLog.enabled","true").
set("spark.eventLog.dir","file:/tmp/spark-events")
but when I click on the application in the web UI, I get a page with the
message:
Application
Download the pre-built binary for Windows, attach all the required jars to your
project's Eclipse classpath, and go ahead with Eclipse. Make sure you have the
same Java version.
On 25 January 2015 at 07:33, riginos [via Apache Spark User List] <
ml-node+s1001560n21350...@n3.nabble.com> wrote:
> How to com
Spark Experts,
I've got a list of points: List[(Float, Float)] that represent (x,y)
coordinate pairs, and I need to sum the distance. It's easy enough to compute the
distance:
import scala.math.{pow, sqrt}
case class Point(x: Float, y: Float) {
  def distance(other: Point): Float =
    sqrt(pow(x - other.x, 2) + pow(y - other.y, 2)).toFloat
}
Hi,
On Mon, Jan 26, 2015 at 9:32 AM, Steve Nunez wrote:
> I’ve got a list of points: List[(Float, Float)]) that represent (x,y)
> coordinate pairs and need to sum the distance. It’s easy enough to compute
> the distance:
>
Are you saying you want all combinations (N^2) of distances? That shoul
So you’ve got a point A and you want the sum of distances between it and all
other points? Or am I misunderstanding you?
// target point; could be sent to all workers as a broadcast variable
val tarPt = (10, 20)
val pts = Seq((2,2), (3,3), (2,3), (10,2))
val rdd = sc.parallelize(pts)
rdd.map(pt => Math.sqrt(Math.pow(pt._1 - tarPt._1, 2) + Math.pow(pt._2 - tarPt._2, 2))).sum()
Not combinations, linear distances, e.g., given: List[ (x1,y1), (x2,y2),
(x3,y3) ], compute the sum of:
the distance from (x1,y1) to (x2,y2), plus
the distance from (x2,y2) to (x3,y3).
Imagine that the list of coordinate points comes from a GPS and describes a trip.
- Steve
From: Joseph Lust mailto:jl...@mc10inc.
If this is really about just Scala Lists, then a simple answer (using
tuples of doubles) is:
val points: List[(Double,Double)] = ...
val distances = for (p1 <- points; p2 <- points) yield {
  val dx = p1._1 - p2._1
  val dy = p1._2 - p2._2
  math.sqrt(dx*dx + dy*dy)
}
distances.sum / 2
It's "/ 2" because the Cartesian product counts each pair of points twice.
Aaron,
On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson wrote:
> Scala for-loops are implemented as closures using anonymous inner classes
> which are instantiated once and invoked many times. This means, though,
> that the code inside the loop is actually sitting inside a class, which
> confuses
(PS the Scala code I posted is a poor way to do it -- it would
materialize the entire cartesian product in memory. You can use
.iterator or .view to fix that.)
Ah, so you want the sum of distances between successive points.
val points: List[(Double,Double)] = ...
points.sliding(2).map { case List(p1, p2) =>
  math.sqrt(math.pow(p1._1 - p2._1, 2) + math.pow(p1._2 - p2._2, 2))
}.sum
Sean,
On Mon, Jan 26, 2015 at 10:28 AM, Sean Owen wrote:
> Note that RDDs don't really guarantee anything about ordering though,
> so this only makes sense if you've already sorted some upstream RDD by
> a timestamp or sequence number.
>
Speaking of order, is there some reading on guarantees an
Hi,
On Tue, Jan 20, 2015 at 8:16 PM, balu.naren wrote:
> I am a beginner to spark streaming. So have a basic doubt regarding
> checkpoints. My use case is to calculate the no of unique users by day. I
> am using reduce by key and window for this. Where my window duration is 24
> hours and slide
Hi,
On Thu, Jan 22, 2015 at 2:26 AM, Corey Nolet wrote:
> Let's say I have 2 formats for json objects in the same file
> schema1 = { "location": "12345 My Lane" }
> schema2 = { "location":{"houseAddres":"1234 My Lane"} }
>
> From my tests, it looks like the current inferSchema() function will en
Perhaps you need to set this in your spark-defaults.conf so that it's
already set when your slave/worker processes start.
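For example, in spark-defaults.conf (same properties and path as in your snippet):
spark.eventLog.enabled   true
spark.eventLog.dir       file:/tmp/spark-events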
-Joe
On 1/25/15, 6:50 PM, "ilaxes" wrote:
>Hi,
>
>I've a similar problem. I want to see the detailed logs of Completed
>Applications so I've set in my program :
>set("spar
PS, we were using Breeze's activeIterator originally, as you can see in
the old code, but we found there was overhead there, so we wrote our own
implementation, which is 4x faster. See
https://github.com/apache/spark/pull/3288 for details.
Sincerely,
DB Tsai
Can anyone please let me know?
I don't want to open all ports on the network, so I am interested in the property
with which I can configure this new port.
Shailesh
This was a regression caused by Netty Block Transfer Service. The fix for
this just barely missed the 1.2 release, and you can see the associated
JIRA here: https://issues.apache.org/jira/browse/SPARK-4837
Current master has the fix, and the Spark 1.2.1 release will have it
included. If you don't
Hi Larry,
I don't think Spark's current shuffle can support HDFS as shuffle output.
Anyway, is there any specific reason to spill shuffle data to HDFS or NFS? This
would severely increase the shuffle time.
Thanks
Jerry
From: Larry Liu [mailto:larryli...@gmail.com]
Sent: Sunday, January 25, 20
Yeah, use streaming to gather the incoming logs and write them to a log file, then
run a Spark job every 5 minutes to process the counts. Got it. Thanks a
lot.
On 07:07, Mon, 26 Jan 2015 Tobias Pfeiffer wrote:
> Hi,
>
> On Tue, Jan 20, 2015 at 8:16 PM, balu.naren wrote:
>
>> I am a beginner to spark
I've got my solution working:
https://gist.github.com/cfeduke/3bca88ed793ddf20ea6d
I couldn't actually perform the steps I outlined in my previous message in
this thread because I would ultimately be trying to serialize a
SparkContext to the workers to use during the generation of 1..n JdbcRDD
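For reference, a minimal JdbcRDD sketch (Spark 1.2 API; the connection details, table and bounds below are made up) looks roughly like:
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val jdbcRdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://db-host/shard1", "user", "password"),
  // The query must contain two '?' placeholders for the partition bounds.
  "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
  1L, 1000000L, 10,
  (rs: ResultSet) => (rs.getLong("id"), rs.getString("payload")))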
Hi,
I am running a program that executes map-reduce jobs in a loop. The first
time the loop runs, everything is ok. After that, it starts giving the
following error, first it gives it for one task, then for more tasks and
eventually the entire program fails:
15/01/26 01:41:25 WARN TaskSetManager:
I recommend using a build tool within Eclipse, such as Gradle or Maven.
On 24 Jan 2015 at 19:34, "riginos" wrote:
> How to compile a Spark project in Scala IDE for Eclipse? I have many Scala
> scripts and I no longer want to load them from the scala shell; what can I do?
>
>
>
Hi Jerry,
Thanks for your reply.
The reason I have this question is that in Hadoop, mapper intermediate
output (shuffle) will be stored in HDFS. I think the default location for
Spark is /tmp.
Larry
On Sun, Jan 25, 2015 at 9:44 PM, Shao, Saisai wrote:
> Hi Larry,
>
>
>
> I don’t think
Please take a look at the executor logs (on both sides of the IOException)
to see if there are other exceptions (e.g., OOM) which precede this one.
Generally, the connections should not fail spontaneously.
On Sun, Jan 25, 2015 at 10:35 PM, octavian.ganea wrote:
> Hi,
>
> I am running a program t
Hi, Charles
Thanks for your reply.
Is it possible to persist an RDD to HDFS? What is the default location when
persisting an RDD with StorageLevel DISK_ONLY?
On Sun, Jan 25, 2015 at 6:26 AM, Charles Feduke
wrote:
> I think you want to instead use `.saveAsSequenceFile` to save an RDD to
> someplace like H
Hey Larry,
I don't think Hadoop will put shuffle output in HDFS; instead, its behavior is
the same as Spark's: it stores mapper output (shuffle) data on local disks.
You might have misunderstood something ☺.
Thanks
Jerry
From: Larry Liu [mailto:larryli...@gmail.com]
Sent: Monday, January 26, 201
No, the current RDD persistence mechanism does not support putting data on HDFS.
The directory is controlled by spark.local.dir.
Instead you can use checkpoint() to save the RDD on HDFS.
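For example (paths are illustrative):
// Checkpointing writes the RDD's data under the configured HDFS directory.
sc.setCheckpointDir("hdfs:///user/larry/checkpoints")
val lengths = sc.textFile("hdfs:///data/input").map(_.length)
lengths.checkpoint()   // materialized under the checkpoint dir at the next action
lengths.count()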
Thanks
Jerry
From: Larry Liu [mailto:larryli...@gmail.com]
Sent: Monday, January 26, 2015 3:08 PM
To: Charles Feduke
Cc: u...@s
Hi,
When I try to launch a standalone cluster on EC2 using the scripts in the
ec2 directory for Spark 1.2, I get the following error:
Could not resolve AMI at:
https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm
It seems there is not yet any AMI available on EC2. Any ideas when the
This project generalizes the Spark MLlib K-Means clusterer to support
clustering of dense or sparse, low- or high-dimensional data using distance
functions defined by Bregman divergences.
https://github.com/derrickburns/generalized-kmeans-clustering