I'm not aware of a way to do that. CTRL-C is generally handled by the
terminal emulator and just sends a SIGINT to the process; and the JVM
doesn't allow you to ignore SIGINT. Setting a signal handler on the
parent process doesn't work either.
And there are other combinations you need to care about.
BTW this might be useful:
http://superuser.com/questions/160388/change-bash-shortcut-keys-such-as-ctrl-c
On Fri, Mar 13, 2015 at 10:53 AM, Sean Owen so...@cloudera.com wrote:
This is more of a function of your shell, since ctrl-C will send
SIGINT. I'm sure you can change that, depending on your
Hi All,
I am very new to the Spark world. I just started some test coding last week. I
am using spark-1.2.1-bin-hadoop2.4 and coding in Scala.
I am having issues while using the Date and decimal data types. Following is my
code, which I am simply running at the scala prompt. I am trying to define a table
and
Hi,
I would suggest you use LBFGS, as I think the step size is hurting you. You
can run the same thing in LBFGS as:
```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SimpleUpdater}

val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
val initialWeights = Vectors.dense(Array.fill(3)(scala.util.Random.nextDouble()))
// data is assumed to be an RDD[(Double, Vector)] of (label, features)
val weights = algorithm.optimize(data, initialWeights)
```
This is more of a function of your shell, since ctrl-C will send
SIGINT. I'm sure you can change that, depending on your shell or
terminal, but it's not a Spark-specific answer.
On Fri, Mar 13, 2015 at 5:44 PM, Adamantios Corais
adamantios.cor...@gmail.com wrote:
this doesn't solve my problem...
Are you running the code before or after closing the spark context? It must
be run after stopping the streaming context (without stopping the spark context) and
before stopping the spark context.
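For illustration, a minimal sketch of that ordering (ssc and sc are assumed to be the running StreamingContext and SparkContext):
```
ssc.stop(stopSparkContext = false)   // stop streaming only; the SparkContext stays alive
// run the post-processing code here, while sc is still usable
sc.stop()
```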
Cc'ing Sean. He may have more insights.
On Mar 13, 2015 11:19 AM, Jose Fernandez jfernan...@sdl.com wrote:
Thanks, I am
Hi,
I'm working on an RDD of a tuple of objects which represent trees (a Node
containing a hashmap of nodes). I'm trying to aggregate these trees over the
RDD.
Let's take an example, 2 graphs:
C - D - B - A - D - B - E
F - E - B - A - D - B - F
I'm splitting each graph according to the vertex A
Hum, I increased it to 1024 but it doesn't help, I still have the same problem :(
2015-03-13 18:28 GMT+01:00 Eugen Cepoi cepoi.eu...@gmail.com:
The default one, 0.07 of executor memory. I'll try increasing it and
post back the result.
Thanks
2015-03-13 18:09 GMT+01:00 Ted Yu yuzhih...@gmail.com:
Kudos to the whole team for such a significant achievement!
On Fri, Mar 13, 2015 at 10:00 AM, Patrick Wendell pwend...@gmail.com
wrote:
Hi All,
I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is
the fourth release on the API-compatible 1.X line. It is Spark's
largest
You can type :quit.
On Fri, Mar 13, 2015 at 10:29 AM, Adamantios Corais
adamantios.cor...@gmail.com wrote:
Hi,
I want to change the default combination of keys that exits the Spark shell
(i.e. CTRL + C) to something else, such as CTRL + H.
Thank you in advance.
// Adamantios
--
Marcelo
Hello,
While running an SVMClassifier in spark, is there any way to print out
the most important features the model selected? I see there's a way to
print out the weights but I am not sure how to associate them with the
input features, by name or in any other way. Any help would be
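For what it's worth, a minimal sketch of one way to do this, assuming you keep an array of feature names in the same order as the training vector indices (featureNames and trainingData here are placeholders):
```
import org.apache.spark.mllib.classification.SVMWithSGD

val model = SVMWithSGD.train(trainingData, 100)          // trainingData: RDD[LabeledPoint]
val featureNames: Array[String] = ???                    // one name per vector index, supplied by you
val ranked = featureNames.zip(model.weights.toArray)
  .sortBy { case (_, w) => -math.abs(w) }                // largest |weight| first
ranked.take(10).foreach { case (name, w) => println(f"$name%-20s $w%+.4f") }
```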
Hi,
I want to change the default combination of keys that exits the Spark shell
(i.e. CTRL + C) to something else, such as CTRL + H.
Thank you in advance.
*// Adamantios*
Might be related: what's the value for spark.yarn.executor.memoryOverhead ?
See SPARK-6085
Cheers
On Fri, Mar 13, 2015 at 9:45 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:
Hi,
I have a job that hangs after upgrading to spark 1.2.1 from 1.1.1. Strange
thing, the exact same code does work
Awesome! Thanks!
*Sebastián Ramírez*
Head of Software Development
http://www.senseta.com
Tel: (+571) 795 7950 ext: 1012
Cel: (+57) 300 370 77 10
Calle 73 No 7 - 06 Piso 4
Linkedin: co.linkedin.com/in/tiangolo/
Twitter: @tiangolo https://twitter.com/tiangolo
Email:
Hi
I have an RDD: RDD[(String, scala.Iterable[(Long, Int)])] which I want to print
into a file, a file for each key string.
I tried to trigger a repartition of the RDD by doing group by on it. The
grouping gives RDD[(String, scala.Iterable[Iterable[(Long, Int)]])] so I
flattened that:
Do you have a small test case that can reproduce the out of memory error ?
I have also seen some errors on large scale experiments but haven't managed
to narrow it down.
Thanks
Shivaram
On Fri, Mar 13, 2015 at 6:20 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
It runs faster but there is some
The default one, 0.07 of executor memory. I'll try increasing it and post
back the result.
Thanks
2015-03-13 18:09 GMT+01:00 Ted Yu yuzhih...@gmail.com:
Might be related: what's the value for spark.yarn.executor.memoryOverhead ?
See SPARK-6085
Cheers
On Fri, Mar 13, 2015 at 9:45 AM,
Thanks, I am using 1.2.0 so it looks like I am affected by the bug you
described.
It also appears that the shutdown hook doesn’t work correctly when the driver
is running in YARN. According to the logs it looks like the SparkContext is
closed and the code you suggested is never executed and
Thanks a lot Burak, that helped.
On Fri, Mar 13, 2015 at 1:55 PM, Burak Yavuz brk...@gmail.com wrote:
Hi,
I would suggest you use LBFGS, as I think the step size is hurting you.
You can run the same thing in LBFGS as:
```
val algorithm = new LBFGS(new LeastSquaresGradient(), new
I am running Spark Streaming standalone in EC2 and I am trying to run the
wordcount example from my desktop. The program is unable to connect to the
master; in the logs I see the following, which seems to be an issue with the hostname.
15/03/13 17:37:44 ERROR EndpointWriter: dropping message [class
I am running LogisticRegressionWithLBFGS. I got these lines on my console:
2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch |
Encountered bad values in function evaluation. Decreasing step size to 0.5
2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch |
Markus,
Thanks. That makes sense. I was able to get this to work with spark-shell
passing in the git built jar. I did notice that I couldn't get
AvroSaver.save to work with SQLContext, but it works with HiveContext. Not
sure if that is an issue, but for me, it is fine.
Once again, thanks for
Are you running the driver program (that is, your application process) on
your desktop and trying to run it on the cluster in EC2?
It could very well be a hostname mismatch in some way due to all the
public hostname, private hostname, private IP, firewall craziness of EC2.
You have to probably
All,
Does anyone have any reference to a publication or other informal sources
(blogs, notes) showing
the performance of Spark on HDFS vs. other shared (Lustre, etc.) or other file
systems (NFS)?
I need this for formal performance research.
We are currently doing research into this on a very
Is the number of top K elements you want to keep small? That is, is K
small? In which case, you can
1. either do it in the driver on the array
DStream.foreachRDD ( rdd => {
val topK = rdd.top(K) ;
// use top K
})
2. Or, you can use the topK to create another RDD using sc.makeRDD
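For illustration, a minimal sketch of option 2, assuming K is small enough for the per-batch array to fit on the driver:
```
val topKStream = dstream.transform { rdd =>
  // rdd.top(K) runs a job and returns an array to the driver; makeRDD turns it back into an RDD
  rdd.context.makeRDD(rdd.top(K), 1)
}
```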
I probably did not do a good enough job explaining the problem. If
you used Maven with the
default Maven repository you have an old version of spark-avro that does
not contain AvroSaver and does not have the saveAsAvro method implemented:
Assuming you use the default Maven repo location:
cd
I am trying to process events from a Flume Avro sink, but I keep getting this
same error. I am just running it locally using Flume's avro-client, with the
following commands to start the job and the client. It seems like it should be
a configuration problem since it's a NoSuchMethodError, but
Hi Everyone,
Thank you for your suggestions; based on them I was able to move forward.
I am now generating a Geohash for all the lats and lons in our reference
data and then creating a trie of all the Geohashes.
I am then broadcasting that trie and then using it to search for the nearest
Geohash for
If you want to access the keys in an RDD that is partitioned by key, then you
can use RDD.mapPartitions(), which gives you access to the whole partition
as an Iterator[(key, value)]. You have the option of maintaining the
partitioning information or not by setting the preservesPartitioning flag in
mapPartitions().
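For illustration, a minimal sketch (the pair RDD and its types are placeholders):
```
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD implicits, needed on Spark 1.2

val byKey = pairs.partitionBy(new HashPartitioner(8))      // pairs: RDD[(String, Int)]
val mapped = byKey.mapPartitions(
  iter => iter.map { case (k, v) => (k, v * 2) },
  preservesPartitioning = true)                            // keep the existing partitioner
```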
I still don't follow how Spark partitions data in a multi-node
environment. Is there a document on how Spark does partitioning of data? For
example, in the word count example, how does Spark distribute words to multiple nodes?
On Fri, Mar 13, 2015 at 3:01 PM, Tathagata Das t...@databricks.com wrote:
If you
When I ran a Spark SQL query (a simple group-by query) via Hive support, I
saw lots of failures in the map phase.
I am not sure if that is specific to Spark SQL or general.
Has anyone seen such errors before?
java.io.IOException: org.apache.spark.SparkException: Error sending message
[message
BTW, I'll add that we are hoping to publish a new version of the Avro
library for Spark 1.3 shortly. It should have improved support for writing
data both programmatically and from SQL.
On Fri, Mar 13, 2015 at 2:01 PM, Kevin Peng kpe...@gmail.com wrote:
Markus,
Thanks. That makes sense. I
Hi
I was running a logistic regression algorithm on an 8-node spark cluster;
each node has 8 cores and 56 GB RAM (each node is running a Windows
system), and the spark installation drive has 1.9 TB capacity. The dataset
I was training on has around 40 million records with around 6600
Thanks Daoyuan.
What do you mean by running some native command? I never thought that Hive would
run without a computing engine like Hadoop MR or Spark. Thanks.
bit1...@163.com
From: Wang, Daoyuan
Date: 2015-03-13 16:39
To: bit1...@163.com; user
Subject: RE: Explanation on the Hive in the
Seems you want to use array for the field of providers, like
providers:[{id:
...}, {id:...}] instead of providers:{{id: ...}, {id:...}}
On Fri, Mar 13, 2015 at 7:45 PM, kpeng1 kpe...@gmail.com wrote:
Hi All,
I was noodling around with loading in a json file into spark sql's hive
context and
Yin,
Yup thanks. I fixed that shortly after I posted and it worked.
Thanks,
Kevin
On Fri, Mar 13, 2015 at 8:28 PM, Yin Huai yh...@databricks.com wrote:
Seems you want to use array for the field of providers, like
providers:[{id:
...}, {id:...}] instead of providers:{{id: ...}, {id:...}}
Have you registered your class with kryo ?
See core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala
and core/src/test/scala/org/apache/spark/SparkConfSuite.scala
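For illustration, a minimal sketch of the registration (the class names are placeholders for your own node/tree classes):
```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Node], classOf[Tree]))   // your own classes here
val sc = new SparkContext(conf)
```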
On Fri, Mar 13, 2015 at 10:52 AM, ilaxes ila...@hotmail.com wrote:
Hi,
I'm working on a RDD of a tuple of objects
Hi, there
You may want to check your hbase config.
e.g. the following property can be changed to /hbase
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase-unsecure</value>
</property>
fightf...@163.com
From: HARIPRIYA AYYALASOMAYAJULA
Date: 2015-03-14 10:47
To: user
In HBaseTest.scala:
val conf = HBaseConfiguration.create()
You can add some log (for zookeeper.znode.parent, e.g.) to see if the
values from hbase-site.xml are picked up correctly.
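For example, a minimal sketch of such a check (assuming hbase-site.xml and the HBase client jars are on the classpath):
```
import org.apache.hadoop.hbase.HBaseConfiguration

val conf = HBaseConfiguration.create()
// prints /hbase-unsecure (or whatever hbase-site.xml defines); the default /hbase means the file wasn't picked up
println("zookeeper.znode.parent = " + conf.get("zookeeper.znode.parent"))
```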
Please use pastebin next time you want to post errors.
Which Spark release are you using ?
I assume it contains
I am a PhD student trying to understand the internals of Spark, so that I can
make some modifications to it. I am trying to understand how the aggregation
of the distributed datasets (through the network) onto the driver node works.
I would very much appreciate it if someone could point me towards
Hi, sparkers,
When I read the code for computing resource allocation for a newly
submitted application in the Master#schedule method, I got a question about
data locality:
// Pack each app into as few nodes as possible until we've assigned all its
cores
for (worker <- workers if
If you want to learn about how Spark partitions the data based on keys,
here is a recent talk about that
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs?related=1
Of course you can read the original Spark paper
Hi all,
I've been using Spark 1.1.0 for a while, and now would like to upgrade to
Spark 1.1.1 or above. However, it throws the following errors:
18:05:31.522 [sparkDriver-akka.actor.default-dispatcher-3hread] ERROR
TaskSchedulerImpl - Lost executor 37 on hcompute001: remote Akka client
It is during function evaluation in the line search; the value is either
infinite or NaN, which may be caused by too large a step size. In the code, the step
is reduced to half.
Thanks.
Zhan Zhang
On Mar 13, 2015, at 2:41 PM, cjwang c...@cjwang.us wrote:
I am running LogisticRegressionWithLBFGS.
Hi All,
I am trying to run a sort on an r3.2xlarge instance on AWS. I am just running it
as a single-node cluster for a test. The data I use to sort is around 4GB and
sits on S3; the output will also be on S3.
I just connect spark-shell to the local cluster and run the code in the
script (because I just
In Spark Streaming, the consumers will fetch data and put it into 'blocks'.
Each block becomes a partition of the RDD generated during that batch
interval.
The size of each block is controlled by the conf
'spark.streaming.blockInterval', that is, the amount of data the consumer
can collect in
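For illustration, a minimal sketch of the relationship (the interval values are just an example):
```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// With a 2-second batch interval and a 200 ms block interval, each receiver produces
// roughly 2000 / 200 = 10 blocks, i.e. about 10 partitions per batch RDD.
val conf = new SparkConf().setMaster("local[2]").setAppName("block-interval")
  .set("spark.streaming.blockInterval", "200")   // milliseconds in Spark 1.x
val ssc = new StreamingContext(conf, Seconds(2))
```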
Hi,
I am playing around with Spark SQL 1.3 and noticed that the max function does
not give the correct result, i.e. it doesn't give the maximum value. The same
query works fine in Spark SQL 1.2.
Is anyone aware of this issue?
Regards,
Gaurav
Hi All,
I was noodling around with loading in a json file into spark sql's hive
context and I noticed that I get the following message after loading in the
json file:
PhysicalRDD [_corrupt_record#0], MappedRDD[5] at map at JsonRDD.scala:47
I am using the HiveContext to load in the json file
Please try the following command:
mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package
Cheers
On Fri, Mar 13, 2015 at 5:57 PM, sequoiadb mailing-list-r...@sequoiadb.com
wrote:
guys, is there any easier way to build all modules by mvn ?
right now if I run “mvn package” in spark root
And one thing I forgot to mention: even though I get this exception and the result
is not well formed in my target folder (part of the files are there, the rest are
under a different folder structure inside the _temporary folder), in the web UI of
spark-shell it is still marked as a successful step. I think this is a
bug?
We have many text files that we need to read in parallel. We can create a comma
delimited list of files to pass in to sparkContext.textFile(fileList). The list
can get very large (maybe 1) and is all on hdfs.
The question is: what is the most performant way to read them? Should they be
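For reference, a minimal sketch of the comma-delimited approach described above (the paths are placeholders):
```
val fileList = Seq(
  "hdfs:///data/part-0001.txt",
  "hdfs:///data/part-0002.txt").mkString(",")
val lines = sc.textFile(fileList)   // textFile also accepts globs such as hdfs:///data/*.txt
```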
guys, is there any easier way to build all modules by mvn ?
right now if I run “mvn package” in spark root directory I got:
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [ 8.327 s]
[INFO] Spark Project Networking ...
Sorry, forgot to attach the traceback.
Regards
Rene Castberg
From: Castberg, René Christian
Sent: 13 March 2015 07:13
To: user@spark.apache.org
Cc: gen tang
Subject: RE: Pyspark Hbase scan.
Hi,
I have now successfully managed to test this in a local spark
Hi,
I have now successfully managed to test this in a local spark session.
But I am having a huge problem getting this to work with the Hortonworks
technical preview. I think that there is an incompatibility with the way YARN
has been compiled.
After changing the hbase version, and
That totally depends on your data size and your cluster setup.
Thanks
Best Regards
On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal udbhav.agar...@syncoms.com
wrote:
Hi,
What is the query time for a join query on HBase with Spark SQL? Say the tables in
HBase have 0.5 million records each. I am
You could also add SPARK_MASTER_IP to bind to a specific host/IP so that it
won't get confused with those hosts in your /etc/hosts file.
Thanks
Best Regards
On Fri, Mar 13, 2015 at 12:00 PM, Du Li l...@yahoo-inc.com.invalid wrote:
Hi Spark community,
I searched for a way to configure a
You can put the files anywhere; the streaming application only
picks up files having a timestamp greater than the time the batch was started.
No, you don't need to set any properties for this to work. This is the
default behaviour of Spark Streaming.
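For illustration, a minimal sketch of that default behaviour (the directory path is a placeholder):
```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("file-stream")
val ssc = new StreamingContext(conf, Seconds(30))
// only files whose modification time is newer than the stream's start time are picked up
val lines = ssc.textFileStream("hdfs:///landing/dir")
lines.count().print()
ssc.start()
ssc.awaitTermination()
```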
On Fri, Mar 13, 2015 at 9:38 AM,
I have been trying to do the same, but where exactly do you suggest
creating the static object? Creating it inside for each RDD will
ultimately result in the same problem, and not creating it inside will result in
a serializability issue.
On Fri, Mar 13, 2015 at 11:47 AM, Tathagata Das
Thanks Akhil,
What more info should I give so we can estimate query time in my scenario?
Thanks,
Udbhav Agarwal
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:01 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance
That totally depends
Forget about my last message. I was confused. Spark 1.2.1 + Scala 2.10.4
started by SBT console command also failed with this error. However running
from a standard spark shell works.
Jianshi
On Fri, Mar 13, 2015 at 2:46 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hmm... look like the
Let's say I am using 4 machines with 3 GB RAM. My data is customer records with 5
columns each, in two tables with 0.5 million records. I want to perform a join
query on these two tables.
Thanks,
Udbhav Agarwal
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:16 PM
To:
The size/type of your data and your cluster configuration would be fine, I
think.
Thanks
Best Regards
On Fri, Mar 13, 2015 at 12:07 PM, Udbhav Agarwal udbhav.agar...@syncoms.com
wrote:
Thanks Akhil,
What more info should I give so we can estimate query time in my scenario?
*Thanks,*
Hmm... looks like the console command still starts Spark 1.3.0 with Scala
2.11.6 even though I changed them in build.sbt.
So the test with 1.2.1 is not valid.
Jianshi
On Fri, Mar 13, 2015 at 2:34 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I've confirmed it only failed in console started by
I have the same issue running Spark SQL code from an Eclipse workspace. If
you run your code from the command line (with a packaged jar) or from
IntelliJ, I bet it should work.
IMHO this is somehow related to the Eclipse env, but I would love to know how
to fix it (whether via Eclipse conf, or via a
Okay Akhil! Thanks for the information.
Thanks,
Udbhav Agarwal
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:34 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance
Can't say that unless you try it.
Thanks
Best Regards
On Fri, Mar
Hi Spark community,
I searched for a way to configure a heterogeneous cluster because the need
recently emerged in my project. I didn't find any solution out there. Now I
have thought out a solution and thought it might be useful to many other people
with similar needs. Following is a blog post
Can't say that unless you try it.
Thanks
Best Regards
On Fri, Mar 13, 2015 at 12:32 PM, Udbhav Agarwal udbhav.agar...@syncoms.com
wrote:
Sounds great!
So can I expect response time in milliseconds from the join query over
this much data ( 0.5 million in each table) ?
*Thanks,*
Sounds great!
So can I expect response time in milliseconds from the join query over this
much data ( 0.5 million in each table) ?
Thanks,
Udbhav Agarwal
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:27 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re:
Additionally I wanted to mention that presently I was running the query on one
machine with 3 GB RAM and the join query was taking around 6 seconds.
Thanks,
Udbhav Agarwal
From: Udbhav Agarwal
Sent: 13 March, 2015 12:45 PM
To: 'Akhil Das'
Cc: user@spark.apache.org
Subject: RE: spark sql performance
I've confirmed it only failed in console started by SBT.
I'm using sbt-spark-package plugin, and the initialCommands look like this
(I added implicit sqlContext to it):
show console::initialCommands
[info] println(Welcome to\n +
[info] __\n +
[info] / __/__ ___
You can see where it is spending time, and whether there is any GC time etc.,
from the web UI (running on port 4040). Also, how many cores do you have?
Thanks
Best Regards
On Fri, Mar 13, 2015 at 1:05 PM, Udbhav Agarwal udbhav.agar...@syncoms.com
wrote:
Additionally I wanted to tell that presently I was
Here's a simple consumer which does that
https://github.com/dibbhatt/kafka-spark-consumer/
Thanks
Best Regards
On Thu, Mar 12, 2015 at 10:28 PM, ColinMc colin.mcqu...@shiftenergy.com
wrote:
Hi,
How do you use KafkaUtils to specify a specific partition? I'm writing
customer Marathon jobs
Make sure your Hadoop is running on port 8020; you can check it in your
core-site.xml file and use that URI like:
sc.textFile("hdfs://myhost:myport/data")
Thanks
Best Regards
On Fri, Mar 13, 2015 at 5:15 AM, Lau, Kawing (GE Global Research)
kawing@ge.com wrote:
Hi
I was running with
So you can cache up to 8GB of data in memory (hopefully your data size for one
table is around 2GB), then it should be pretty fast with Spark SQL. Also I'm
assuming you have around 12-16 cores total.
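For illustration, a minimal sketch of caching both tables before the join (the table and column names are placeholders, using the usual SQLContext):
```
sqlContext.cacheTable("customers_a")
sqlContext.cacheTable("customers_b")
val joined = sqlContext.sql(
  "SELECT a.name, b.name FROM customers_a a JOIN customers_b b ON a.id = b.id")
joined.count()   // the first action materializes the caches; later queries run against memory
```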
Thanks
Best Regards
On Fri, Mar 13, 2015 at 12:22 PM, Udbhav Agarwal udbhav.agar...@syncoms.com
wrote:
Okay Akhil.
I have a 4-core CPU (2.4 GHz).
Thanks,
Udbhav Agarwal
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 1:07 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance
You can see where it is spending time, whether there is any GC
Hi,
I am trying to build Spark 1.3 from source. After I executed
mvn -DskipTests clean package
I tried to use the shell but got this error:
[root@sandbox spark]# ./bin/spark-shell
Exception in thread main java.lang.IllegalStateException: No
assemblies found in
Hi.
I have a query in Spark SQL and I cannot convert a value to BIGINT:
CAST(column AS BIGINT) or
CAST(0 AS BIGINT)
The output is:
Exception in thread main java.lang.RuntimeException: [34.62] failure:
``DECIMAL'' expected but identifier BIGINT found
Thanks!!
Regards.
Miguel Ángel
Liancheng also found out that the Spark jars are not included in the
classpath of URLClassLoader.
Hmm... we're very close to the truth now.
Jianshi
On Fri, Mar 13, 2015 at 6:03 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I'm almost certain the problem is the ClassLoader.
So adding
Hi bit1129,
1. Hive in the Spark assembly has most of Hive's dependencies on Hadoop removed, to avoid
conflicts.
2. This Hive is used to run some native commands, which do not rely on Spark
or MapReduce.
Thanks,
Daoyuan
From: bit1...@163.com [mailto:bit1...@163.com]
Sent: Friday, March 13, 2015 4:24 PM
FWIW I do not see any problem. Are you sure the build completed
successfully? You should see an assembly in
assembly/target/scala-2.10, so you should check that.
On Fri, Mar 13, 2015 at 8:26 AM, Patcharee Thongtra
patcharee.thong...@uni.no wrote:
Hi,
I am trying to build spark 1.3 from
I guess it's a ClassLoader issue. But I have no idea how to debug it. Any
hints?
Jianshi
On Fri, Mar 13, 2015 at 3:00 PM, Eric Charles e...@apache.org wrote:
i have the same issue running spark sql code from eclipse workspace. If
you run your code from the command line (with a packaged jar)
There is the PR https://github.com/apache/spark/pull/2077 for doing this.
On Fri, Mar 13, 2015 at 6:42 AM, t1ny wbr...@gmail.com wrote:
Hi all,
We are looking for a tool that would let us visualize the DAG generated by
a
Spark application as a simple graph.
This graph would represent the
These are quite different creatures. You have a distributed set of
Strings, but want a local stream of bytes, which involves three
conversions:
- collect data to driver
- concatenate strings in some way
- encode strings as bytes according to an encoding
Your approach is OK but might be faster to
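For what it's worth, a minimal sketch of those three conversions, for a dataset small enough to collect:
```
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets

val local  = rdd.collect()                      // 1. collect to the driver
val joined = local.mkString("\n")               // 2. concatenate in some way
val in = new ByteArrayInputStream(              // 3. encode as bytes
  joined.getBytes(StandardCharsets.UTF_8))
```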
Hi all,
We are looking for a tool that would let us visualize the DAG generated by a
Spark application as a simple graph.
This graph would represent the Spark Job, its stages and the tasks inside
the stages, with the dependencies between them (either narrow or shuffle
dependencies).
The Spark
Hi,
I normally use dstream.transform whenever I need to use methods which are
available in the RDD API but not in the streaming API, e.g. dstream.transform(x =>
x.sortByKey(true)).
But there are other RDD methods which return types other than RDD, e.g.
dstream.transform(x => x.top(5)); top here returns
Hi, sparkers,
I am kind of confused about hive in the spark assembly. I think hive in the
spark assembly is not the same thing as Hive On Spark (Hive On Spark is meant
to run Hive using the Spark execution engine).
So, my question is:
1. What is the difference between Hive in the spark assembly
Hi,
I have two RDD[Vector]; both Vectors are sparse and of the form:
(id, value)
where id indicates the position of the value in the vector space. I want to
apply a dot product on two such RDD[Vector]s and get a scalar value. The
non-existent values are treated as zero.
Any convenient tool to do
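For what it's worth, a minimal sketch of one way to do this with a join, assuming each vector is represented as an RDD[(Long, Double)] of (index, value) pairs:
```
import org.apache.spark.SparkContext._   // pair-RDD and double-RDD implicits, needed on Spark 1.2

// indices missing from either side contribute zero, so the join drops them automatically
val dot = v1.join(v2)
  .map { case (_, (a, b)) => a * b }
  .sum()
```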
I'm almost certain the problem is the ClassLoader.
So adding
fork := true
solves problems for test and run.
The problem is how can I fork a JVM for sbt console? fork in console :=
true seems not working...
Jianshi
On Fri, Mar 13, 2015 at 4:35 PM, Jianshi Huang jianshi.hu...@gmail.com
Thanks Sean,
I forgot to mention that the data is too big to be collected on the driver.
So yes, your proposition would work in theory, but in my case I cannot hold
all the data in the driver memory, therefore it wouldn't work.
I guess the crucial point is to do the collect in a lazy way and in
Hm, aren't you able to use the SparkContext here? DStream operations
happen on the driver. So you can parallelize() the result?
take() won't work as it's not the same as top()
On Fri, Mar 13, 2015 at 11:23 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
Like this?
Can anyone have a look at this question? Thanks.
bit1...@163.com
From: bit1...@163.com
Date: 2015-03-13 16:24
To: user
Subject: Explanation on the Hive in the Spark assembly
Hi, sparkers,
I am kind of confused about hive in the spark assembly. I think hive in the
spark assembly is not the
It runs faster but there are some drawbacks. It seems to consume more
memory. I get a java.lang.OutOfMemoryError: Java heap space error if I don't
have sufficient partitions for a fixed amount of memory. With the older
(ampcamp) implementation, for the same data size, I didn't get it.
On Thu, Mar
Like this?
dstream.repartition(1).mapPartitions(it => it.take(5))
Thanks
Best Regards
On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed laeeqsp...@yahoo.com.invalid
wrote:
Hi,
I normally use dstream.transform whenever I need to use methods which are
available in RDD API but not in streaming
Hi Team,
I'm getting the below exception when saving the results into Hadoop.
*Code:*
rdd.saveAsTextFile("hdfs://localhost:9000/home/rajesh/data/result.rdd")
Could you please help me how to resolve this issue.
15/03/13 17:19:31 INFO spark.SparkContext: Starting job: saveAsTextFile at
Thanks, it makes sense.
On Thursday, March 12, 2015, Daniel Siegmann daniel.siegm...@teamaol.com
wrote:
Join causes a shuffle (sending data across the network). I expect it will
be better to filter before you join, so you reduce the amount of data which
is sent across the network.
Note this
Hi,
repartition is expensive. I am looking for an efficient way to do this.
Regards, Laeeq
On Friday, March 13, 2015 12:24 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Like this?
dstream.repartition(1).mapPartitions(it => it.take(5))
ThanksBest Regards
On Fri, Mar 13, 2015 at 4:11 PM,
Hi,
Earlier my code was like the following, but it was slow due to the repartition. I want the top K of
each window in a stream.
val counts = keyAndValues.map(x =>
math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4))
val topCounts = counts.repartition(1).map(_.swap).transform(rdd =>
OK, then you do not want to collect() the RDD. You can get an iterator, yes.
There is no such thing as making an Iterator into an InputStream. An
Iterator is a sequence of arbitrary objects; an InputStream is a
channel to a stream of bytes.
I think you can employ similar Guava / Commons utilities
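For illustration, a minimal sketch of getting a local iterator without collecting everything at once (assuming an RDD[String]):
```
val it: Iterator[String] = rdd.toLocalIterator   // pulls one partition at a time to the driver
it.foreach(println)
```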
Hello,
I cannot process a graph with 230M edges.
I cloned apache/spark, built it and then tried it on a cluster.
I used a Spark Standalone cluster:
- 5 machines (each has 12 cores/32GB RAM)
- 'spark.executor.memory' == 25g
- 'spark.driver.memory' == 3g
The graph has 231359027 edges. And its file
Spark 1.1.1 + HBase (CDH 5.3.1). 20 nodes, each with 4 cores and 32G memory. 3
cores and 16G memory were assigned to Spark on each worker node. Standalone
mode. The data set is 3.8 TB. Wondering how to fix this. Thanks!