It seems that the slow reduce tasks are caused by slow shuffling. Here are
the logs for one slow reduce task:
14/06/11 23:42:45 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Got
remote block shuffle_69_88_18 after 5029 ms
14/06/11 23:42:45 INFO
Hi,
1. Does netty perform better than the basic method for shuffling? I found
the latency caused by shuffling in a streaming job is not stable with the
basic method.
2. However, after I turn on netty for shuffling, I can only see the results
for the first two batches, and then no output at all.
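Turning it on looks roughly like this -- a sketch assuming the spark.shuffle.use.netty setting in this version; the app name is illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-netty-test")        // illustrative name
  .set("spark.shuffle.use.netty", "true")    // assumed setting; the basic fetcher is the default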
Hi Nick,
The great thing about any *unsalted* hashes is that you can precompute them
ahead of time; then it is just a lookup to find the password which matches
the hash, in seconds -- which always makes for a more exciting demo than
"come back in a few hours".
It is a no-brainer to write a generator function
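A rough sketch of the idea -- not an actual generator; the wordlist path, hash choice, and names are illustrative, and sc is the usual SparkContext:

import java.security.MessageDigest
import org.apache.spark.SparkContext._

// Hex-encoded MD5 of a candidate password.
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"))
    .map("%02x".format(_)).mkString

// Precompute (hash, password) pairs once; finding a match is then just a lookup.
val candidates = sc.textFile("hdfs:///wordlists/candidates.txt")
val hashToPassword = candidates.map(p => (md5Hex(p), p)).cache()
hashToPassword.lookup("5f4dcc3b5aa765d61d8327deb882cf99")   // Seq("password")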
I reran with master and it looks like it is fixed.
2014-06-12 1:26 GMT+08:00 Michael Armbrust mich...@databricks.com:
I'd try rerunning with master. It is likely you are running into
SPARK-1994 https://issues.apache.org/jira/browse/SPARK-1994.
Michael
On Wed, Jun 11, 2014 at 3:01 AM,
Thanks for verifying!
On Thu, Jun 12, 2014 at 12:28 AM, Pei-Lun Lee pl...@appier.com wrote:
I reran with master and it looks like it is fixed.
2014-06-12 1:26 GMT+08:00 Michael Armbrust mich...@databricks.com:
I'd try rerunning with master. It is likely you are running into
SPARK-1994
This is actually what I've already mentioned - with rainbow tables kept in
memory it could be really fast!
Marek
2014-06-12 9:25 GMT+02:00 Michael Cutler mich...@tumra.com:
Hi Nick,
The great thing about any *unsalted* hashes is that you can precompute them
ahead of time; then it is just a lookup
Hi,
Maybe this is a newbie question: how do I read a snappy-compressed text file?
The OS is Windows 7.
Currently, I've done the following steps:
1. Built Hadoop 2.4.0 with snappy option.
'hadoop checknative' command displays the following line:
snappy: true
Hi,
Can you give us a little more insight into how you used that file to solve
your problem?
We're having the same OOM as you were and haven't been able to solve it yet.
Thanks
The goal of rdd.persist is to create a cached RDD that breaks the DAG
lineage. Therefore, computations *in the same job* that use that RDD can
re-use that intermediate result, but it's not meant to survive between job
runs.
for example:
val baseData =
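(A rough sketch of the kind of example meant here -- parseRecord and the record fields are hypothetical, and sc is the SparkContext:)

import org.apache.spark.SparkContext._

val baseData = sc.textFile("hdfs:///logs/2014-06-12")   // illustrative path
  .map(parseRecord)
  .persist()

// Both aggregations reuse the cached intermediate result within this job,
// instead of re-reading and re-parsing the input; the cache does not outlive the app.
val byUser    = baseData.map(r => (r.user, 1L)).reduceByKey(_ + _)
val byCountry = baseData.map(r => (r.country, 1L)).reduceByKey(_ + _)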
Hi,
The problem was due to a pre-built/binary Tachyon-0.4.1 jar in the
SPARK_CLASSPATH, and that Tachyon jar had been built against
Hadoop-1.0.4. Building Tachyon against Hadoop-2.0.0 resolved the issue.
Thanks
On Wed, Jun 11, 2014 at 11:34 PM, Marcelo Vanzin van...@cloudera.com
wrote:
@villu: thank you for your help. I promise I am going to try it! That's cool :-)
Do you also know the other way around, from PMML to a model object in Spark?
Is there anyone else who is facing this problem of writing to HBase when
running Spark in YARN mode or Standalone mode using this example?
If not, then do I need to explicitly specify something in the classpath?
Regards,
Gaurav
On Wed, Jun 11, 2014 at 1:53 PM, Gaurav Dasgupta
Hi,
How to use SequenceFileRDDFunctions.saveAsSequenceFile() in Java?
A simple example will be a great help.
Thanks in advance.
Hi,
1. The performance depends on your hardware and system configuration; you can
test it yourself. In my tests, the two shuffle implementations show no
particular performance difference in the latest version.
2. That's correct for turning on netty-based shuffle, and there's no shuffle
fetch
Hi, all
Why does the Spark 1.0.0 official doc no longer describe how to build Spark
against a specific Hadoop version?
Does it mean that I don't need to specify the Hadoop version when I build my
Spark 1.0.0 with `sbt/sbt assembly`?
Regards,
Wang Hao(王灏)
CloudTeam | School of Software Engineering
Shanghai
Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want
for your use case. As for Parquet support, that's newly arrived in Spark
1.0.0 together with SparkSQL so continue to watch this space.
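A quick sketch of the round trip (the RDD name, path, and element type are illustrative):

// Write the aggregate once, then read it back in a later job.
aggregated.saveAsObjectFile("hdfs:///tmp/agg-2014-06-12")
val restored = sc.objectFile[(String, Long)]("hdfs:///tmp/agg-2014-06-12")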
Gerard's suggestion to look at JobServer, which you can generalize as
building a
RE:
Given that our agg sizes will exceed memory, we expect to cache them to
disk, so save-as-object (assuming there are no out-of-the-ordinary
performance issues) may solve the problem, but I was hoping to store data
in a column-oriented format. However, I think this in general is not
possible
You need a use case where a lot of computation is applied to a little data.
How about any of the various distributed computing projects out there?
Although the SETI@home use case seems like a cool example, I doubt you want
to reimplement its client.
It might be far simpler to reimplement a search
Hi,
On 06/12/2014 05:47 PM, Toby Douglass wrote:
In these future jobs, when I come to load the aggregated RDD, will Spark
load only the columns being accessed by the query, or will Spark load
everything, convert it into an internal representation, and then execute
the query?
I don't know if this matters, but I'm looking at the web site that spark
puts up, and I see under the streaming tab:
- Started at: Thu Jun 12 11:42:10 EDT 2014
- Time since start: 6 minutes 3 seconds
- Network receivers: 1
- Batch interval: 5 seconds
- Processed batches:
On Thu, Jun 12, 2014 at 4:48 PM, Andre Schumacher
schum...@icsi.berkeley.edu wrote:
On 06/12/2014 05:47 PM, Toby Douglass wrote:
In these future jobs, when I come to load the aggregated RDD, will Spark
load only the columns being accessed by the query, or will Spark
load
Are you able to use the Hadoop input/output reader for HBase with the new
Hadoop API reader?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Jun 12, 2014 at 7:49 AM, gaurav.dasgupta gaurav.d...@gmail.com
wrote:
Is there anyone
Hi Wang Hao,
This is not removed. We moved it here:
http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
If you're building with SBT, and you don't specify the
SPARK_HADOOP_VERSION, then it defaults to 1.0.4.
Andrew
2014-06-12 6:24 GMT-07:00 Hao Wang wh.s...@gmail.com:
Not sure if this is what you're looking for, but have you looked at Java's
ProcessBuilder? You can do something like
for (line <- lines) {
  val command = line.split(" ") // you may need to deal with quoted strings
  val process = new ProcessBuilder(command: _*)
  // redirect output of process to main
You can use foreachRDD and then access the RDD data.
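A minimal sketch, assuming a DStream[String] named lines:

lines.foreachRDD { rdd =>
  // Any RDD operation is available here, e.g. pull a small sample back to the driver.
  rdd.take(10).foreach(println)
}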
Hope this works for you.
Gianluca
On 12 Jun 2014, at 10:06, Wolfinger, Fred
fwolfin...@cyberpointllc.commailto:fwolfin...@cyberpointllc.com wrote:
Good morning.
I have a question related to Spark Streaming. I have reduced some data down to
a
I learned this from my co-worker, but it is relevant here.
Spark uses lazy evaluation by default, which means that none of your code
is executed until you run your saveAsTextFile, so the failure does not
tell you much about where the problem is actually occurring. In order to
debug this better, you might
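One sketch of that kind of step-by-step debugging (the intermediate names and functions are hypothetical):

val parsed  = raw.map(parseLine)      // hypothetical parsing step
parsed.count()                        // forces this stage to run now, so errors surface here
val cleaned = parsed.filter(isValid)  // hypothetical filter
cleaned.count()                       // ...and here, instead of only at saveAsTextFile
cleaned.saveAsTextFile("out")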
@Marcelo: The command ./bin/spark-shell --jars jar1,jar2,etc,etc did not
work for me on a Linux machine.
What I did was append to the classpath in the bin/compute-classpath.sh
file, ran the script, then started the Spark shell, and that worked.
Thanks
Shivani
On Wed, Jun 11, 2014 at 10:52 AM,
Spark friends,
Yesterday the Mesos community announced the program
http://mesos.apache.org/blog/mesoscon-2014-program-announced/ for
#MesosCon http://events.linuxfoundation.org/events/mesoscon, the first-ever
Apache Mesos conference, taking place August 21-22, 2014 in Chicago, IL.
Given the close
Hi,
When we have multiple runs of a program writing to the same output file, the
execution fails if the output directory already exists from a previous run.
Is there some way we can have it overwrite the existing directory, so that
we don't have to manually delete it after each run?
Thanks for
Hi, SK
For 1.0.0 you have to delete it manually;
in 1.0.1 there will be a parameter to enable overwriting:
https://github.com/apache/spark/pull/947/files
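Until then, a rough sketch of the manual delete (the output path and rdd are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}

val out = new Path("hdfs:///user/me/results")
val fs = out.getFileSystem(sc.hadoopConfiguration)
if (fs.exists(out)) fs.delete(out, true)   // recursive delete of the previous run
rdd.saveAsTextFile(out.toString)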
Best,
--
Nan Zhu
On Thursday, June 12, 2014 at 1:57 PM, SK wrote:
Hi,
When we have multiple runs of a program writing to the same
Are you using Spark Streaming?
Is master = “local[n]” where n > 1?
Best,
--
Nan Zhu
On Wednesday, June 11, 2014 at 4:23 AM, gaurav.dasgupta wrote:
Hi Kanwaldeep,
I have tried your code but ran into a problem. The code works fine
in local mode, but if I run the same code in
The old behavior (A) was dangerous, so it's good that (B) is now the
default. But in some cases I really do want to replace the old data, as per
(C). For example, I may rerun a previous computation (perhaps the input
data was corrupt and I'm rerunning with good input).
Currently I have to write
Actually this has been merged to the master branch
https://github.com/apache/spark/pull/947
--
Nan Zhu
On Thursday, June 12, 2014 at 2:39 PM, Daniel Siegmann wrote:
The old behavior (A) was dangerous, so it's good that (B) is now the default.
But in some cases I really do want to
To get the streaming latency I just look at the stats on the application
driver's UI webpage.
I don't know if you can do that programmatically, but you could curl and parse
the page if you had to.
Jeremy Lee BCompSci (Hons)
The Unorthodox Engineers
On 10 Jun 2014, at 3:36 pm, Yingjun Wu
I am also seeing similar problem when trying to continue job using saved
checkpoint. Can somebody help in solving this problem?
I do not want the behavior of (A) - that is dangerous and should only be
enabled to account for legacy code. Personally, I think this option should
eventually be removed.
I want the option (C), to have Spark delete any existing part files before
creating any new output. I don't necessarily want
Creating AMIs from scratch is a complete pain in the ass. If you have a spare
week, sure. I understand why the team avoids it.
The easiest way is probably to spin up a working instance and then use Amazon's
"save as new AMI", but that has some major limitations, especially with
software not
Gents,
I have been bringing up a cluster on EC2 using the spark_ec2.py script.
This works if the cluster has a single slave.
This fails if the cluster has sixteen slaves, during the work to transfer
the SSH key to the slaves. I cannot currently bring up a large cluster.
Can anyone shed any
Yeah, we badly need new AMIs that include at a minimum package/security
updates and Python 2.7. There is an open issue to track the 2.7 AMI update
https://issues.apache.org/jira/browse/SPARK-922, at least.
On Thu, Jun 12, 2014 at 3:34 PM, unorthodox.engine...@gmail.com wrote:
Creating AMIs
On Thu, Jun 12, 2014 at 8:50 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Yes, you need Python 2.7 to run spark-ec2 and most AMIs come with 2.6
Ah, yes - I mean to say, Amazon Linux.
Have you tried either:
1. Retrying launch with the --resume option?
2. Increasing the
I too have seen cached RDDs not hit 100%, even when they are DISK_ONLY.
Just saw that yesterday in fact. In some cases RDDs I expected didn't show
up in the list at all. I have no idea if this is an issue with Spark or
something I'm not understanding about how persist works (probably the
latter).
Hey everyone,
I'm having some trouble increasing the default storage size for a broadcast
variable. It looks like it defaults to a little less than 512MB every time,
and I can't figure out which configuration to change to increase this.
INFO storage.MemoryStore: Block broadcast_0 stored as
If you are launching your application with spark-submit you can manually edit
the spark-class file to make it 1g as baseline.
It’s pretty easy to do and to figure out how once you open the file.
This worked for me even if it’s not a final solution of course.
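An alternative sketch, assuming the ~512MB you see is the MemoryStore capacity (a fraction of the executor/driver heap), is to raise the memory through the application config rather than the script (the value is illustrative):

import org.apache.spark.SparkConf

// Raising executor memory raises the MemoryStore limit derived from it
// (heap size * spark.storage.memoryFraction).
val conf = new SparkConf().set("spark.executor.memory", "2g")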
Gianluca
On 12 Jun 2014, at 15:16,
Hi.
I'm not sure if messages like this are appropriate in this list; I just
want to share with you an application I am working on. This is my personal
project which I started to learn more about Spark and Scala, and, if it
succeeds, to contribute it to the Spark community.
Maybe someone will find
spark-env.sh doesn't seem to contain any settings related to memory size :(
I will continue searching for a solution and will post it if I find it :)
Thank you, anyway
On Wed, Jun 11, 2014 at 12:19 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
It might be that conf/spark-env.sh on EC2 is
Hi Asian,
I'm not sure if the MLbase code is maintained against the current Spark
master. The following is the code we use for standardization at my
company. I intend to clean it up and submit a PR. You could use it
for now.
def standardize(data: RDD[Vector]): RDD[Vector] = {
val summarizer =
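For comparison, a rough two-pass sketch of the same idea using plain RDD operations (this is not the summarizer-based code; it assumes mllib Vectors, and the epsilon guard against zero variance is arbitrary):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def standardizeSketch(data: RDD[Vector]): RDD[Vector] = {
  val n = data.count().toDouble
  val dim = data.first().size

  // Pass 1: per-column mean and standard deviation.
  val sums = data.map(_.toArray).reduce((a, b) => Array.tabulate(dim)(i => a(i) + b(i)))
  val means = sums.map(_ / n)
  val sqDiffSums = data.map(_.toArray)
    .map(arr => Array.tabulate(dim)(i => (arr(i) - means(i)) * (arr(i) - means(i))))
    .reduce((a, b) => Array.tabulate(dim)(i => a(i) + b(i)))
  val stdevs = sqDiffSums.map(s => math.max(math.sqrt(s / n), 1e-12))

  // Pass 2: rescale each column to zero mean and unit variance.
  data.map { v =>
    val arr = v.toArray
    Vectors.dense(Array.tabulate(dim)(i => (arr(i) - means(i)) / stdevs(i)))
  }
}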
Hi, I'm consistently getting NullPointerExceptions when trying to use
String val objects defined in my main application -- even for broadcast
vals!
I'm deploying on a standalone cluster with a master and 4 workers on the
same machine, which is not the machine I'm submitting from.
The following
I create a Java custom receiver and then call
ssc.receiverStream(new MyReceiver("localhost", 8081)); // where ssc is the
JavaStreamingContext
I am expecting that the receiver's onStart method gets called but it does
not.
Can anyone give me some guidance? What am I not doing?
Here's the
Great! I was going to implement one of my own - but I may not need to do that
any more :)
I haven't had a chance to look deep into your code but I would recommend
accepting an RDD[Double,Double] as well, instead of just a file.
val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")
And other
That stack trace is quite similar to the one that is generated when trying
to do a collect within a closure. In this case, it feels wrong to
collect in a closure, but I wonder what the reason behind the NPE is.
Curious to know whether they are related.
Here's a very simple example:
rrd1.flatMap(x=
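A fuller sketch of that kind of nested call (illustrative names):

val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(1 to 10)
val broken = rdd1.flatMap { x =>
  // RDD operations can only be invoked on the driver, so using rdd2 inside
  // this closure blows up (an NPE) when the task actually runs on an executor.
  rdd2.collect().filter(_ > x)
}
broken.count()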
ah, I see,
I think it’s hard to do something like fs.delete() in Spark code (it’s scary,
as we discussed in the previous PR)
so if you want (C), I guess you have to do some delete work manually
Best,
--
Nan Zhu
On Thursday, June 12, 2014 at 3:31 PM, Daniel Siegmann wrote:
I do
Hi Congrui Yi,
Spark is implemented in Scala, so new features are available first in
Scala/Java. PySpark is a Python wrapper around the Scala code, so it won't
always have the latest features. This is especially true for the machine
learning library.
Eric
--
View this message in context:
Maybe it would be nice if unpersist() ‘triggered’ the computation of other
RDDs that depend on it but have not yet been computed.
The pseudo code could be as follows:
unpersist()
{
  if (this rdd has not been persisted)
    return;
  for (all rdds that depend on this rdd but are not yet computed)
Are you launching this using our EC2 scripts? Or have you set up a cluster by
hand?
Matei
On Jun 12, 2014, at 2:32 PM, Aliaksei Litouka aliaksei.lito...@gmail.com
wrote:
spark-env.sh doesn't seem to contain any settings related to memory size :( I
will continue searching for a solution and
You should be able to see the streaming tab in the Spark web UI (running on
port 4040) if you have created a StreamingContext and you are using Spark 1.0.
TD
On Thu, Jun 12, 2014 at 1:06 AM, Ravi Hemnani raviiihemn...@gmail.com
wrote:
Hey,
I did
FYI: Here is a related discussion
http://apache-spark-user-list.1001560.n3.nabble.com/Persist-and-unpersist-td6437.html
about this.
On Thu, Jun 12, 2014 at 8:10 PM, innowireless TaeYun Kim
taeyun@innowireless.co.kr wrote:
Maybe it would be nice if unpersist() ‘triggered’ the computations
Currently I use rdd.count() to force computation, as Nick Pentreath
suggested.
I think it would be nice to have a method that forcefully computes an RDD,
so that the unnecessary RDDs can be safely unpersist()ed.
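A sketch of the pattern (names are illustrative): materialize the dependents while the parent is still cached, then drop the parent.

val rddA     = input.map(expensiveTransform).cache()   // hypothetical parent
val shortRdd = rddA.filter(isRecent).cache()
val longRdd  = rddA.filter(isHistorical).cache()

shortRdd.count()   // forceful computation, as with rdd.count() above
longRdd.count()
rddA.unpersist()   // safe now: neither dependent needs to recompute rddA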
Let’s consider a case where rdd_a is a parent of both:
(1) a short-term
Thanks for the answer.
Yes, I tried to launch an interactive REPL in the middle of my application
:)
(I've clarified the statement (1) of my previous mail. See below.)
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Sent: Friday, June 13, 2014 10:05 AM
To: user@spark.apache.org
Subject: RE: Question about RDD cache, unpersist, materialization
Currently I use
On Thu, Jun 12, 2014 at 9:10 PM, Zongheng Yang zonghen...@gmail.com wrote:
Hi Toby,
It is usually the case that even if the EC2 console says the nodes are
up, they are not really fully initialized. For 16 nodes I have found
`--wait 800` to be the norm that makes things work.
It seems so!
Yeah, we should probably add that. Feel free to file a JIRA.
You can get it manually by calling sc.setJobDescription with the query text
before running the query.
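A minimal sketch of that (assuming a SQLContext named sqlContext; the query is illustrative):

val query = "SELECT COUNT(*) FROM events"
sc.setJobDescription(query)       // shows up as the job description in the UI
sqlContext.sql(query).collect()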
Michael
On Thu, Jun 12, 2014 at 5:49 PM, shlee0605 shlee0...@gmail.com wrote:
In shark, the input SQL string was shown at the
Vipul,
Thanks for your feedback. As far as I understand, you mean RDD[(Double,
Double)] (note the parentheses), and each of these Double values is
supposed to contain one coordinate of a point. That limits us to
2-dimensional space, which is not suitable for many tasks. I want the
algorithm to be able
This issue is resolved.
Matei,
Thanks for the answer; this clarifies things very much. Based on my usage I
would use combineByKey, since the output is another custom data structure.
I found out my issues with combineByKey were relieved after doing more
tuning with the level of parallelism. I've found that it really
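A sketch of that kind of tuning (the pair RDD, aggregate type, and partition count are illustrative):

import org.apache.spark.SparkContext._

// combineByKey with an explicit level of parallelism.
val combined = pairs.combineByKey(
  (v: Double) => List(v),                        // createCombiner
  (acc: List[Double], v: Double) => v :: acc,    // mergeValue
  (a: List[Double], b: List[Double]) => a ++ b,  // mergeCombiners
  numPartitions = 200)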
Bump
On Mon, Jun 9, 2014 at 3:22 PM, Michael Chang m...@tellapart.com wrote:
Hi all,
I'm seeing exceptions that look like the below in Spark 0.9.1. It looks
like I'm running out of inodes on my machines (I have around 300k each in a
12 machine cluster). I took a quick look and I'm seeing
Hi Michael,
I think you can set spark.cleaner.ttl=xxx to enable the time-based metadata
cleaner, which will clean old unused shuffle data when it times out.
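A sketch of setting it (the value is in seconds and illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.cleaner.ttl", "3600")   // clean metadata older than an hour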
For Spark 1.0 another way is to clean shuffle data using weak reference
(reference tracking based, configuration is
Hi Sguj,
Could you give me the exception stack?
I tested it on my laptop and found that it gets the wrong FileSystem. It should
be DistributedFileSystem, but it finds the RawLocalFileSystem.
If we get the same exception stack, I'll try to fix it.
Here is my exception stack:
The scripts for Spark 1.0 actually specify this property in
/root/spark/conf/spark-defaults.conf.
I didn't know that this would override the --executor-memory flag, though;
that's pretty odd.
On Thu, Jun 12, 2014 at 6:02 PM, Aliaksei Litouka
aliaksei.lito...@gmail.com wrote:
Yes, I am
Hi, Andrew
Got it, Thanks!
Hao
Regards,
Wang Hao(王灏)
CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
Email:wh.s...@gmail.com
On Fri, Jun 13, 2014 at 12:42 AM, Andrew Or and...@databricks.com wrote:
Hi
I want to take multiple passes through my data in mapPartitions. However, the
iterator only allows you to take one pass through the data. If I transform
the iterator into an array using iter.toArray, it is too slow, since it
copies all the data into a new Scala array. Also it takes twice the