Hi --
I tried searching for Eclipse Spark plugin setup for developing with Spark,
and there seems to be some information I can go with, but I have not seen a
starter app or project to import into Eclipse and try out. Can anyone
please point me to any Scala projects to import into Scala Eclipse?
I’m running this query against RDD[Tweet], where Tweet is a simple case
class with 4 fields.
sqlContext.sql("""
  SELECT user, COUNT(*) as num_tweets
  FROM tweets
  GROUP BY user
  ORDER BY
    num_tweets DESC,
    user ASC
  ;
""").take(5)
The first time I run this, it throws the following:
You'll need to build SparkR to match the Spark version deployed on the
cluster. You can do that by changing the Spark version in SparkR's
build.sbt [1]. If you are using the Maven build you'll need to edit pom.xml
Thanks
Shivaram
[1]
I am just running PageRank (included in GraphX) on a dataset which has 55876487
edges. I submit the application to YARN with the options `--num-executors 30
--executor-memory 30g --driver-memory 10g --executor-cores 8`.
Thanks
From: Ankur
You might be hitting SPARK-1994
https://issues.apache.org/jira/browse/SPARK-1994, which is fixed in 1.0.1.
On Mon, Jul 14, 2014 at 11:16 PM, Nick Chammas nicholas.cham...@gmail.com
wrote:
I’m running this query against RDD[Tweet], where Tweet is a simple case
class with 4 fields.
In general it would be nice to be able to configure replication on a
per-job basis. Is there a way to do that without changing the config
values in the Hadoop conf/ directory between jobs? Maybe by modifying
OutputFormats or the JobConf ?
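A minimal sketch of one direction (my own, not from this thread): clone the Hadoop
configuration, override dfs.replication for just this save, and pass it to the save call.
Whether the setting is actually honored depends on the OutputFormat and on FileSystem
caching, so treat this as a starting point rather than a confirmed recipe.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.SparkContext._

// Per-job copy of the Hadoop configuration; the global conf/ directory stays untouched.
val jobConf = new Configuration(sc.hadoopConfiguration)
jobConf.set("dfs.replication", "2")     // illustrative replication factor

val someRdd = sc.parallelize(1 to 100)  // illustrative data
someRdd.map(x => (NullWritable.get(), new Text(x.toString)))
  .saveAsNewAPIHadoopFile("/tmp/replication-test",   // hypothetical output path
    classOf[NullWritable], classOf[Text],
    classOf[TextOutputFormat[NullWritable, Text]], jobConf)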
On Mon, Jul 14, 2014 at 11:12 PM, Matei Zaharia
I mean the query on the nested data such as JSON, not the nested query, sorry
for the misunderstanding.
Ah, good catch, that seems to be it.
I'd use 1.0.1, except I've been hitting up against SPARK-2471
https://issues.apache.org/jira/browse/SPARK-2471 with that version, which
doesn't let me access my data in S3. :(
OK, at least I know this has probably already been fixed.
Nick
On Tue, Jul 15,
Hmm, I'd like to clarify something from your comments, Tathagata.
Going forward, is Twitter Streaming functionality not supported from the
shell? What should users do if they'd like to process live Tweets from the
shell?
Nick
On Mon, Jul 14, 2014 at 11:50 PM, Nicholas Chammas
In general this should be supported using [] to access array data and .
to access nested fields. Is there something you are trying that isn't
working?
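For example, a minimal sketch of the dot-notation case (my own example, not code from
this thread; the file path and JSON shape are made up):
// people.json containing e.g. {"name": "alice", "address": {"city": "SF"}}
val people = sqlContext.jsonFile("people.json")
people.registerAsTable("people")
sqlContext.sql("SELECT name, address.city FROM people").collect()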
On Mon, Jul 14, 2014 at 11:25 PM, anyweil wei...@gmail.com wrote:
I mean the query on the nested data such as JSON, not the nested query,
Could you share the code of RecommendationALS and the complete
spark-submit command line options? Thanks! -Xiangrui
On Mon, Jul 14, 2014 at 11:23 PM, Srikrishna S srikrishna...@gmail.com wrote:
Using properties file: null
Main class:
RecommendationALS
Arguments:
_train.csv
_validation.csv
Thank you so much for the reply, here is my code.
val conf = new SparkConf().setAppName("Simple Application")
conf.setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
val path1 =
I don't think MLlib supports model serialization/deserialization. You
got the error because the constructor is private. I created a JIRA for
this: https://issues.apache.org/jira/browse/SPARK-2488 and we try to
make sure it is implemented in v1.1. For now, you can modify the
KMeansModel and remove
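For reference, a workaround sketch (my own, not something suggested in this thread):
persist just the cluster centers, since the model object itself cannot be rebuilt while
the constructor is private.
import java.io._
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(5.0, 6.0)))
val model = KMeans.train(data, 2, 10)

// Save the centers as plain Array[Array[Double]], which is Java-serializable.
val out = new ObjectOutputStream(new FileOutputStream("centers.bin"))
out.writeObject(model.clusterCenters.map(_.toArray))
out.close()

// Later / elsewhere: reload the centers; nearest-center lookup has to be done manually.
val in = new ObjectInputStream(new FileInputStream("centers.bin"))
val centers = in.readObject().asInstanceOf[Array[Array[Double]]].map(Vectors.dense)
in.close()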
Yes, just as in my last post, using [] to access array data and . to access
nested fields does not seem to work.
BTW, I have dug into the code of the current master branch:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
from
Sorry for the trouble. There are two issues here:
- Parsing of repeated nested (i.e. something[0].field) is not supported in
the plain SQL parser. SPARK-2096
https://issues.apache.org/jira/browse/SPARK-2096
- Resolution is broken in the HiveQL parser. SPARK-2483
I am running this in standalone mode on a single machine. I built the spark
jar from scratch (sbt assembly) and then included that in my application
(the same process I have done for earlier versions).
thanks.
This is (obviously) spark streaming, by the way.
On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat walrusthe...@gmail.com
wrote:
Hi,
I've got a socketTextStream through which I'm reading input. I have three
DStreams, all of which are the same window operation over that
socketTextStream. I
If you want to make Twitter* classes available in your shell, I believe you
could do the following
1. Change the parent pom module ordering - Move external/twitter before
assembly
2. In assembly/pom.xml, add the external/twitter dependency - this will package
twitter* into the assembly jar
Now when
Hi Michael,
Good to know it is being handled. I tried master branch (9fe693b5) and got
another error:
scala> sqlContext.parquetFile("/tmp/foo")
java.lang.RuntimeException: Unsupported parquet datatype optional
fixed_len_byte_array(4) b
at scala.sys.package$.error(package.scala:27)
at
Thank you so much for the information. Now I have merged the fix of #1411 and
it seems the HiveQL works with:
SELECT name FROM people WHERE schools[0].time > 2.
But one more question is:
Is it possible or planned to support the schools.time format to filter the
record that there is an element inside
The problem is resolved. Thanks.
You could try the following: create a minimal project using sbt or Maven,
add spark-streaming-twitter as a dependency, run sbt assembly (or mvn
package) on that to create a fat jar (with Spark as provided dependency),
and add that to the shell classpath when starting up.
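A minimal build.sbt sketch of that setup (my own; the version numbers are illustrative
for the 1.0.x era, and sbt assembly additionally needs the sbt-assembly plugin in
project/plugins.sbt):
name := "twitter-shell-deps"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"              % "1.0.1" % "provided",
  "org.apache.spark" %% "spark-streaming"         % "1.0.1" % "provided",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.0.1"
)
Then run sbt assembly and start the shell with something like
bin/spark-shell --jars <path to the assembly jar>.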
On Tue, Jul 15, 2014 at
Oh, maybe not. Please file another JIRA.
On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi Michael,
Good to know it is being handled. I tried master branch (9fe693b5) and got
another error:
scala> sqlContext.parquetFile("/tmp/foo")
java.lang.RuntimeException:
Hi,
I’ve got a problem with Spark Streaming and tshark.
While I’m running locally I have no problems with this code, but when I run it
on an EC2 cluster I get the exception shown just under the code.
def dissection(s: String): Seq[String] = {
try {
Process(hadoop command to create
Hi there,
I've been successfully using the precompiled Spark 1.0.0 Java API on a small
cluster in standalone mode. However, when I try to use Kryo serializer by
adding
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
as suggested, Spark crashes out with the following error:
Agree. You end up with a core and a corer core to distinguish
between and it ends up just being more complicated. This sounds like
something that doesn't need a module.
On Tue, Jul 15, 2014 at 5:59 AM, Patrick Wendell pwend...@gmail.com wrote:
Adding new build modules is pretty high overhead, so
Dear Ankur,
Thanks so much!
Btw, is there any possibility to customise the partition strategy as we expect?
Best,
Yifan
On Jul 11, 2014, at 10:20 PM, Ankur Dave ankurd...@gmail.com wrote:
Hi Yifan,
When you run Spark on a single machine, it uses a local mode where one task
per core can
Filed SPARK-2446
2014-07-15 16:17 GMT+08:00 Michael Armbrust mich...@databricks.com:
Oh, maybe not. Please file another JIRA.
On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi Michael,
Good to know it is being handled. I tried master branch (9fe693b5) and
got
Sorry, should be SPARK-2489
2014-07-15 19:22 GMT+08:00 Pei-Lun Lee pl...@appier.com:
Filed SPARK-2446
2014-07-15 16:17 GMT+08:00 Michael Armbrust mich...@databricks.com:
Oh, maybe not. Please file another JIRA.
On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi Rajesh,
can you describe your spark cluster setup? I saw localhost:2181 for
zookeeper.
Best Regards,
Jerry
On Tue, Jul 15, 2014 at 9:47 AM, Madabhattula Rajesh Kumar
mrajaf...@gmail.com wrote:
Hi Team,
Could you please help me to resolve this issue.
*Issue*: I'm not able to connect
One vector to check is the HBase libraries in the --jars, as in:
spark-submit --class <your class> --master <master url> --jars
Hey Diana,
Did you ever figure this out?
I’m running into the same exception, except in my case the function I’m
calling is a KMeans model.predict().
In regular Spark it works, and Spark Streaming without the call to
model.predict() also works, but when put together I get this serialization
Hi all,
I got a strange problem: I submit a reduce job (any one split), and it finished
normally on the Executor; the log is:
14/07/15 21:08:56 INFO Executor: Serialized size of result for 0 is 10476031
14/07/15 21:08:56 INFO Executor: Sending result for 0 directly to driver
14/07/15 21:08:56 INFO Executor:
It works perfectly, thanks! I feel like I should have figured that out; I'll
chalk it up to inexperience with Scala. Thanks again.
Hi Rajesh,
I have a feeling that this is not directly related to spark but I might be
wrong. The reason why is that when you do:
Configuration configuration = HBaseConfiguration.create();
by default, it reads the configuration files hbase-site.xml in your
classpath and ... (I don't remember
Hi Nathan,
I think there are two possible reasons for this. One is that even though you
are caching RDDs, their lineage chain gets longer and longer, and thus
serializing each RDD takes more time. You can cut off the chain by using
RDD.checkpoint() periodically, say every 5-10 iterations. The
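A minimal sketch of that periodic-checkpoint pattern (my own illustration, not code
from this thread):
// Checkpointing needs a reliable directory, e.g. on HDFS (path is hypothetical).
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

var data = sc.parallelize(1 to 1000).map(_.toDouble)
for (i <- 1 to 20) {
  data = data.map(_ * 0.9).cache()
  if (i % 5 == 0) {
    data.checkpoint()   // cut off the lineage chain every 5 iterations
    data.count()        // force evaluation so the checkpoint actually happens
  }
}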
Make the Array a Seq.
On Tue, Jul 15, 2014 at 7:12 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Hi all,
How should I store a one-to-many relationship using Spark SQL and Parquet
format? For example, I have the following case class
case class Person(key: String, name: String, friends:
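A minimal sketch of the suggested change (my own reconstruction; the remaining field is
a guess, and depending on the Spark build, nested/repeated types in Parquet may need a
recent version):
case class Person(key: String, name: String, friends: Seq[String])

import sqlContext.createSchemaRDD
val people = sc.parallelize(Seq(Person("p1", "Alice", Seq("p2", "p3"))))
people.saveAsParquetFile("people.parquet")   // hypothetical output path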
Hi folks,
I'm running into the following error when trying to perform a join in my
code:
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.spark.sql.catalyst.types.LongType$
I see similar errors for StringType$ and also:
scala.reflect.runtime.ReflectError: value apache is
Hi all,
I was curious about the details of Spark speculation. So, my understanding
is that, when "speculated" tasks are newly scheduled on other machines, the
original tasks are still running until the entire stage completes. This
seems to leave some room for duplicated work because some spark
Interesting. I run it on my local one-node cluster using Apache Hadoop.
On Tue, Jul 15, 2014 at 7:55 AM, Jerry Lam chiling...@gmail.com wrote:
For your information, the SparkSubmit runs at the host you executed the
spark-submit shell script (which in turns invoke the SparkSubmit program).
Since
Hello All,
I am a newbie to Apache Spark; I would like to know about performance
benchmarks of Apache Spark.
My current requirement is as follows:
I have a few files in 2 S3 buckets.
Each file may have a minimum of 1 million records. The file data are tab
separated.
I have to compare a few columns and filter the
Hi All,
I am trying to perform regularized logistic regression with mllib in python.
I have seen that this is possible in the following scala example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala
But I do not see
Hi --
New to Spark and trying to figure out how to generate unique counts per
page by date given this raw data:
timestamp,page,userId
1405377264,google,user1
1405378589,google,user2
1405380012,yahoo,user1
..
I can do a groupBy a field and get the count:
val lines = sc.textFile("data.csv")
val
Are you registering multiple RDDs of case classes as tables concurrently?
You are possibly hitting SPARK-2178
https://issues.apache.org/jira/browse/SPARK-2178 which is caused by
SI-6240 https://issues.scala-lang.org/browse/SI-6240.
On Tue, Jul 15, 2014 at 10:49 AM, Keith Simmons
You can use .distinct.count on your user RDD.
What are you trying to achieve with the time group by?
—
Sent from Mailbox
On Tue, Jul 15, 2014 at 8:14 PM, buntu buntu...@gmail.com wrote:
Hi --
New to Spark and trying to figure out how to generate unique counts per
page by date given
Sounds like a job for Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html !
On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath
nick.pentre...@gmail.com wrote:
You can use .distinct.count on your user RDD.
What are you trying to achieve with the time group by?
—
Sent from
Hello all,
I am a newbie to Spark, just analyzing the product. I am facing a
performance problem with Hive, and trying to analyse whether Spark will solve
it or not, but it seems that Spark is also taking a lot of time. Let me know if I
miss anything.
shark> select count(time) from table2;
OK
6050
Time
We have CDH 5.0.2, which doesn't include Spark SQL yet; it may only be
available in CDH 5.1, which is yet to be released.
If Spark SQL is the only option then I might need to hack around to add it
into the current CDH deployment, if that's possible.
Thanks Nick.
All I'm attempting is to report the number of unique visitors per page by date.
Ah, I didn't realize this was non-MLLib code. Do you mean to be
sending stochasticLossHistory
in the closure as well?
On Sun, Jul 13, 2014 at 1:05 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote:
It uses the standard SquaredL2Updater, and I also tried to broadcast it as
well.
The input is a
If you are counting per time and per page, then you need to group by
time and page not just page. Something more like:
csv.groupBy(csv => (csv(0), csv(1))) ...
This gives a list of users per (time,page). As Nick suggests, then you
count the distinct values for each key:
...
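Putting the suggestions together, a minimal end-to-end sketch (my own assembly, assuming
the columns are timestamp, page, userId as in the sample data; the epoch-seconds-to-date
conversion is my addition):
val lines = sc.textFile("data.csv")
val counts = lines
  .filter(!_.startsWith("timestamp"))                     // drop the header row
  .map(_.split(","))
  .map { row =>
    val date = new java.text.SimpleDateFormat("yyyy-MM-dd")
      .format(new java.util.Date(row(0).toLong * 1000L))  // epoch seconds -> date string
    ((date, row(1)), row(2))                              // ((date, page), userId)
  }
  .groupByKey()
  .mapValues(users => users.toSet.size)                   // distinct users per (date, page)
counts.take(5)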
Hi guys,
Sorry, I'm also interested in this nested JSON structure.
I have a similar SQL query in which I need to query a nested field in a JSON document.
Does the above query work if it is used with sql(sqlText), assuming the
data is coming directly from HDFS via sqlContext.jsonFile?
The SPARK-2483
FWIW, I am unable to reproduce this using the example program locally.
On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons keith.simm...@gmail.com wrote:
Nope. All of them are registered from the driver program.
However, I think we've found the culprit. If the join column between two
tables is
Hello,
I am curious about something:
val result = for {
  (dt, evrdd) <- evrdds
  val ct = evrdd.count
} yield (dt -> ct)
works.
val result = for {
  (dt, evrdd) <- evrdds
  val ct = evrdd.countByValue
} yield (dt -> ct)
does not work. I get:
14/07/15 16:46:33 WARN
On Jul 15, 2014, at 12:06 PM, Yifan LI iamyifa...@gmail.com wrote:
Btw, is there any possibility to customise the partition strategy as we
expect?
I'm not sure I understand. Are you asking about defining a custom
Not sure I can help, but I ran into the same problem. Basically my use case
is that I have a List of strings, which I then convert into an RDD using
sc.parallelize(). This RDD is then operated on by the foreach() function.
Same as you, I get a runtime exception :
java.lang.ClassCastException:
Hi Nan,
Great digging in -- that makes sense to me for when a job is producing some
output handled by Spark like a .count or .distinct or similar.
For the other part of the question, I'm also interested in side effects
like an HDFS disk write. If one task is writing to an HDFS path and
another
Hi, I wonder if I do wordcount on two different files, like this:
val file1 = sc.textFile("/...")
val file2 = sc.textFile("/...")
val wc1 = file1.flatMap(..).reduceByKey(_ + _, 1)
val wc2 = file2.flatMap(...).reduceByKey(_ + _, 1)
wc1.saveAsTextFile("titles.out")
wc2.saveAsTextFile("tables.out")
Would the
To give a few more details of my environment in case that helps you
reproduce:
* I'm running spark 1.0.1 downloaded as a tar ball, not built myself
* I'm running in stand alone mode, with 1 master and 1 worker, both on the
same machine (though the same error occurs with two workers on two
Thanks Sean!! That's what I was looking for -- group by on multiple fields.
I'm gonna play with it now. Thanks again!
On second thought I am not entirely sure whether that bug is the issue. Are
you continuously appending to the file that you have copied to the
directory? Because filestream works correctly when the files are atomically
moved to the monitored directory.
TD
On Mon, Jul 14, 2014 at 9:08 PM,
Could you run it locally first to make sure it works, and you see output?
Also, I recommend going through the previous step-by-step approach to
narrow down where the problem is.
TD
On Mon, Jul 14, 2014 at 9:15 PM, hsy...@gmail.com hsy...@gmail.com wrote:
Actually, I deployed this on yarn
I see you have the code to convert to the Record class but commented it out.
That is the right way to go. When you are converting it to a 4-tuple with
(data("type"), data("name"), data("score"), data("school")) ... it's of type
(Any, Any, Any, Any), as data(xyz) returns Any. And registerAsTable
probably doesn't
Hi,
I have a json file where the object definition in each line includes an
array component obj that contains 0 or more elements as shown by the
example below.
{"name": "16287e9cdf", "obj": [{"min": 50, "max": 59}, {"min": 20, "max":
29}]},
{"name": "17087e9cdf", "obj": [{"min": 30, "max": 39}, {"min": 10, "max":
19},
This sounds really really weird. Can you give me a piece of code that I can
run to reproduce this issue myself?
TD
On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat walrusthe...@gmail.com
wrote:
This is (obviously) spark streaming, by the way.
On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat
Can you print out the queryExecution?
(i.e. println(sql().queryExecution))
On Tue, Jul 15, 2014 at 12:44 PM, Keith Simmons keith.simm...@gmail.com
wrote:
To give a few more details of my environment in case that helps you
reproduce:
* I'm running spark 1.0.1 downloaded as a tar ball,
What you said makes sense. But when I proportionately reduce the heap
size, then also the problem persists. For instance, if I use 160 GB heap for
48 cores, whereas 80 GB heap for 24 cores, then also with 24 cores the
performance is better (although worse than 160 GB with 24 cores) than
Will do.
On Tue, Jul 15, 2014 at 12:56 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
This sounds really really weird. Can you give me a piece of code that I
can run to reproduce this issue myself?
TD
On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat walrusthe...@gmail.com
wrote:
I haven't look at the implementation but what you would do with any
filesystem is write to a file inside the workspace directory of the task.
And then only the attempt of the task that should be kept will perform a
move to the final path. The other attempts are simply discarded. For most
Sure thing. Here you go:
== Logical Plan ==
Sort [key#0 ASC]
Project [key#0,value#1,value#2]
Join Inner, Some((key#0 = key#3))
SparkLogicalPlan (ExistingRdd [key#0,value#1], MapPartitionsRDD[2] at
mapPartitions at basicOperators.scala:176)
SparkLogicalPlan (ExistingRdd [value#2,key#3],
The last two lines are what trigger the operations, and they will each
block until the result is computed and saved. So if you execute this
code as-is, no. You could write a Scala program that invokes these two
operations in parallel, like:
Array((wc1, "titles.out"), (wc2, "tables.out")).par.foreach {
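  // (a guess at how the truncated snippet continues; it just saves each RDD in parallel)
  case (rdd, path) => rdd.saveAsTextFile(path)
}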
I am still getting the error... even if I convert it to a Record
object KafkaWordCount {
def main(args: Array[String]) {
if (args.length < 4) {
  System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
System.exit(1)
}
Just my few cents on this.
I am having the same problem with v1.0.1, but this bug is sporadic and looks
like it is related to object initialization.
What's more, I'm not using any SQL or anything like that. I just have a utility class
like this:
object DataTypeDescriptor {
type DataType = String
val
Hi everyone.
I have some problems running multiple streams at the same time.
What i am doing is:
object Test {
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.api.java.function._
import org.apache.spark.streaming._
To add to my previous post, the error at runtime is the following:
Exception in thread main org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception
failure in TID 0 on host localhost: org.json4s.package$MappingException:
Hi,
I would like to use the dataset used in the Big Data Benchmark
https://amplab.cs.berkeley.edu/benchmark/ on my own cluster, to run some
tests between Hadoop and Spark. The dataset should be available at
s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix],
Hi Tom,
If you wish to load the file in Spark directly, you can use
sc.textFile("s3n://big-data-benchmark/pavlo/...") where sc is your
SparkContext. This can be
done because the files should be publicly available and you don't need AWS
credentials to access them.
If you want to download the
If you're already using Scala for Spark programming and you hate Oozie XML
as much as I do ;), you might check out Scoozie, a Scala DSL for Oozie:
https://github.com/klout/scoozie
On Thu, Jul 10, 2014 at 5:52 PM, Andrei faithlessfri...@gmail.com wrote:
I used both - Oozie and Luigi - but found
Hello spark folks,
I have a simple Spark cluster setup, but I can't get jobs to run on it. I
am using standalone mode.
One master, one slave. Both machines have 32GB ram and 8 cores.
The slave is setup with one worker that has 8 cores and 24GB memory
allocated.
My application requires 2
Yes, this is a proposed patch to MLlib so that you can use 1 RDD to train
multiple models at the same time. I am hoping that multiplexing several
models in the same RDD will be more efficient than trying to get the Spark
scheduler to manage a few hundred tasks simultaneously.
I don't think I see
Have you looked at the slave machine to see if the process has
actually launched? If it has, have you tried peeking into its log
file?
(That error is printed whenever the executors fail to report back to
the driver. Insufficient resources to launch the executor is the most
common cause of that,
The way the HDFS file writing works at a high level is that each attempt to
write a partition to a file starts writing to unique temporary file (say,
something like targetDirectory/_temp/part-X_attempt-). If the
writing into the file successfully completes, then the temporary file is
moved
I got a bit progress. I think the problem is with the
BinaryClassificationMetrics,
as long as I comment out all the prediction related metrics, I can run the
svm example with my data.
So the problem should be there I guess.
I am very curious though. Can you post a concise code example which we can
run to reproduce this problem?
TD
On Tue, Jul 15, 2014 at 2:54 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
I am not entirely sure off the top of my head. But a possible (usually
works) workaround is to define
Got a spark/shark cluster up and running recently, and have been kicking
the tires on it. However, been wrestling with an issue on it that I'm
not quite sure how to solve. (Or, at least, not quite sure about the
correct way to solve it.)
I ran a simple Hive query (select count ...) against
Yeah, this is handled by the commit call of the FileOutputFormat. In general
Hadoop OutputFormats have a concept called committing the output, which you
should do only once per partition. In the file ones it does an atomic rename to
make sure that the final output is a complete file.
Matei
On
Also, it helps if you post us logs, stacktraces, exceptions, etc.
TD
On Tue, Jul 15, 2014 at 10:07 AM, Jerry Lam chiling...@gmail.com wrote:
Hi Rajesh,
I have a feeling that this is not directly related to spark but I might be
wrong. The reason why is that when you do:
Configuration
Thanks a lot Sean, Daniel, Matei and Jerry. I really appreciate your replies.
And I also understand that I should be a little more patient. When I myself
am not able to reply within the next 5 hours, how can I expect a question to
be answered in that time?
And yes, the idea of using a separate
- user@incubator
Hi Keith,
I did reproduce this using local-cluster[2,2,1024], and the errors
look almost the same. Just wondering, despite the errors did your
program output any result for the join? On my machine, I could see the
correct output.
Zongheng
On Tue, Jul 15, 2014 at 1:46 PM,
hql and sql are just two different dialects for interacting with data.
After parsing is complete and the logical plan is constructed, the
execution is exactly the same.
On Tue, Jul 15, 2014 at 2:50 PM, Jerry Lam chiling...@gmail.com wrote:
Hi Michael,
I don't understand the difference
Creating multiple StreamingContexts using the same SparkContext is
currently not supported. :)
Guess it was not clear in the docs. Note to self.
TD
On Tue, Jul 15, 2014 at 1:50 PM, gorenuru goren...@gmail.com wrote:
Hi everyone.
I have some problems running multiple streams at the same
Oh, sad to hear that :(
From my point of view, creating a separate Spark context for each stream is too
expensive.
Also, it's annoying because we have to be responsible for proper Akka and UI
port determination for each context.
Do you know about any plans to support it?
Why do you need to create multiple streaming contexts at all?
TD
On Tue, Jul 15, 2014 at 3:43 PM, gorenuru goren...@gmail.com wrote:
Oh, sad to hear that :(
From my point of view, creating separate spark context for each stream is
to
expensive.
Also, it's annoying because we have to be
Hi Keith and gorenuru,
This patch (https://github.com/apache/spark/pull/1423) solves the
errors for me in my local tests. If possible, can you guys test this
out to see if it solves your test programs?
Thanks,
Zongheng
On Tue, Jul 15, 2014 at 3:08 PM, Zongheng Yang zonghen...@gmail.com wrote:
-
Thanks Soumya - I guess the next step from here is to move the MLlib model
out of the Spark application which simply does the training, and give it to the
client application which simply does the predictions. I will try the Kryo
library to physically serialize the object and trade it across machines /
Hello!
Just curious if anybody could respond to my original message, if anybody
knows how to set the configuration variables that are handled by
Jetty and not Spark's native framework... which is Akka, I think?
Thanks
On Thu, Jul 10, 2014 at 4:04 PM, Aris Vlasakakis a...@vlasakakis.com
Hi Tathagata,
I could see the output of count, but no SQL results. Running standalone is
meaningless for me, and I just run in my local single-node YARN cluster.
Thanks
On Tue, Jul 15, 2014 at 12:48 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Could you run it locally first to make
Yes, what Nick said is the recommended way. In most use cases, a Spark
Streaming program in production is not usually run from the shell. Hence,
we chose not to make the external stuff (Twitter, Kafka, etc.) available to
the Spark shell, to avoid the dependency conflicts brought in by them with Spark's
Because I want to have different streams with different durations. For example,
one triggers snapshot analysis every 5 minutes and another every 10 seconds.
On Tue, Jul 15, 2014 at 3:59 pm, Tathagata Das [via Apache Spark User List]
<ml-node+s1001560n984...@n3.nabble.com> wrote:
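(As an aside, a single StreamingContext can cover both cadences by layering a window on
the base batch interval; a minimal sketch, my own and not from this thread, with a
hypothetical socket source:)
import org.apache.spark.streaming.{Seconds, Minutes, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)      // hypothetical source

lines.count().print()                                     // every 10 seconds
lines.window(Minutes(5), Minutes(5)).count().print()      // snapshot every 5 minutes

ssc.start()
ssc.awaitTermination()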
Cool. So Michael's hunch was correct, it is a thread issue. I'm currently
using a tarball build, but I'll do a spark build with the patch as soon as
I have a chance and test it out.
Keith
On Tue, Jul 15, 2014 at 4:14 PM, Zongheng Yang zonghen...@gmail.com wrote:
Hi Keith gorenuru,
This