The problem is that I get the same results every time
On Friday, July 11, 2014 7:22 PM, Ameet Talwalkar atalwal...@gmail.com wrote:
Hi Wanda,
As Sean mentioned, K-means is not guaranteed to find an optimal answer, even
for seemingly simple toy examples. A common heuristic to deal with this
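One common heuristic is to run K-means several times with different random initializations and keep the best clustering. A minimal sketch, assuming MLlib's 1.0-era KMeans.train with a runs parameter (the input path and k are hypothetical):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line of a hypothetical input file into a dense vector.
val data = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// runs = 10 tries 10 random initializations and keeps the clustering with
// the lowest cost, which mitigates bad local optima.
val model = KMeans.train(data, k = 2, maxIterations = 20, runs = 10)
println(model.clusterCenters.mkString(", "))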
You should return an iterator in mapPartitionsWithIndex. This is from
the programming guide
(http://spark.apache.org/docs/latest/programming-guide.html):
mapPartitionsWithIndex(func): Similar to mapPartitions, but also
provides func with an integer value representing the index of the
partition,
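A minimal sketch of returning an iterator from mapPartitionsWithIndex (the RDD and the tagging logic are just illustrative):

val rdd = sc.parallelize(1 to 10, 3)
// The function must return an Iterator; transform the partition's iterator
// lazily instead of returning a materialized collection.
val tagged = rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => (index, x))
}
tagged.collect().foreach(println)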
Hi,
I encountered an error when testing SVM (example one) on very large sparse
data. The dataset I ran on was a toy dataset with only ten examples, but
13-million-dimensional sparse vectors with a few thousand non-zero entries.
The error is shown below. I am wondering whether this is a bug or I am missing
Making Catalyst independent of Spark is the goal of Catalyst; it may just need time
and evolution.
I noticed that the package org.apache.spark.sql.catalyst.util
imports org.apache.spark.util.{Utils => SparkUtils},
so that Catalyst has a dependency on Spark core.
I'm not sure whether it will be replaced by
I am very interested in the original question as well: is there any
list (even if it is simply in the code) of all supported syntax for
Spark SQL?
On Mon, Jul 14, 2014 at 6:41 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Are you sure the code running on the cluster has been updated?
You need to set a larger `spark.akka.frameSize`, e.g., 128, for the
serialized weight vector. There is a JIRA about switching
automatically between sending through akka or broadcast:
https://issues.apache.org/jira/browse/SPARK-2361 . -Xiangrui
On Mon, Jul 14, 2014 at 12:15 AM, crater
Hi Ankur,
FYI - in a naive attempt to enhance your solution, I managed to create
MergePatternPath. I think it works in the expected way (at least for the traversal
problem in the last email).
I modified your code a bit. Also, instead of EdgePattern I used a list of
functions that match the whole edge
It worked! I was struggling for a week. Thanks a lot!
On Mon, Jul 14, 2014 at 12:31 PM, Xiangrui Meng [via Apache Spark User
List] ml-node+s1001560n9591...@n3.nabble.com wrote:
You should return an iterator in mapPartitionsWithIndex. This is from
the programming guide
Hi, I encountered a weird problem in Spark SQL.
I use sbt/sbt hive/console to go into the shell.
I am testing filter push-down using Catalyst.
scala> val queryPlan = sql("select value from (select key,value from src) a
where a.key=86")
scala> queryPlan.baseLogicalPlan
res0:
Hi guys,
I want to use Elasticsearch and HBase in my Spark project, and I want to create
a test. I brought up ES and ZooKeeper, but if I put val htest = new
HBaseTestingUtility() into my app I get a strange exception (at compilation
time, not runtime).
https://gist.github.com/b0c1/4a4b3f6350816090c3b5
Hi everyone,
Currently I am working on parallelizing a machine learning algorithm using
a Microsoft HDInsight cluster. I tried running my algorithm on Hadoop
MapReduce, but since my algorithm is iterative the job scheduling overhead
and data loading overhead severely limit the performance of my
Hi,
I am using spark-sql 1.0.1 to load Parquet files generated from the method
described in:
https://gist.github.com/massie/7224868
When I try to submit a select query with columns of type fixed length byte
array, the following error pops up:
14/07/14 11:09:14 INFO scheduler.DAGScheduler: Failed
I am getting an error saying:
Exception in thread "delete Spark temp dir
C:\Users\shawn\AppData\Local\Temp\spark-b4f1105c-d67b-488c-83f9-eff1d1b95786"
java.io.IOException: Failed to delete:
C:\Users\shawn\AppData\Local\Temp\spark-b4f1105c-d67b-488c-83f9-eff1d1b95786\tmppr36zu
at
Nicholas, thanks nevertheless! I am going to spend some time to try and
figure this out and report back :-)
Ognen
On 7/13/14, 7:05 PM, Nicholas Chammas wrote:
I actually never got this to work, which is part of the reason why I
filed that JIRA. Apart from using --jar when starting the
Hey, My question is for this situation:
Suppose we have 10 files, each containing a list of features in each row.
The task is: for each file, cluster the features in that file and write the
corresponding cluster along with them to a new file. So we have to generate
10 more files by applying
Thanks for your answers Shuo Xiang and Aaron Davidson!
Regards,
--
*Gonzalo Zarza* | PhD in High-Performance Computing | Big-Data Specialist |
*GLOBANT* | AR: +54 11 4109 1700 ext. 15494 | US: +1 877 215 5230 ext. 15494
Hi,
queryPlan.baseLogicalPlan is not the plan used for execution. Actually,
the baseLogicalPlan
of a SchemaRDD (queryPlan in your case) is just the parsed plan (the parsed
plan will be analyzed and then optimized; finally, a physical plan will be
created). The plan shows up after you execute val
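For reference, a hedged sketch of where the other plans live in the 1.0-era hive/console (queryExecution is the entry point; the field names are from that release and may differ in later ones):

val queryPlan = sql("select value from (select key, value from src) a where a.key = 86")
println(queryPlan.queryExecution.analyzed)       // plan after analysis
println(queryPlan.queryExecution.optimizedPlan)  // plan after optimization (filter push-down shows up here)
println(queryPlan.queryExecution.executedPlan)   // physical plan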
I'm a Spark and HDInsight novice, so I could be wrong...
HDInsight is based on HDP2, so my guess here is that you have the option of
installing/configuring Spark in cluster mode (YARN) or in standalone mode
and packaging the Spark binaries with your job.
Everything I seem to look at is related to
Hi Patrick,
This is great news, but I nearly missed the announcement because it had
scrolled off the folder view that I have Spark user list messages go
to: 40+ new threads since you sent the email out on Friday evening.
You might consider having someone on your team create a
Looks like going with cluster mode is not a good idea:
http://azure.microsoft.com/en-us/documentation/articles/hdinsight-administer-use-management-portal/
Seems like a non-HDInsight VM might be needed to make it the Spark master
node.
Marco
On Mon, Jul 14, 2014 at 12:43 PM, Marco Shaw
I am not sure how to write it… I tried writing to the local file system using
FileWriter and PrintWriter. I tried it inside the while loop. I am able to get
the text and print it, but it fails when I use regular Java classes.
Shouldn't I use regular Java classes here? Can I write to only
Hi Xiangrui,
Where can I set spark.akka.frameSize?
If you use Scala, you can do:
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("Logistic regression SGD fixed")
  .set("spark.akka.frameSize", "100")
  .setExecutorEnv("SPARK_JAVA_OPTS", "-Dspark.akka.frameSize=100")
var sc = new
Hi,
I am new to Spark and Scala and I am trying to do some aggregations on a
JSON file stream using Spark Streaming. I am able to parse the JSON string
and it is converted to Map(id -> 123, name -> srini, mobile -> 12324214,
score -> 123, test_type -> math). Now I want to use a groupBy function on each
Hello,
I'm using the spark-0.9.1-bin-hadoop1 distribution, and the ec2/spark-ec2
script within it to spin up a cluster. I tried running my processing just
using the default (ephemeral) HDFS configuration, but my job errored out,
saying that there was no space left. So now I'm trying to increase
You can find the parser here:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala
In general the Hive parser provided by HQL is much more complete at the
moment. Long term we will likely stop using parser combinators and either
Hi,
My company is strongly considering implementing a recommendation engine that is
built off of statistical models using Spark. We attended the Spark Summit and
were incredibly impressed with the technology and the entire community. Since
then, we have been exploring the technology and
I understand that the question is very unprofessional, but I am a newbie.
If not here, could you share a link to somewhere I can ask such questions?
But please answer.
On Mon, Jul 14, 2014 at 6:52 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com
wrote:
Hey, My question is for this situation:
Hi Krishna,
Thanks for your help. Are you able to get your 29M data running yet? I fixed
the previous problem by setting a larger spark.akka.frameSize, but now I get
some other errors, shown below. Did you get these errors before?
14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote
Rahul, I'm not sure what you mean by your question being very
unprofessional. You can feel free to ask such questions here. You may
or may not receive an answer, and you shouldn't necessarily expect to have
your question answered within five hours.
I've never tried to do anything like your
You currently can't use SparkContext inside a Spark task, so in this case you'd
have to call some kind of local K-means library. One example you can try to use
is Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can then load your text
files as an RDD of strings with SparkContext.wholeTextFiles
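A rough sketch of that approach; localKMeans below is only a stand-in for whatever local library (e.g. Weka) you end up calling, and the paths and k are made up:

// Placeholder for a real local clustering call (e.g. via Weka); here it just
// assigns rows round-robin so the sketch runs end to end.
def localKMeans(rows: Array[String], k: Int): Array[Int] =
  rows.indices.map(_ % k).toArray

// One record per file: (path, full file contents).
val files = sc.wholeTextFiles("hdfs:///data/feature-files")

val clustered = files.map { case (path, contents) =>
  val rows = contents.split("\n")
  val assignments = localKMeans(rows, 5)
  (path, rows.zip(assignments).toSeq)
}
clustered.saveAsTextFile("hdfs:///data/clustered-output")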
That is exactly the same error that I got. I am still having no success.
Regards,
Krishna
On Mon, Jul 14, 2014 at 11:50 AM, crater cq...@ucmerced.edu wrote:
Hi Krishna,
Thanks for your help. Are you able to get your 29M data running yet? I fixed
the previous problem by setting a larger
Hi there,
I think the question is interesting; a spark of sparks = spark
I wonder if you can use the spark job server (
https://github.com/ooyala/spark-jobserver)?
So in the spark task that requires a new spark context, instead of creating
it in the task, contact the job server to create one and
Stepping back a bit, if you just want to write Flume data to HDFS, you can
use Flume's HDFS sink for that.
Trying to do this using Spark Streaming and SparkFlumeEvent is
unnecessarily complex. And I guess it is tricky to write the raw bytes from
the SparkFlumeEvent into a file. If you want to do
Thanks a lot for replying back.
Actually, I am running the SparkPageRank example with a 160GB heap (I am sure
the problem is not GC because the excess time is being spent in Java code
only).
What I have observed in the JProfiler and OProfile outputs is that the amount of
time spent in the following 2
You have to import StreamingContext._ to enable groupByKey operations on
DStreams. After importing that, you can apply groupByKey on any DStream
of key-value pairs (e.g. DStream[(String, Int)]). The
data in each pair RDD will be grouped by the first element in the tuple as
the
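A minimal sketch (lines is assumed to be an existing DStream[String]):

import org.apache.spark.streaming.StreamingContext._  // enables pair-DStream operations

val pairs = lines.map(word => (word, 1))  // DStream of key-value pairs
val grouped = pairs.groupByKey()          // values grouped per key, within each batch
grouped.print()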
Seems like it is related. Possibly those PRs that Andrew mentioned are
going to fix this issue.
On Fri, Jul 11, 2014 at 5:51 AM, Haopu Wang hw...@qilinsoft.com wrote:
I saw some exceptions like this in the driver log. Can you shed some
light? Is it related to the behaviour?
14/07/11
That depends on your requirements. If you want to process the 250 GB input
file as a stream to emulate the stream of data, then it should be split
into files (such that event ordering is maintained in those splits, if
necessary). And then those splits should be moved one-by-one in the
directory
Are you increasing the number of parallel tasks with cores as well? With more
tasks there will be more data communicated and hence more calls to these
functions.
Unfortunately contention is kind of hard to measure, since often the result is
that you see many cores idle as they're waiting on a
When you are sending data using simple socket code to send messages, are
those messages \n delimited? If they're not, then the receiver of
socketTextStream won't identify them as separate events, and will keep buffering
them.
TD
On Sun, Jul 13, 2014 at 10:49 PM, kytay kaiyang@gmail.com wrote:
Hi
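To illustrate the point above, a minimal sketch of a test sender that emits newline-delimited messages for socketTextStream to consume (the port and message format are arbitrary):

import java.io.PrintWriter
import java.net.ServerSocket

val server = new ServerSocket(9999)
val socket = server.accept()                  // wait for socketTextStream to connect
val out = new PrintWriter(socket.getOutputStream, true)
(1 to 100).foreach { i =>
  out.println(s"event $i")                    // println appends the '\n' delimiter
  Thread.sleep(100)
}
out.close(); socket.close(); server.close()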
Are you compiling it within Spark using Spark's recommended way (see doc
web page)? Or are you compiling it in your own project? In the latter case,
make sure you are using Scala 2.10.4.
TD
On Sun, Jul 13, 2014 at 6:43 AM, Mahebub Sayyed mahebub...@gmail.com
wrote:
Hello,
I am referring
This is not supported yet, but there is a PR open to fix it:
https://issues.apache.org/jira/browse/SPARK-2446
On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
I am using spark-sql 1.0.1 to load Parquet files generated from the method
described in:
The problem is not really with local[1] or local. The problem arises when
there are more input streams than there are cores.
But I agree, for people who are just beginning to use it by running it
locally, there should be a check addressing this.
I made a JIRA for this.
Yeah, sadly this dependency was introduced when someone consolidated the
logging infrastructure. However, the dependency should be very small and
thus easy to remove, and I would like catalyst to be usable outside of
Spark. A pull request to make this possible would be welcome.
Ideally, we'd
What sort of nested query are you talking about? Right now we only support
nested queries in the FROM clause. I'd like to add support for other cases
in the future.
On Sun, Jul 13, 2014 at 4:11 AM, anyweil wei...@gmail.com wrote:
Or is it supported? I know I could do it myself with
I'm trying to run a job that includes an invocation of a memory- and
compute-intensive multithreaded C++ program, and so I'd like to run one
task per physical node. Using rdd.coalesce(# nodes) seems to just allocate
one task per core, and so runs out of memory on the node. Is there any way
to give the
Hi All,
I've used the spark-ec2 scripts to build a simple 1.0.1 Standalone cluster on
EC2. It appears that the spark-submit script is not bundled with a spark-ec2
install. Given that: What is the recommended way to execute spark jobs on a
standalone EC2 cluster? Spark-submit provides
Handling of complex types is somewhat limited in SQL at the moment. It'll
be more complete if you use HiveQL.
That said, the problem here is you are calling .name on an array. You need
to pick an item from the array (using [..]) or use something like a lateral
view explode.
On Sat, Jul 12,
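A hedged sketch of those two options in the 1.0-era hive console (hql, the table name people, and an array-of-structs column names are all assumptions for illustration):

// Pick one element of the array by index...
hql("SELECT names[0].name FROM people")

// ...or flatten it with a lateral view explode.
hql("SELECT n.name FROM people LATERAL VIEW explode(names) t AS n")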
Hello Spark community,
I would like to write an application in Scala that is a model server. It
should have an MLlib Linear Regression model that is already trained on
some big set of data, and then is able to repeatedly call
myLinearRegressionModel.predict() many times and return the result.
Please look at the following.
https://github.com/ooyala/spark-jobserver
http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
https://github.com/EsotericSoftware/kryo
You can train your model, convert it to PMML, and return that to your client
OR
you can train your model and write that
I don't have a solution for you (sorry), but do note that
rdd.coalesce(numNodes) keeps data on the same nodes where it was. If you
set shuffle=true then it should repartition and redistribute the data. But
it uses the hash partitioner according to the ScalaDoc - I don't know of
any way to supply a
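A minimal sketch of the shuffle variant (numNodes is hypothetical):

val numNodes = 8
// With shuffle = true, coalesce goes through a shuffle (hash partitioning)
// and redistributes the data instead of keeping it where it already sits.
val repartitioned = rdd.coalesce(numNodes, shuffle = true)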
Hi all,
A newbie question: I start a Spark YARN application through spark-submit.
How do I kill this app? I can kill the YARN app with yarn application -kill
appid, but the application master is still running. What's the proper way
to shut down the entire app?
Best,
Siyuan
Hi Tathagata,
It seems repartition does not necessarily force Spark to distribute the
data into different executors. I have launched a new job which uses
repartition right after I received data from Kafka. For the first two
batches, the reduce stage used more than 80 executors. Starting from the
I think coalesce with shuffle=true will force it to have one task per node.
Without that, it might be that due to data locality it decides to launch
multiple ones on the same node even though the total # of tasks is equal to the
# of nodes.
If this is the *only* thing you run on the cluster,
Yeah, I'd just add a spark-util that has these things.
Matei
On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com wrote:
Yeah, sadly this dependency was introduced when someone consolidated the
logging infrastructure. However, the dependency should be very small and
thus
The script should be there, in the spark/bin directory. What command did you
use to launch the cluster?
Matei
On Jul 14, 2014, at 1:12 PM, Josh Happoldt josh.happo...@trueffect.com wrote:
Hi All,
I've used the spark-ec2 scripts to build a simple 1.0.1 Standalone cluster on
EC2. It
Depending on how your C++ program is designed, maybe you can feed the data
from multiple partitions into the same process? Getting the results back
might be tricky. But that may be the only way to guarantee you're only
using one invocation per node.
On Mon, Jul 14, 2014 at 5:12 PM, Matei Zaharia
Continuing to debug with Scala, I tried this locally with enough memory
(10g) and it is able to count the dataset. With more memory (for executor
and driver) in a cluster it still fails. The data is about 2 GB; it is
30k * 4k doubles.
On Sat, Jul 12, 2014 at 6:31 PM, Aaron Davidson
Hi,
Thanks for your reply... I imported StreamingContext and right now I am
getting my DStream as something like
Map(id -> 123, name -> srini, mobile -> 12324214, score -> 123, test_type
-> math)
Map(id -> 321, name -> vasu, mobile -> 73942090, score -> 324, test_type
-> sci)
Map(id -> 432, name ->,
I restarted Spark Master with spark-0.9.1 and SparkR was able to communicate
with the Master. I am using the latest SparkR pkg-e1f95b6. Maybe it has
problem communicating to Spark 1.0.0?
Can you give me a screenshot of the stages page in the web UI, the Spark
logs, and the code that is causing this behavior? This seems quite weird to
me.
TD
On Mon, Jul 14, 2014 at 2:11 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tathagata,
It seems repartition does not necessarily
In general it may be a better idea to actually convert the records from
hashmaps to a specific data structure. Say
case class Record(id: Int, name: String, mobile: String, score: Int,
test_type: String, ...)
Then you should be able to do something like
val records = jsonf.map(m =>
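A slightly fuller sketch of that idea (the field names are guesses based on the maps shown earlier in the thread, and jsonf is assumed to be a DStream of Map[String, String]):

case class Record(id: Int, name: String, mobile: String, score: Int, testType: String)

val records = jsonf.map { m =>
  Record(m("id").toInt, m("name"), m("mobile"), m("score").toInt, m("test_type"))
}

// Grouping then becomes straightforward, e.g. keying by test type:
val byType = records.map(r => (r.testType, r))   // key-value pairs for groupByKey etc.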
I'm using Spark 1.0.0 (a three-week-old build of the latest).
Along the lines of this tutorial
http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html
, I want to read some tweets from twitter.
When trying to execute in the Spark-Shell, I get
The tutorial
The Twitter functionality is not available through the shell.
1) We separated this non-core functionality into separate subprojects so
that their dependencies do not collide with/pollute those of core Spark.
2) A shell is not really the best way to start a long-running stream.
It's best to use
I just wanted to send out a quick note about a change in the handling of
strings when loading / storing data using parquet and Spark SQL. Before,
Spark SQL did not support binary data in Parquet, so all binary blobs were
implicitly treated as Strings. 9fe693
Hi All,
A couple of days ago, I tried to integrate SQL and streaming together. My
understanding is that I can transform the RDDs from a DStream into SchemaRDDs and execute
SQL on each RDD. But I had no luck.
Would you guys help me take a look at my code? Thank you very much!
object KafkaSpark {
def main(args:
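A minimal sketch of that pattern; the 1.0-era names createSchemaRDD and registerAsTable, the Event shape, and the dstream variable are all assumptions for illustration:

import org.apache.spark.sql.SQLContext

case class Event(word: String, score: Int)       // hypothetical record shape

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                // implicit RDD[Product] -> SchemaRDD

// dstream is assumed to be an existing DStream[Event], e.g. parsed from Kafka.
dstream.foreachRDD { rdd =>
  rdd.registerAsTable("events")                  // 1.0-era name; later releases use registerTempTable
  val totals = sqlContext.sql("SELECT word, SUM(score) FROM events GROUP BY word")
  totals.collect().foreach(println)
}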
Trying to answer your questions as concisely as possible:
1. In the current implementation, the entire state RDD needs to be loaded for
any update. It is a known limitation that we want to overcome in the
future. Therefore the state DStream should not be persisted to disk, as all
the data in the state
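A minimal sketch of the stateful pattern being discussed, assuming the standard updateStateByKey API (ssc and pairs, a DStream[(String, Int)], are assumed to exist):

import org.apache.spark.streaming.StreamingContext._

ssc.checkpoint("hdfs:///checkpoints/state")      // stateful DStreams require checkpointing

val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newValues.sum)

val runningCounts = pairs.updateStateByKey[Int](updateFunc)
runningCounts.print()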
Hello everyone,
I'm an undergrad working on a summarization project. I've created a
summarizer in normal Spark and it works great; however, I want to write it
for Spark Streaming to increase its functionality. Basically I take in a
bunch of text and get the most popular words as well as most
Then yarn application -kill appid should work. This is what I did 2 hours ago.
Sorry I cannot provide more help.
Sent from my iPhone
On 14 Jul, 2014, at 6:05 pm, hsy...@gmail.com hsy...@gmail.com wrote:
yarn-cluster
On Mon, Jul 14, 2014 at 2:44 PM, Jerry Lam chiling...@gmail.com
Before yarn application -kill, if you do jps you'll see a list
of SparkSubmit and ApplicationMaster processes.
After you use yarn application -kill, you only kill the SparkSubmit.
On Mon, Jul 14, 2014 at 4:29 PM, Jerry Lam chiling...@gmail.com wrote:
Then yarn application -kill appid should work. This is
Could you elaborate on what is the problem you are facing? Compiler error?
Runtime error? Class-not-found error? Not receiving any data from Kafka?
Receiving data but SQL command throwing error? No errors but no output
either?
TD
On Mon, Jul 14, 2014 at 4:06 PM, hsy...@gmail.com
Why doesn't something like this work? If you want a continuously updated
reference to the top counts, you can use a global variable.
var topCounts: Array[(String, Int)] = null
sortedCounts.foreachRDD { rdd =>
  val currentTopCounts = rdd.take(10)
  // print currentTopCounts or whatever
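A completed version of that sketch (sortedCounts is assumed to be a DStream[(String, Int)]):

var topCounts: Array[(String, Int)] = Array.empty

sortedCounts.foreachRDD { rdd =>
  val currentTopCounts = rdd.take(10)
  topCounts = currentTopCounts     // continuously updated global reference
  currentTopCounts.foreach(println)
}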
Oh yes, this was a bug and it has been fixed. Check out the master
branch!
https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC
TD
On Mon, Jul
No errors but no output either... Thanks!
On Mon, Jul 14, 2014 at 4:59 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Could you elaborate on what is the problem you are facing? Compiler error?
Runtime error? Class-not-found error? Not receiving any data from Kafka?
Receiving data but
Thanks. Can I see somewhere in the API docs that a class is not available in the
shell, or do I have to find out by trial and error?
Is it on a standalone server? There are several settings worth checking:
1) the number of partitions, which should match the number of cores
2) the driver memory (you can see it from the executor tab of the Spark
web UI and set it with --driver-memory 10g)
3) the version of Spark you were running
Best,
I tried installing the latest Spark 1.0.1 and SparkR couldn't find the master
either. I restarted with Spark 0.9.1 and SparkR was able to find the
master. So, there seemed to be something that changed after Spark 1.0.0.
Can you make sure you are running locally on more than one local core? You
could set the master in the SparkConf as conf.setMaster("local[4]"). Then
see if there are jobs running on every batch of data in the Spark web UI
(running on localhost:4040). If you still don't get any output, first try a
simple
I guess this is not clearly documented. At a high level, any class that is
in the package
org.apache.spark.streaming.XXX where XXX is in { twitter, kafka, flume,
zeromq, mqtt }
is not available in the Spark shell.
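If you need those classes in a compiled project instead, you add the matching artifact as a dependency; e.g. for Twitter, an sbt line along these lines (adjust the version to your Spark version):

libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.1"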
I have added this to the larger JIRA of things-to-add-to-streaming-docs
On Mon, Jul 14, 2014 at 6:52 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
The twitter functionality is not available through the shell.
I've been processing Tweets live from the shell, though not for a long
time. That's how I uncovered the problem with the Twitter receiver not
I am running Spark 1.0.1 on a 5-node YARN cluster. I have set the
driver memory to 8G and the executor memory to about 12G.
Regards,
Krishna
On Mon, Jul 14, 2014 at 5:56 PM, Xiangrui Meng men...@gmail.com wrote:
Is it on a standalone server? There are several settings worthing checking:
1)
BTW you can see the number of parallel tasks in the application UI
(http://localhost:4040) or in the log messages (e.g. when it says progress:
17/20, that means there are 20 tasks total and 17 are done). Spark will try to
use at least one task per core in local mode so there might be more of
Hi Spark!
What is the purpose of the randomly assigned SPARK_WORKER_PORT?
From the documentation it says it is used to join a cluster, but it's not clear to me
how a random port could be used to communicate with other members of a
Spark pool.
This question might be grounded in my ignorance ... if so
Hi,
I am using Spark 1.0.1. I am using the following piece of code to parse a
JSON file. It is based on the code snippet in the Spark SQL programming
guide. However, the compiler outputs an error stating:
java.lang.NoSuchMethodError:
I used queryPlan.queryExecution.analyzed to get the logical plan.
It works.
And what you explained to me is very useful.
Thank you very much.
They do not work in the new 1.0.1 either.
Have you upgraded the cluster where you are running this to 1.0.1 as
well? A NoSuchMethodError
almost always means that the class files available at runtime are different
from those that were there when you compiled your program.
On Mon, Jul 14, 2014 at 7:06 PM, SK skrishna...@gmail.com wrote:
Yes, the documentation is actually a little outdated. We will get around to
fixing it shortly. Please use --driver-cores or --executor-cores instead.
2014-07-14 19:10 GMT-07:00 cjwang c...@cjwang.us:
They do not work in the new 1.0.1 either.
Did you update the Spark version recently, after which you noticed
this problem? Because if you were using Spark 0.8 or below, then Twitter
would have worked in the Spark shell. In Spark 0.9, we moved those
dependencies out of core Spark so that they can be updated more freely without
If we're talking about the issue you captured in SPARK-2464
https://issues.apache.org/jira/browse/SPARK-2464, then it was a newly
launched EC2 cluster on 1.0.1.
On Mon, Jul 14, 2014 at 10:48 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Did you make any updates in Spark version
Hi all, when running GraphX applications on Spark, the task scheduler may schedule
some tasks to be executed on RACK_LOCAL executors, but the tasks halt in
that case, repeatedly printing the following log information:
14-07-14 15:59:14 INFO [Executor task launch worker-6]
Hi,
congratulations on the release! I'm always pleased to see how features pop
up in new Spark versions that I had added for myself in a very hackish way
before (such as JSON support for Spark SQL).
I am wondering if there is any good way to learn early about what is going
to be in upcoming
I'm wondering if anyone has had success with an EC2 deployment of the
https://github.com/apache/spark/tree/branch-1.0-jdbc branch that Michael
Armbrust referenced in his Unified Data Access with Spark SQL
Eager to know about this issue too; does anyone know how?
You can change this setting through SparkContext.hadoopConfiguration, or put
the conf/ directory of your Hadoop installation on the CLASSPATH when you
launch your app so that it reads the config values from there.
Matei
On Jul 14, 2014, at 8:06 PM, valgrind_girl 124411...@qq.com wrote:
eager
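A minimal sketch of the first option, assuming the quoted question is about the HDFS replication factor used when saving an RDD (the output path is made up):

// Applies to files written through this SparkContext's Hadoop configuration.
sc.hadoopConfiguration.set("dfs.replication", "2")
rdd.saveAsTextFile("hdfs:///output/path")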
Hi,
I've got a socketTextStream through which I'm reading input. I have three
DStreams, all of which are the same window operation over that
socketTextStream. I have a four-core machine. As we've been covering
lately, I have to give a cores parameter to my StreamingSparkContext:
ssc = new
Oh right, that could have happened only after Spark 1.0.0. So let me
clarify. At some point, you were able to access TwitterUtils from the Spark
shell using Spark 1.0.0+? If yes, then what change in Spark caused it to
not work any more?
TD
On Mon, Jul 14, 2014 at 7:52 PM, Nicholas Chammas
(1) What is the number of partitions? Is it the number of workers per node?
(2) I already set the driver memory pretty big, 25g.
(3) I am running Spark 1.0.1 in a standalone cluster with 9 nodes; one of them
works as the master, the others are workers.
Hi Team,
Does this issue also occur with the JavaStreamingContext.textFileStream(hdfsfolderpath)
API? Please confirm. If yes, could you please help me fix this
issue? I'm using Spark version 1.0.0.
Regards,
Rajesh
On Tue, Jul 15, 2014 at 5:42 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Oh