Hi,
We've got the same problem here (randomly happens) :
Unable to
find class: 6 4 Ú4Ú» 8 4î4Úº*Q|T4â` j4 Ǥ4ê´g8
4 ¾4Ú» 4 4Ú» pE4ʽ4ں*WsѴμˁ4ڻ4ʤ4ցbל4ڻ
It looks like when you configure SparkConf to use the KryoSerializer in
combination with an ActorReceiver, bad things happen. I modified the
ActorWordCount example program from
val sparkConf = new SparkConf().setAppName("ActorWordCount")
to
val sparkConf = new SparkConf()
Yeah, reduce() will leave you with one big collection of sets on the
driver. Maybe the set of all identifiers isn't so big -- even a hundred
million Longs isn't so much. I'm glad to hear cartesian works, but
can that scale? You're making an RDD of N^2 elements initially, which
is just vast.
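As a rough illustration (toy numbers, not from the thread) of that N^2 blow-up, assuming an existing SparkContext sc:

  // Hypothetical sketch: ids is a toy RDD of identifiers.
  val ids = sc.parallelize(1L to 10000L)   // N = 10,000
  val allPairs = ids.cartesian(ids)        // N^2 = 100,000,000 pairs materialized as an RDD
  println(allPairs.count())                // even counting this gets expensive at scale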
On Thu,
Hi,
I am using the Java api of Spark.
I wanted to know if there is a way to run some code in a manner that is
like the setup() and cleanup() methods of Hadoop Map/Reduce
The reason I need it is because I want to read something from the DB
according to each record I scan in my Function, and I
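The message is cut off above, but a common pattern (not necessarily what the poster ended up using) for setup()/cleanup()-style behaviour is mapPartitions, opening the DB connection once per partition. DbClient, enrich, and records are placeholder names:

  // Sketch only: open the connection once per partition, not per record.
  val enriched = records.mapPartitions { iter =>
    val db = DbClient.connect()                        // "setup": once per partition
    val results = iter.map(r => enrich(r, db)).toList  // materialize before closing the connection
    db.close()                                         // "cleanup": once per partition
    results.iterator
  }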
Can you just call fileStream or textFileStream in the second app, to
consume files that appear in HDFS / Tachyon from the first job?
On Thu, Jul 24, 2014 at 2:43 AM, Barnaby bfa...@outlook.com wrote:
If I save an RDD as a sequence file such as:
val wordCounts = words.map(x => (x,
Hello All,
I am a new user of Spark. I am using *cloudera-quickstart-vm-5.0.0-0-vmware*
to execute the sample examples of Spark.
I am very sorry for the silly and basic question.
I am not able to deploy and execute the sample examples of Spark.
Please suggest *how to start with Spark*.
Please help me
Hi,
I have a Scala application which I have launched on a Spark cluster. I
have the following statement trying to save to a folder on the master:
saveAsHadoopFile[TextOutputFormat[NullWritable,
Text]]("hdfs://masteripaddress:9000/root/test-app/test1/")
The application is executed successfully and
Are you sure the RDD that you were saving isn't empty!?
Are you seeing a _SUCCESS file in this location?
hdfs://masteripaddress:9000/root/test-app/test1/
(Do: hadoop fs -ls hdfs://masteripaddress:9000/root/test-app/test1/)
Thanks
Best Regards
On Thu, Jul 24, 2014 at 4:24 PM, lmk
Here's the complete overview: http://spark.apache.org/docs/latest/
And here's the quick-start guide:
http://spark.apache.org/docs/latest/quick-start.html
I would suggest downloading the Spark pre-compiled binaries
Hi Akhil,
I am sure that the RDD that I saved is not empty. I have tested it using
take.
But is there no way I can see the saved output physically, like in a
normal filesystem? Can't I view this folder, since I am already logged into
the cluster?
And, should I run hadoop fs -ls
This piece of code
saveAsHadoopFile[TextOutputFormat[NullWritable,Text]]("hdfs://masteripaddress:9000/root/test-app/test1/")
saves the RDD into HDFS, and yes, you can physically see the files using the
hadoop command (hadoop fs -ls /root/test-app/test1 - yes you need to login
to the cluster). In
Thanks Akhil.
I was able to view the files. Actually, I was trying to list them using
regular ls, and since it did not show anything I was concerned.
Thanks for showing me the right direction.
Regards,
lmk
If you want to connect to a DB in your program, you can use JdbcRDD (
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
)
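For reference, a minimal sketch of using JdbcRDD; the table, URL, and credentials are made up, and the SQL must contain two '?' placeholders for the partition bounds:

  import java.sql.{DriverManager, ResultSet}
  import org.apache.spark.rdd.JdbcRDD

  // Hypothetical table users(id BIGINT, name VARCHAR); connection details are placeholders.
  val users = new JdbcRDD(
    sc,
    () => DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass"),
    "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
    1L, 1000000L,                                   // bounds substituted into the two '?'s
    10,                                             // number of partitions
    (rs: ResultSet) => (rs.getLong(1), rs.getString(2))
  )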
2014-07-24 18:32 GMT+08:00 Yosi Botzer yosi.bot...@gmail.com:
Hi,
I am using the Java api of Spark.
I wanted to know if there
I am currently facing the same problem. Error snapshot below:
14-07-24 19:15:30 WARN [pool-3-thread-1] SendingConnection: Error
finishing connection to r64b22034.tt.net/10.148.129.84:47525
java.net.ConnectException: Connection timed out
at
Thank you Nishkam,
I have read your code. So, for the sake of my understanding, it seems that
for each spark context there is one executor per node? Can anyone confirm
this?
--
Martin Goodson | VP Data Science
(0)20 3397 1240
On Thu, Jul 24, 2014 at 6:12 AM, Nishkam
In my case I want to reach HBase. For every record with a userId I want to
get some extra information about the user and add it to the result record for
further processing
On Thu, Jul 24, 2014 at 9:11 AM, Yanbo Liang yanboha...@gmail.com wrote:
If you want to connect to DB in program, you can use
I understand that GraphX is not yet available for pyspark. I was wondering
if the Spark team has set a target release and timeframe for doing that
work?
Thank you,
Eric
First thing... Go into the Cloudera Manager and make sure that the Spark
service (master?) is started.
Marco
On Thu, Jul 24, 2014 at 7:53 AM, Sameer Sayyed sam.sayyed...@gmail.com
wrote:
Hello All,
I am new user of spark, I am using *cloudera-quickstart-vm-5.0.0-0-vmware*
for execute
Hi,
I ran Spark in standalone mode on a cluster and it went well for approximately
one hour; then the driver's output stopped with the following:
14/07/24 08:07:36 INFO MapOutputTrackerMasterActor: Asked to send map output
locations for shuffle 36 to spark@worker5.local:47416
14/07/24 08:07:36 INFO
Which code did you use? Is the problem caused by your own code or by something
in Spark itself?
On Tue, Jul 22, 2014 at 8:50 AM, hsy...@gmail.com hsy...@gmail.com wrote:
I have the same problem
On Sat, Jul 19, 2014 at 12:31 AM, lihu lihu...@gmail.com wrote:
Hi,
Everyone. I have a piece of
You will have to define your own stream-to-iterator function and use
socketStream. The function should return custom-delimited objects as the bytes
come in continuously; when there isn't enough data yet, the function should
block.
TD
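A rough sketch (not TD's code) of what such a stream-to-iterator function could look like; the host, port, and newline delimiter are assumptions, and ssc is an existing StreamingContext:

  import java.io.{BufferedReader, InputStream, InputStreamReader}
  import org.apache.spark.storage.StorageLevel

  // Converter: blocks on the socket and yields one record per newline-delimited chunk.
  def toRecords(in: InputStream): Iterator[String] = {
    val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
    Iterator.continually(reader.readLine())   // readLine() blocks until enough bytes arrive
      .takeWhile(_ != null)
  }

  val stream = ssc.socketStream("localhost", 9999, toRecords, StorageLevel.MEMORY_ONLY_SER)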
On Jul 23, 2014 6:52 PM, kytay kaiyang@gmail.com wrote:
Hi
Great - thanks for the clarification Aaron. The offer stands for me to
write some documentation and an example that covers this without leaving
*any* room for ambiguity.
--
Martin Goodson | VP Data Science
(0)20 3397 1240
On Thu, Jul 24, 2014 at 6:09 PM, Aaron
Whoops, I was mistaken in my original post last year. By default, there is
one executor per node per Spark Context, as you said.
spark.executor.memory is the amount of memory that the application
requests for each of its executors. SPARK_WORKER_MEMORY is the amount of
memory a Spark Worker is
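The message is truncated above, but as a quick sketch of where the two knobs live (the values are placeholders):

  import org.apache.spark.SparkConf

  // In conf/spark-env.sh on each worker machine (shell setting, shown as a comment):
  //   SPARK_WORKER_MEMORY=16g   # total memory the Worker can hand out to executors on that node
  // In the application, the per-executor request:
  val conf = new SparkConf()
    .setAppName("MyApp")                    // placeholder name
    .set("spark.executor.memory", "4g")     // memory requested for each of this app's executors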
Hi Sameer,
I think it is much easier to start using Spark in standalone mode on a single
machine. Last time I tried Cloudera Manager to deploy Spark, it wasn't very
straightforward and I hit a couple of obstacles along the way. Standalone mode,
however, is a very easy way to start exploring Spark.
We are also eagerly waiting for akka 2.3.4 support as we use Akka (and Spray)
directly in addition to Spark.
Yardena
Hi Sam,
I tried Spark on Cloudera a couple of months ago, and there were a lot of issues…
Fortunately, I was able to switch to Hortonworks and everything works perfectly. In
general, you can try two modes: standalone and via YARN. Personally, I found
using Spark via YARN more comfortable, especially for
I'm trying to run a simple pipeline using PySpark, version 1.0.1
I've created an RDD over a parquetFile and am mapping the contents with a
transformer function and now wish to write the data out to HDFS.
All of the executors fail with the same stack trace (below)
I do get a directory on HDFS,
This is being tracked here: https://issues.apache.org/jira/browse/SPARK-1812,
since it will also be needed for cross-building with Scala 2.11. Maybe we can
do it before that. Probably too late for 1.1, but you should open an issue for
1.2.
In that JIRA I linked, there's a pull request from a
Hi there,
This issue has been mentioned in:
http://apache-spark-user-list.1001560.n3.nabble.com/Java-IO-Stream-Corrupted-Invalid-Type-AC-td6925.html
However I'm starting a new thread since the issue is distinct from the above
topic's designated subject.
I'm test-running canonical conflation
Our system works with RDDs generated from Hadoop files. It processes each
record in a Hadoop file and for a subset of those records generates output
that is written to an external system via RDD.foreach. There are no
dependencies between the records that are processed.
If writing to the external
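The message is cut off here; for what it's worth, a common sketch for this write-to-external-system pattern uses foreachPartition so each partition shares one client connection (rdd, ExternalClient, and write are placeholder names):

  // Sketch only: one connection per partition, not per record.
  rdd.foreachPartition { records =>
    val client = ExternalClient.connect()
    try records.foreach(r => client.write(r))
    finally client.close()
  }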
Hi,
Is there a way to get the number of slaves/workers during runtime?
I searched online but didn't find anything :/ The application I'm working on
will run on different clusters corresponding to different deployment stages
(beta - prod). It would be great to get the number of slaves currently in
Hi Sarath,
I will try to reproduce the problem.
Thanks,
Yin
On Wed, Jul 23, 2014 at 11:32 PM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Hi Michael,
Sorry for the delayed response.
I'm using Spark 1.0.1 (pre-built version for hadoop 1). I'm running spark
programs on
Scala By the Bay (www.scalabythebay.org) is happy to confirm that our
Spark training on August 11-12 will be run by Databricks and By the
Bay together. It will be focused on Scala, and is the first Spark
Training at a major Scala venue. Spark is written in Scala, with the
unified data pipeline
Try sc.getExecutorStorageStatus.length
SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will
give you back an object per executor - the StorageStatus objects are what
drives a lot of the Spark Web UI.
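A small usage sketch; note that on a cluster the list typically includes the driver's own block manager, so the worker count may be the length minus one (and, as noted later in this thread, the value can take a few seconds after startup to settle):

  // Rough sketch: both calls return one entry per block manager (driver + executors).
  val numBlockManagers = sc.getExecutorStorageStatus.length
  val approxWorkers    = numBlockManagers - 1          // subtract the driver, roughly
  val memStatus        = sc.getExecutorMemoryStatus    // Map[executor address, (max, remaining) bytes]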
Anyone out there have a good configuration for emacs? Scala-mode sort of
works, but I'd love to see a fully-supported spark-mode with an inferior
shell. Searching didn't turn up much of anything.
Any emacs users out there? What setup are you using?
Cheers,
- SteveN
Hi,
The mllib.clustering.KMeans implementation supports a random or parallel
initialization mode to pick the initial centers. Is there a way to specify
the initial centers explicitly? It would be useful to have a setCenters()
method where we can explicitly specify the initial centers. (For e.g. R
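The request is cut off above; for reference, a minimal sketch of what the current API does expose (no explicit setCenters), where data is assumed to be an RDD[Vector] and the parameter values are placeholders:

  import org.apache.spark.mllib.clustering.KMeans

  // initializationMode can be "random" or "k-means||" (the parallel variant).
  val model = new KMeans()
    .setK(10)
    .setMaxIterations(20)
    .setInitializationMode("random")
    .run(data)                        // data: RDD[org.apache.spark.mllib.linalg.Vector]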
Hi Art,
I have some advice that isn't Spark-specific at all, so it doesn't
*exactly* address your questions, but you might still find it helpful. I
think using an implicit to add your retrying behavior might be useful. I
can think of two options:
1. enriching RDD itself, eg. to add a
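The message is cut off, but a rough sketch (names illustrative, not the author's code) of the "enrich RDD itself" option might look like this:

  import org.apache.spark.rdd.RDD

  // Sketch: an implicit class (kept inside an object so it can be imported) that
  // adds a retrying foreach to any RDD.
  object RetrySyntax {
    implicit class RetryingRDD[A](rdd: RDD[A]) {
      def foreachWithRetries(nTries: Int)(f: A => Unit): Unit =
        rdd.foreach { a =>
          var tries = 0
          var done  = false
          while (!done) {
            tries += 1
            try { f(a); done = true }
            catch { case e: Exception if tries < nTries => () }  // swallow and retry up to nTries
          }
        }
    }
  }
  // usage: import RetrySyntax._ ; myRdd.foreachWithRetries(3)(writeRecord)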
So this is good information for standalone, but how is memory distributed
within Mesos? There's coarse-grained mode where the executor stays active, and
there's fine-grained mode where it appears each task is its own process in
Mesos; how do memory allocations work in these cases? Thanks!
On Thu,
Whoops! Just realized I was retrying the function even on success. Didn't
pay enough attention to the output from my calls. Slightly updated
definitions:
class RetryFunction[-A](nTries: Int, f: A => Unit) extends Function1[A, Unit] {
  def apply(a: A): Unit = {
    var tries = 0
    var success = false
As a source, I have a text file with n rows that each contain m
comma-separated integers.
Each row is then converted into a feature vector with m features.
I've noticed that, given the same total file size and number of features, a
larger number of columns is much more expensive for training
Hi,
I'm doing the following:
def main(args: Array[String]) = {
val sparkConf = new SparkConf().setAppName("AvroTest").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val conf = new Configuration()
val job = new Job(conf)
val path = new Path("/tmp/a.avro")
val
Can anyone help me understand the key differences between the mapToPair,
flatMapToPair, and flatMap functions, and when to apply each of them?
You can set the Java option -Dsun.io.serialization.extendedDebugInfo=true to
have more information about the object printed. It will help you trace
down how the SparkContext is getting included in some kind of closure.
TD
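As a sketch of one way to pass that flag (the exact mechanism depends on how you launch; the property names below are the standard Spark 1.x ones and the app name is a placeholder):

  import org.apache.spark.SparkConf

  // The NotSerializableException is usually thrown on the driver when tasks are serialized,
  // so the flag matters most for the driver JVM (e.g. spark-submit --driver-java-options);
  // spark.executor.extraJavaOptions covers the executor side.
  val conf = new SparkConf()
    .setAppName("SerializationDebug")
    .set("spark.executor.extraJavaOptions", "-Dsun.io.serialization.extendedDebugInfo=true")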
On Thu, Jul 24, 2014 at 9:48 AM, lihu lihu...@gmail.com wrote:
Thanks, this is what I needed :) I should have searched more...
Something I noticed though: after the SparkContext is initialized, I had to
wait for a few seconds until sc.getExecutorStorageStatus.length returns the
correct number of workers in my cluster (otherwise it returns 1, for the
I have the streaming program writing sequence files. I can find one of the
files and load it in the shell using:
scala> val rdd = sc.sequenceFile[String,
Int](tachyon://localhost:19998/files/WordCounts/20140724-213930)
14/07/24 21:47:50 INFO storage.MemoryStore: ensureFreeSpace(32856) called
Yep, here goes!
Here are my environment vitals:
- Spark 1.0.0
- EC2 cluster with 1 slave spun up using spark-ec2
- twitter4j 3.0.3
- spark-shell called with --jars argument to load
spark-streaming-twitter_2.10-1.0.0.jar as well as all the twitter4j
jars.
Now, while I’m in the
Hi Sarath,
Have you tried the current branch 1.0? If not, can you give it a try and
see if the problem can be resolved?
Thanks,
Yin
On Thu, Jul 24, 2014 at 11:17 AM, Yin Huai yh...@databricks.com wrote:
Hi Sarath,
I will try to reproduce the problem.
Thanks,
Yin
On Wed, Jul 23,
The Pair ones return a JavaPairRDD, which has additional operations on
key-value pairs. Take a look at
http://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs
for details.
Matei
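A rough Scala analogue of the distinction (in the Java API the pair-producing versions are mapToPair and flatMapToPair, and they give you a JavaPairRDD); the data here is made up:

  val lines = sc.parallelize(Seq("a b", "c"))

  // map: exactly one output element per input (Java: map, or mapToPair when the output is a pair)
  val lineLengths = lines.map(l => (l, l.length))

  // flatMap: zero or more output elements per input (Java: flatMap, or flatMapToPair for pairs)
  val wordPairs = lines.flatMap(l => l.split(" ").map(w => (w, 1)))

  // Once you have key-value pairs, key-based operations such as reduceByKey become available.
  val counts = wordPairs.reduceByKey(_ + _)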
On Jul 24, 2014, at 3:41 PM, abhiguruvayya sharath.abhis...@gmail.com wrote:
Can
You can refer to this topic:
http://www.mapr.com/developercentral/code/loading-hbase-tables-spark
2014-07-24 22:32 GMT+08:00 Yosi Botzer yosi.bot...@gmail.com:
In my case I want to reach HBase. For every record with userId I want to
get some extra information about the user and add it to result
I can successfully run my code in local mode using spark-submit (--master
local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode.
Any hints as to what the problem is? Is it a closure serialization problem? How
can I debug it? Your answers would be very helpful.
14/07/25 11:48:14
On Thu, Jul 24, 2014 at 9:52 AM, Arun Kumar toga...@gmail.com wrote:
While using the Pregel API for iterations, how can I figure out which superstep
the iteration is currently in?
The Pregel API doesn't currently expose this, but it's very straightforward
to modify Pregel.scala