I think you may be missing a key word here. Are you saying that the machine
has multiple interfaces and it is not using the one you expect or the
receiver is not running on the machine you expect?
On Sep 26, 2014 3:33 AM, centerqi hu cente...@gmail.com wrote:
Hi all
My code is as follows:
Hi Team,
Could you please respond to my request below.
Regards,
Rajesh
On Thu, Sep 25, 2014 at 11:38 PM, Madabhattula Rajesh Kumar
mrajaf...@gmail.com wrote:
Hi Team,
Can I use Actors in Spark Streaming based on the event type? Could you please
review the test program below and let me know if
Could you advise on the best practice for using named tuples in Scala
with Spark RDDs.
Currently we can access a tuple field by its number:
RDD.map{_._2}
But we would like to write something like:
RDD.map{_.itemId}
This would be helpful for debugging purposes.
I think you are simply looking for a case class in Scala. It is a
simple way to define an object with named, typed fields.
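A minimal sketch of what that could look like (the field names are made up to match the example above; assumes an existing SparkContext `sc`):

case class Item(itemId: Long, price: Double)

val items = sc.parallelize(Seq(Item(1L, 9.99), Item(2L, 4.50)))
items.map(_.itemId).collect()   // Array(1L, 2L)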
On Fri, Sep 26, 2014 at 8:31 AM, rzykov rzy...@gmail.com wrote:
Could you advise on the best practice for using named tuples in Scala
with Spark RDDs.
Currently we can
Hi, I'm querying a big table using Spark SQL. I see very long GC time in
some stages. I wonder if I can improve it by tuning the storage
parameter.
The question is: the schemaRDD has been cached with the cacheTable()
function. So is the cached schemaRDD part of the memory storage controlled
by the
Hello Andrew!
Thanks for the reply. Which logs and at what level should I check? Driver,
master or worker?
I found this on the master node, but there is only an ANY locality requirement.
Here is the driver (spark sql) log -
https://gist.github.com/13h3r/c91034307caa33139001 and one of the worker
logs -
Just wanted to answer my question in case someone else runs into the same
problem.
It is related to the problem discussed here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html
and here:
https://issues.apache.org/jira/browse/SPARK-2121
Hi Helene,
Thanks for the report. In Spark 1.1, we use a special exit code to
indicate that |SparkSubmit| failed because a class was not found. But
unfortunately I chose a not-so-special exit code: 1… So whenever the
process exits with exit code 1, the |-Phive| error message is shown. A
PR that
Hi Twinkle,
The failure is caused by case sensitivity. The temp table actually
stores the original un-analyzed logical plan, so the field names remain
capitalized (F1, F2, etc.). I believe this issue has already been fixed by
PR #2382 https://github.com/apache/spark/pull/2382. As a workaround,
you
Can someone explain the motivation behind passing the executorAdded event to
DAGScheduler? *DAGScheduler* does *submitWaitingStages* when the *executorAdded*
method is called by *TaskSchedulerImpl*. I see an issue in the code below,
*TaskSchedulerImpl.scala code*
if (!executorsByHost.contains(o.host))
Some corrections.
On Fri, Sep 26, 2014 at 5:32 PM, praveen seluka praveen.sel...@gmail.com
wrote:
Can someone explain the motivation behind passing the executorAdded event to
DAGScheduler? *DAGScheduler* does *submitWaitingStages* when the *executorAdded*
method is called by *TaskSchedulerImpl*. I
You can avoid installing Spark on each node by uploading the Spark distribution
tarball to HDFS and setting |spark.executor.uri| to the HDFS location.
That way, Mesos will download and extract the tarball before launching
containers. Please refer to this Spark documentation page
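A rough sketch of the setting in code (the Mesos master URL and HDFS path below are made up):

import org.apache.spark.SparkConf

// Illustrative only: point executors at a tarball already uploaded to HDFS.
val conf = new SparkConf()
  .setMaster("mesos://host:5050")
  .set("spark.executor.uri", "hdfs:///apps/spark/spark-1.1.0-bin-hadoop2.4.tgz")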
Just a quick reply: we cannot start two executors on the same host for a single
application in the standard deployment (one worker per machine).
I'm not sure if it will create an issue when you have multiple workers on the
same host, as submitWaitingStages is called everywhere and I have never tried
In YARN, we can easily have multiple containers allocated on the same node.
On Fri, Sep 26, 2014 at 6:05 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
Just a quick reply: we cannot start two executors on the same host for a
single application in the standard deployment (one worker per machine).
I am unable to run Hive scripts programmatically in Spark 1.1.0 from the Hadoop
prompt, but I can do it manually.
Can anyone help me run Hive scripts programmatically on a Spark 1.1.0
cluster on EMR?
Manual running steps:
hadoop@ip-10-151-71-224:~/tmpSpark/spark1.1/spark$ ./bin/spark-shell
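For the programmatic route, a rough sketch of what this usually looks like from a Spark 1.1 application (assumes an existing SparkContext `sc`; the table and query are made up):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
val rows = hiveContext.sql("SELECT key, value FROM src LIMIT 10").collect()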
This is reasonable, since the constructor that actually gets called is
|Driver()| rather than |Driver(HiveConf)|. The former initializes the
|conf| field by:
|conf = SessionState.get().getConf()
|
And |SessionState.get()| reads a thread-local (TSS) value. Thus executing SQL queries
within another thread causes
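A rough sketch of the failure mode being described (the Hive API usage here is illustrative, not the actual fix):

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.Driver
import org.apache.hadoop.hive.ql.session.SessionState

SessionState.start(new SessionState(new HiveConf()))  // bound to the current thread only

new Thread(new Runnable {
  override def run(): Unit = {
    // Here SessionState.get() returns null, so the no-arg Driver()
    // constructor fails while evaluating SessionState.get().getConf().
    val driver = new Driver()
  }
}).start()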
Hi all,
I apologise for re-posting this; I realise some mail systems are filtering
all the code samples from the original post.
I would greatly appreciate any pointers regarding this issue, which basically
renders Spark Streaming non-fault-tolerant for us.
Thanks in advance,
S
---
I experience
Has anyone else seen this error in task deserialization? The task is
processing a small amount of data and doesn't seem to have much data
hanging off the closure. I've only seen this with Spark 1.1.
Job aborted due to stage failure: Task 975 in stage 8.0 failed 4
times, most recent failure: Lost
If the size of each file is small, you may try
|SparkContext.wholeTextFiles|. Otherwise you can try something like this:
|val filenames: Seq[String] = ...
val combined: RDD[(String, String)] = filenames.map { name =>
  sc.textFile(name).map(line => name -> line)
}.reduce(_ ++ _)
|
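For the small-file case, a one-line sketch (the path is illustrative):

// wholeTextFiles yields (fileName, fileContent) pairs directly.
val perFile = sc.wholeTextFiles("hdfs:///path/to/small-files")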
On 9/26/14
Yes it is. The in-memory storage used with |SchemaRDD| also uses
|RDD.cache()| under the hood.
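A small sketch of the caching call being discussed (the table name is made up; assumes an existing SparkContext `sc`):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// cacheTable keeps the SchemaRDD in memory via RDD.cache(), so it lives in the
// storage space governed by spark.storage.memoryFraction.
sqlContext.cacheTable("bigTable")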
On 9/26/14 4:04 PM, Haopu Wang wrote:
Hi, I'm querying a big table using Spark SQL. I see very long GC time in
some stages. I wonder if I can improve it by tuning the storage
parameter.
The
Hi all,
I am using mapPartitions to do some heavy computing on subsets of the data.
I have a dataset with about 1m rows, running on a 32-core cluster.
Unfortunately, it seems that mapPartitions splits the data into two sets, so
it is only running on two cores.
Is there a way to force it to split
Use RDD.repartition (see here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
).
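A rough sketch (the partition count and the `data`/`heavyComputation` names are placeholders):

// Spread ~1m rows over 32 partitions so every core gets work.
val repartitioned = data.repartition(32)
val results = repartitioned.mapPartitions { iter =>
  iter.map(heavyComputation)
}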
On Fri, Sep 26, 2014 at 10:19 AM, jamborta jambo...@gmail.com wrote:
Hi all,
I am using mappartitions to do some heavy computing on subsets of the data.
I have a dataset with
Yes, I’m running Hadoop’s Timeline server that does this for the YARN/Hadoop
logs (and works very nicely btw). Are you saying I can do the same for the
SparkUI as well? Also, where do I set these Spark configurations since this
will be executed inside a YARN container? On the “client”
Thank you Jey,
That is a nice introduction, but it may be too old (Aug 21st, 2013):
Note: If you keep the schema flat (without nesting), the Parquet files you
create can be read by systems like Shark and Impala. These systems allow you
to query Parquet files as tables using SQL-like syntax.
Hi Matthes,
Can you post an example of your schema? When you refer to nesting, are you
referring to optional columns, nested schemas, or tables where there are
repeated values? Parquet uses run-length encoding to compress down columns with
repeated values, which is the case that your example
I am working on a PR that allows one to send the same Spark listener event
messages back to the application in YARN cluster mode.
So far I have put this function in our application; our UI receives and
displays the same Spark job event messages such as progress, job start,
completed, etc.
Hi all,
I'm using some functions from Breeze in a Spark job but I get the following
build error:
Error:scalac: bad symbolic reference. A signature in RandBasis.class
refers to term math3 in package org.apache.commons which is not available.
It may be completely missing from the current
Hi,
This is the command I am using for submitting my application, SimpleApp:
./bin/spark-submit --class org.apache.spark.examples.SimpleApp
--deploy-mode client --master spark://karthik:7077
$SPARK_HOME/examples/*/scala-*/spark-examples-*.jar /text-data
On Thu, Sep 25, 2014 at 6:52 AM, Tobias
spark-core's dependency on commons-math3 is @ test scope (core/pom.xml):
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
  <version>3.3</version>
  <scope>test</scope>
</dependency>
Adjusting the scope should solve the problem below.
On Fri, Sep
Hi Frank,
thanks a lot for your response, this is very helpful!
Actually I'm trying to figure out whether the current Spark version supports
repetition levels
(https://blog.twitter.com/2013/dremel-made-simple-with-parquet), but now it
looks good to me.
It is very hard to find some good things about
Hi,
This is my first post to the mailing list, so give me some feedback if I do
something wrong.
To do an operation on two RDDs to produce a new one you can just use
zipPartitions, but if I have an arbitrary number of RDDs that I would like
to perform an operation on to produce a single RDD, how do
Thanks Ted. Can you tell me how to adjust the scope?
On Fri, Sep 26, 2014 at 5:47 PM, Ted Yu yuzhih...@gmail.com wrote:
spark-core's dependency on commons-math3 is @ test scope (core/pom.xml):
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-math3</artifactId>
Matthes,
Ah, gotcha! Repeated items in Parquet seem to correspond to the ArrayType in
Spark-SQL. I only use Spark, but it does look like that should be supported in
Spark-SQL 1.1.0. I’m not sure though if you can apply predicates on repeated
items from Spark-SQL.
Regards,
Frank Austin
Shouldn't the user's application depend on commons-math3 if it uses
it? It shouldn't require a Spark change. Maybe I misunderstand.
On Fri, Sep 26, 2014 at 4:47 PM, Ted Yu yuzhih...@gmail.com wrote:
spark-core's dependency on commons-math3 is @ test scope (core/pom.xml):
<dependency>
You can use a scope of runtime.
See
http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope
Cheers
On Fri, Sep 26, 2014 at 8:57 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Thank Ted. Can you tell me how to adjust the scope ?
On Fri, Sep 26,
I solved the problem by including the commons-math3 package in my sbt
dependencies, as Sean suggested. Thanks.
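For reference, a hedged sketch of the corresponding build.sbt line (version 3.3 matches the pom snippet quoted earlier in the thread):

libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"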
On Fri, Sep 26, 2014 at 6:05 PM, Ted Yu yuzhih...@gmail.com wrote:
You can use scope of runtime.
See
Hi,
Did you manage to figure this out? I would appreciate it if you could share the
answer.
I assume you did those things on all machines, not just on the machine
launching the job?
I've seen that workaround used successfully (well, actually, they
copied the library to /usr/lib or something, but same idea).
On Thu, Sep 25, 2014 at 7:45 PM, taqilabon g945...@gmail.com wrote:
You're
I've had multiple jobs crash due to java.io.IOException: unexpected
exception type; I've been running the 1.1 branch for some time and am now
running the 1.1 release binaries. Note that I only use PySpark. I haven't
kept detailed notes or the tracebacks around since there are other problems
that
We removed commons-math3 from the dependencies to avoid a version conflict
with hadoop-common. hadoop-common-2.3+ depends on commons-math3-3.1.1,
while breeze depends on commons-math3-3.3. 3.3 is not backward
compatible with 3.1.1. So we removed it because the breeze functions
we use do not touch
There are numerous ways to combine RDDs. In your case, it seems you have
several RDDs of the same type and you want to do an operation across all of
them as if they were a single RDD. The way to do this is SparkContext.union
or RDD.union, which have minimal overhead. The only difference between
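A quick sketch under those assumptions (the RDD contents are made up; assumes an existing SparkContext `sc`):

import org.apache.spark.rdd.RDD

// Any number of RDDs of the same element type can be unioned in one call.
val rdds: Seq[RDD[Int]] = Seq(sc.parallelize(1 to 3), sc.parallelize(4 to 6), sc.parallelize(7 to 9))
val combined: RDD[Int] = sc.union(rdds)   // equivalently: rdds.reduce(_ union _)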
No, for me as well it is non-deterministic. It happens in a piece of code
that does many filters and counts on a small set of records (~1k-10k). The
original set is persisted in memory and we have a Kryo serializer set for
it. The task itself takes in just a few filtering parameters. This with
FWIW I suspect that each count operation is an opportunity for you to
trigger the bug, and each filter operation increases the likelihood of
setting up the bug. I normally don't come across this error until my job
has been running for an hour or two and had a chance to build up longer
lineages
Hi Davies,
The real issue is about cluster management. I am new to the Spark world and
am not a system administrator. It seems like the problem is with the
spark-ec2 launch script. It is installing an old version of Python.
In the meantime I am trying to figure out how I can manually install the
Are you able to use the regular PySpark shell on your EC2 cluster? That
would be the first thing to confirm is working.
I don’t know whether the version of Python on the cluster would affect
whether IPython works or not, but if you want to try manually upgrading
Python on a cluster launched by
What is the error? Could you file a JIRA for it?
Turns out there are actually 3 separate errors (indicated below), one of
which *silently returns the wrong value to the user*. Should I file a
separate JIRA for each one? What level should I mark these as (critical,
major, etc.)?
I'm not sure
Folks -- we're happy to share the videos of Spark talks made at SF
Scala meetup (sfscala.org) and Scala By the Bay conference
(scalabythebay.org). We thank Databricks for presenting and also
sponsoring the first talk video, which was a joint event with SF Bay
Area Machine Learning meetup.
-- Forwarded message --
From: Liquan Pei liquan...@gmail.com
Date: Fri, Sep 26, 2014 at 1:33 AM
Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by
spark.storage.memoryFraction?
To: Haopu Wang hw...@qilinsoft.com
Hi Haopu,
Internally, cacheTable on a
Hello,
Can someone please explain to me how the various threads within a single worker
(and hence a single JVM instance) communicate with each other. I mean, how do
they send intermediate data/RDDs to each other? Is it done over the network?
Please also point me to the location in the source code where I
Many many thanks
Andy
From: Nicholas Chammas nicholas.cham...@gmail.com
Date: Friday, September 26, 2014 at 11:24 AM
To: Andrew Davidson a...@santacruzintegration.com
Cc: Davies Liu dav...@databricks.com, user@spark.apache.org
user@spark.apache.org
Subject: Re: problem with spark-ec2 launch
Hi,
I was loading data into a partitioned table on Spark 1.1.0
beeline-thriftserver. The table has complex data types such as
map<string,string> and array<map<string,string>>. The query is like "insert
overwrite table a partition (…) select …" and the select clause worked if run
separately. However,
It might be a problem when inserting into a partitioned table. It worked
fine when the target table was unpartitioned.
Can you confirm this?
Thanks,
Du
On 9/26/14, 4:48 PM, Du Li l...@yahoo-inc.com.INVALID wrote:
Hi,
I was loading data into a partitioned table on Spark 1.1.0
the receiver is not running on the machine I expect
2014-09-26 14:09 GMT+08:00 Sean Owen so...@cloudera.com:
I think you may be missing a key word here. Are you saying that the machine
has multiple interfaces and it is not using the one you expect or the
receiver is not running on the
Would you mind providing the DDL of this partitioned table together
with the query you tried? The stack trace suggests that the query was
trying to cast a map into something else, which is not supported in
Spark SQL. And I doubt whether Hive supports casting a complex type to
some other type.
This fix is reasonable, since the constructor that actually gets called is
|Driver()| rather than |Driver(HiveConf)|. The former initializes the
|conf| field by:
|conf = SessionState.get().getConf()
|
And |SessionState.get()| reads a thread-local (TSS) value. Thus executing SQL queries
within another thread