Hi,
I am building a Mesos cluster for the purpose of running Spark
workloads (in addition to other frameworks). I am under the
impression that it is preferable/recommended to run the HDFS datanode
process and the Spark slave on the same physical node.
Are you using DSE Spark? If so, are you pointing the Spark Job Server to use DSE
Spark?
Thanks and Regards
Noorul
Anand anand.vi...@monotype.com writes:
I am new to the Spark world and to Job Server.
My code:
package spark.jobserver
import java.nio.ByteBuffer
import
I renamed spark-defaults.conf.template to spark-defaults.conf
and invoked
spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh
But I still get
failed to launch org.apache.spark.deploy.worker.Worker:
--properties-file FILE Path to a custom Spark properties file.
How can one disable partition discovery in Spark 1.3.0 when using
sqlContext.parquetFile?
Alternatively, is there a way to load .parquet files without partition
discovery?
Cosmin
FYI,
Parquet schema output:
message pig_schema {
  optional binary cust_id (UTF8);
  optional int32 part_num;
  optional group ip_list (LIST) {
    repeated group ip_t {
      optional binary ip (UTF8);
    }
  }
  optional group vid_list (LIST) {
    repeated group vid_t {
      optional binary
Hi everyone,
I know that any RDD is tied to its SparkContext and the associated
variables (broadcast variables, accumulators), but I'm looking for a way to
serialize/deserialize full RDD computations.
@rxin: Spark SQL is, in a way, already doing this, but the parsers are
private[sql]; is there any way to
I was able to get it working. Instead of using customers.flatMap to return
alerts, I had to use the following:
customers.foreachRDD(new Function<JavaPairRDD<String,
Iterable<QueueEvent>>, Void>() {
@Override
public Void call(final JavaPairRDD<String,
Iterable<QueueEvent>>
You should distribute your configuration file to workers and set the
appropriate environment variables, like HADOOP_HOME, SPARK_HOME,
HADOOP_CONF_DIR, SPARK_CONF_DIR.
On Mon, Apr 27, 2015 at 12:56 PM James King jakwebin...@gmail.com wrote:
I renamed spark-defaults.conf.template to
Hi all, following the
import com.datastax.spark.connector.SelectableColumnRef;
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import org.apache.spark.sql.SchemaRDD;
import static com.datastax.spark.connector.util.JavaApiHelper.toScalaSeq;
import scala.collection.Seq;
SchemaRDD
I was able to fix the issues by providing the right versions of the cassandra-all
and thrift libraries.
Hi,
I'm trying, after reducing by key, to get data ordered among partitions
(like RangePartitioner) and within partitions (like sortByKey or
repartitionAndSortWithinPartitions), pushing the sorting down into the
shuffle machinery of the reduce phase.
I think, but maybe I'm wrong, that the correct
Hi, and what can I do when I am on Windows?
It does not allow me to set the hostname to some IP
Thanks,
Tim
This is available for 1.3.1:
http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10
FYI
On Mon, Feb 16, 2015 at 7:24 AM, Marco marco@gmail.com wrote:
Ok, so will it be only available for the next version (1.3.0)?
2015-02-16 15:24 GMT+01:00 Ted Yu
It's not related to Spark, but to the concept of what you are trying to do
with the data. Grouping by ID means consolidating the data for each ID down to
one row per ID. You can sort by time after this point, yes, but you would need
to either take each ID and time value pair OR do some aggregate operation
With the Python API, the available arguments I got (using the inspect module) are
the following:
['cls', 'data', 'iterations', 'step', 'miniBatchFraction', 'initialWeights',
'regParam', 'regType', 'intercept']
numClasses is not available. Can someone comment on this?
Thanks,
Hi,
I am trying to answer a simple query with Spark SQL over a Parquet file.
When executing the query several times, the first run takes about 2s
while later runs take about 0.1s.
Looking at the log file, it seems the later runs don't load the data
from disk. However, I didn't enable
+user
-- Forwarded message --
From: Burak Yavuz brk...@gmail.com
Date: Mon, Apr 27, 2015 at 1:59 PM
Subject: Re: Change ivy cache for spark on Windows
To: mj jone...@gmail.com
Hi,
In your conf file (SPARK_HOME\conf\spark-defaults.conf) you can set:
`spark.jars.ivy \your\path`
Michael,
There is only one schema: both versions have 200 string columns in one file.
On Mon, Apr 20, 2015 at 9:08 AM, Evo Eftimov evo.efti...@isecc.com wrote:
Now this is very important:
“Normal RDDs” refers to “batch RDDs”. However, the default in-memory
serialization of RDDs which are
Hi,
I'm having trouble using the --packages option for spark-shell.cmd. I have
to use Windows at work and have been issued a username with a space in it,
which means that when I use the --packages option it fails with this message:
Exception in thread "main" java.net.URISyntaxException: Illegal
Hi All,
I want to ask how to use a UDF with the join function on DataFrames. It
always gives me a "cannot resolve the column name" error.
Can anyone give me an example of how to do this in Java?
My code is like:
edmData.join(yb_lookup,
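A rough Scala sketch of the pattern I am after (the Java API has the same join and udf methods; "cust_id" is only a placeholder column name, and the UDF is just an example):
import org.apache.spark.sql.functions.udf
// Qualify each column with the DataFrame it comes from so Spark can resolve it.
val joined = edmData.join(yb_lookup, edmData("cust_id") === yb_lookup("cust_id"))
// A UDF can then be applied to columns of the joined result.
val toUpper = udf((s: String) => s.toUpperCase)
joined.select(toUpper(edmData("cust_id"))).show()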
The code itself is the recipe, no?
On Mon, Apr 27, 2015 at 2:49 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
Hi everyone,
I know that any RDD is related to its SparkContext and the associated
variables (broadcast, accumulators), but I'm looking for a way to
Hi All,
Basically I try to define a simple UDF and use it in the query, but it gives
me "Task not serializable".
public void test() {
RiskGroupModelDefinition model =
registeredRiskGroupMap.get(this.modelId);
RiskGroupModelDefinition edm =
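One workaround I have seen suggested (a Scala sketch only; it assumes RiskGroupModelDefinition itself is serializable, and lookup stands in for whatever the UDF actually computes) is to copy the needed value into a local variable so the closure captures that value rather than `this`:
// Capture only the value the UDF needs, not the enclosing (non-serializable) object.
val localModel = registeredRiskGroupMap.get(this.modelId)
sqlContext.udf.register("riskGroup", (id: String) => localModel.lookup(id))
If the model class itself is not serializable, it would additionally need to be made Serializable or replaced with a serializable/broadcast form.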
Thanks, it should be
“select id, time, min(x1), min(x2), … from data group by id, time order by time”
(“min” or other aggregate function to pick other fields)
Forgot to mention that (id, time) is my primary key; I took it for granted
because it worked in my MySQL example.
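Spelled out as code, a sketch of the call (x1 and x2 stand in for the remaining columns):
val results = sqlContext.sql(
  "select id, time, min(x1), min(x2) from data group by id, time order by time")
results.show()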
Best regards, Alexander
You need to look deeper into your worker logs; if you look closely you may
find GC errors, IO exceptions, etc. that are triggering the timeout.
Thanks
Best Regards
On Mon, Apr 27, 2015 at 3:18 AM, Deepak Gopalakrishnan dgk...@gmail.com
wrote:
Hello Patrick,
Sure. I've posted this on user as
Spark 1.3
1. Viewing stderr/stdout from an executor in the Web UI: while the job is running
I identified the executor I am supposed to look at, but those two links show 4
special characters in the browser.
2. Tailing the YARN logs:
/apache/hadoop/bin/yarn logs -applicationId
application_1429087638744_151059 |
Isn't it already available on the driver UI (that runs on port 4040)?
Thanks
Best Regards
On Mon, Apr 27, 2015 at 9:55 AM, Wenlei Xie wenlei@gmail.com wrote:
Hi,
I am wondering how we should understand the running time of Spark SQL
queries? For example, the physical query plan and the running
The answer is: it depends :)
The fact that query runtime increases indicates more shuffle. You may want
to construct RDDs based on the keys you use.
You may want to specify what kind of node you are using and how many
executors you are using. You may also want to play around with executor
memory
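To make the "construct RDDs based on the keys you use" point concrete, a rough sketch (rows, the id field, and the 200-partition count are all placeholders):
import org.apache.spark.HashPartitioner
// Pre-partition by the grouping key once and cache the result; later reduceByKey /
// groupByKey calls that keep the same partitioner will not trigger a full shuffle.
val keyed = rows.map(r => (r.id, r))
  .partitionBy(new HashPartitioner(200))
  .cache()
val perKeyCounts = keyed.mapValues(_ => 1L).reduceByKey(_ + _)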
Hi Everyone,
We use pcap4j to capture network packets and then use Spark Streaming to
analyze the captured packets. However, we ran into a strange problem.
If we run our application on Spark locally (for example, spark-submit
--master local[2]), then the program runs successfully.
If we run our
1) Application container logs from the RM Web UI never load in the browser. I
eventually have to kill the browser.
2) /apache/hadoop/bin/yarn logs -applicationId
application_1429087638744_151059
| less emits logs only after the application has completed.
Are there no better ways to see the logs as they
Hi,
I am a graduate student from Virginia Tech (USA) pursuing my Master's in
Computer Science. I've been researching parallel and distributed databases
and their performance for running some range queries involving simple joins and
group by on large datasets. As part of my research, I tried
You can check container logs from the RM web UI or, when log aggregation is
enabled, with the yarn command. There are other, but less convenient, options.
On Mon, Apr 27, 2015 at 8:53 AM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Spark 1.3
1. View stderr/stdout from executor from Web UI: when the job
Hi Marco,
As far as I know, the current combineByKey() does not expose an argument
for setting keyOrdering on the ShuffledRDD, since ShuffledRDD is
package private. If you can get at the ShuffledRDD through reflection or some other
way, the keyOrdering you set will be pushed down to the shuffle. If you
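If reflection feels too fragile, another route (a sketch only; reduced is a placeholder for the pair RDD produced by the reduce step, and it assumes the key type has an Ordering; the partition count is arbitrary) is to let a RangePartitioner order data across partitions and sort within each partition during the same shuffle:
import org.apache.spark.RangePartitioner
// Range-partitions by key (ordered across partitions) and sorts keys inside each
// partition as part of the shuffle, instead of in a separate pass.
val partitioner = new RangePartitioner(100, reduced)
val globallyOrdered = reduced.repartitionAndSortWithinPartitions(partitioner)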
Thanks.
I've set SPARK_HOME and SPARK_CONF_DIR appropriately in .bash_profile,
but when I start the worker like this:
spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh
I still get
failed to launch org.apache.spark.deploy.worker.Worker:
Default is conf/spark-defaults.conf.
Hi all,
I have just come across a problem where I have a table that has a few bigint
columns. It seems that if I read that table into a DataFrame and then collect it in
PySpark, the bigints are stored as integers in Python.
(The problem is that if I write it back to another table, I detect the Hive type
A short update: eventually we manually upgraded to 1.3.1 and the problem
was fixed.
On Apr 26, 2015 2:26 PM, Ophir Cohen oph...@gmail.com wrote:
I happened to hit the following issue that prevents me from using UDFs
with case classes: https://issues.apache.org/jira/browse/SPARK-6054.
The issue
Hi, all:
I use the function updateStateByKey in Spark Streaming. I need to store the states
for one minute, so I set spark.cleaner.ttl to 120 (the batch duration is 2 seconds),
but it throws an exception:
Caused by:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does
not exist:
Which Hadoop release are you using?
Can you check the HDFS audit log to see who deleted spark/ck/hdfsaudit/
receivedData/0/log-1430139541443-1430139601443, and when?
Cheers
On Mon, Apr 27, 2015 at 6:21 AM, Sea 261810...@qq.com wrote:
Hi, all:
I use function updateStateByKey in Spark Streaming, I
This is a Hadoop-side stack trace. It looks like the code is trying to get the
filesystem permissions by running
%HADOOP_HOME%\bin\WINUTILS.EXE ls -F
and something is triggering a null pointer exception.
There isn't any HADOOP- JIRA with this specific stack trace in it, so it's not
a
I set it to 240; it happens again when 240 seconds are reached.
------------------ Original message ------------------
From: 261810726 <261810...@qq.com>
Date: Mon, Apr 27, 2015 10:24
To: Ted Yu <yuzhih...@gmail.com>
Subject: Re: Exception in using updateStateByKey
Yes, I can make
Thanks. May I know your configuration for more/smaller executors on
r3.8xlarge, i.e. how much memory you eventually decided to give each
executor without impacting performance (for example, 64g)?
From: Sven Krasser [mailto:kras...@gmail.com]
Sent: Friday, April 24, 2015 1:59
I face a similar issue in Spark 1.2. Caching the schema RDD takes about 50s
for 400MB of data. The schema is similar to the TPC-H LineItem.
Here is the code I use for the cache. I am wondering if there is any setting
missing?
Thank you so much!
lineitemSchemaRDD.registerTempTable("lineitem");
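For reference, my understanding of the usual columnar-cache sequence is roughly the following (a sketch; the count query is only there to force materialization):
lineitemSchemaRDD.registerTempTable("lineitem")
sqlContext.cacheTable("lineitem")                          // in-memory columnar cache
sqlContext.sql("select count(*) from lineitem").collect()  // first action materializes the cache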
Hi,
I am frequently asked why Spark is also much faster than Hadoop MapReduce on
disk (without the use of a memory cache). I have no convincing answer for this
question; could you guys elaborate on this? Thanks!
I believe the typical answer is that Spark is actually a bit slower.
On Mon, Apr 27, 2015 at 7:34 PM bit1...@163.com bit1...@163.com wrote:
Hi,
I am frequently asked why Spark is also much faster than Hadoop MapReduce
on disk (without the use of a memory cache). I have no convincing answer for
https://issues.apache.org/jira/browse/SPARK-7182
Can anyone suggest a workaround for the above issue?
Thanks.
-Don
--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
It works on a smaller dataset of 100 rows. I could probably find the size at which
it fails using binary search. However, that would not help me because I need to
work with 2B rows.
From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Monday, April 27, 2015 6:58 PM
To: Ulanov, Alexander
Cc:
Spark keeps job data in memory by default, for the kind of performance gains you are
seeing. Additionally, depending on your query, Spark runs stages, and at any
point in time Spark's code behind the scenes may issue an explicit cache. If
you hit any such scenario you will find those cached objects in the UI under
Is it? I learned elsewhere that Spark is 5-10 times faster than
Hadoop MapReduce.
bit1...@163.com
From: Ilya Ganelin
Date: 2015-04-28 10:55
To: bit1...@163.com; user
Subject: Re: Why Spark is much faster than Hadoop MapReduce even on disk
I believe the typical answer is that
Hi Forum,
I am facing the compile error below when using the fileStream method of the
JavaStreamingContext class.
I copied the code from the JavaAPISuite.java test class of the Spark test code.
Please help me find a solution for this.
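For reference, in the Scala API the same call takes the key, value and input-format classes as type parameters (a sketch only; ssc stands for the StreamingContext and the path is made up):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
// Key, value and input-format classes are supplied as type parameters.
val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///incoming/dir")
val lines = stream.map { case (_, text) => text.toString }   // drop the key, keep the line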
Hi
Can you test on a smaller dataset to identify whether it is a cluster issue or
a scaling issue in Spark?
On 28 Apr 2015 11:30, Ulanov, Alexander alexander.ula...@hp.com wrote:
Hi,
I am running a group by on a dataset of 2B rows of RDD[Row[id, time, value]]
in Spark 1.3 as follows:
“select id,
I am running the following code on Spark 1.3.0. It is from
https://spark.apache.org/docs/1.3.0/ml-guide.html
On running val model1 = lr.fit(training.toDF), I get
java.lang.UnsupportedOperationException: empty collection.
What could be the reason?
import org.apache.spark.{SparkConf,
Hi
I am getting the following error when persisting an RDD in Parquet format to
an S3 location. This code was working in version 1.2; the version in which it
fails is 1.3.1.
Any help is appreciated.
Caused by: java.lang.AssertionError: assertion failed: Conflicting
http://www.datascienceassn.org/content/making-sense-making-sense-performance-data-analytics-frameworks
From: bit1...@163.com bit1...@163.com
To: user user@spark.apache.org
Sent: Monday, April 27, 2015 8:33 PM
Subject: Why Spark is much faster than Hadoop MapReduce even on disk
Similar to what Dean called out, we built Puppet manifests so we could do
the automation - it's a bit of work to set up, but well worth the effort.
On Fri, Apr 24, 2015 at 11:27 AM Dean Wampler deanwamp...@gmail.com wrote:
It's mostly manual. You could try automating with something like Chef, of
Hello,
I am trying to use the default Spark cluster manager in a production
environment. I will be submitting jobs with spark-submit. I wonder if the
following is possible:
1. Get the Driver ID from spark-submit. We will use this ID to keep track
of the job and kill it if necessary.
2. Whether
Maybe I found the solution: do not set 'spark.cleaner.ttl'; just use the function
'remember' in StreamingContext to set the rememberDuration.
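If that turns out to be right, a minimal sketch (assuming the 2-second batch interval mentioned earlier; sparkConf is a placeholder SparkConf) would be:
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.remember(Minutes(2))   // keep generated RDDs for 2 minutes instead of using spark.cleaner.ttl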
------------------ Original message ------------------
From: Ted Yu <yuzhih...@gmail.com>
Date: Mon, Apr 27, 2015 10:20
To: Sea <261810...@qq.com>
Hi,
I have been trying out the Spark data source API with JDBC. The following is
the code to get a DataFrame:
Try(hc.load("org.apache.spark.sql.jdbc", Map("url" -> dbUrl, "dbtable" -> s"($query)")))
By looking at the test cases, I found that the query has to be inside brackets,
otherwise it's treated as a table name.
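In other words, a sketch of what seems to work (whether the subquery needs an alias such as "AS tmp" depends on the database):
// Parentheses make the JDBC source treat dbtable as a derived table instead of a table name.
val df = hc.load("org.apache.spark.sql.jdbc",
  Map("url" -> dbUrl, "dbtable" -> s"($query) AS tmp"))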
I downloaded the 1.3.1 Hadoop 2.4 prebuilt package (tar) from multiple mirrors
and from the direct link. Each time I untar it I get the error below:
spark-1.3.1-bin-hadoop2.4/lib/spark-assembly-1.3.1-hadoop2.4.0.jar: (Empty
error message)
tar: Error exit delayed from previous errors
Is it broken?
--
Deepak
Marco - why do you want data sorted both within and across partitions? If you
need to take an ordered sequence across all your data you need to either
aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an
ordered index to your data that matches the order it was stored on
Hi guys,
I am running some SQL queries, but all my tasks are reported as either
NODE_LOCAL or PROCESS_LOCAL.
In the Hadoop world, the reduce tasks are RACK or NON_RACK local because
they have to aggregate data from multiple hosts. However, in Spark even the
aggregation stages are reported
What command are you using to untar? Are you running out of disk space?
Sent with Good (www.good.com)
-Original Message-
From: ÐΞ€ρ@Ҝ (๏̯͡๏) [mailto:deepuj...@gmail.com]
Sent: Monday, April 27, 2015 11:44 AM Eastern Standard Time
To: user
Subject: Spark 1.3.1 Hadoop
Hello All,
I dug a little deeper and found this error:
15/04/27 16:05:39 WARN TransportChannelHandler: Exception in
connection from /10.1.0.90:40590
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at
Hi Q,
fold and reduce both aggregate over a collection by applying an
operation you specify; the major difference is the starting point of the
aggregation. For fold(), you have to specify the starting value, and for
reduce() the starting value is the first (or possibly an arbitrary) element
in
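A tiny illustration (note that fold's zero value is applied once per partition and once more when merging the partial results, so it should be a neutral element):
val nums = sc.parallelize(1 to 5)
nums.reduce(_ + _)    // 15: combines elements pairwise, no explicit starting value
nums.fold(0)(_ + _)   // 15: 0 is the starting value for each partition and for the final merge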
How did you run the example app? Did you use spark-submit? -Xiangrui
On Thu, Apr 23, 2015 at 2:27 PM, Su She suhsheka...@gmail.com wrote:
Sorry, accidentally sent the last email before finishing.
I had asked this question before, but wanted to ask again as I think
it is now related to my pom
Could you try different ranks and see whether the task size changes?
We do use YtY in the closure, which should work the same as broadcast.
If that is the case, it should be safe to ignore this warning.
-Xiangrui
On Thu, Apr 23, 2015 at 4:52 AM, Christian S. Perone
christian.per...@gmail.com
Works fine for me. Make sure you're not downloading the HTML
redirector page and thinking it's the archive.
On Mon, Apr 27, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar) from multiple mirrors
and direct link. Each time i untar i
You might need to specify driver memory in spark-submit instead of
passing JVM options. spark-submit is designed to handle different
deployments correctly. -Xiangrui
On Thu, Apr 23, 2015 at 4:58 AM, Rok Roskar rokros...@gmail.com wrote:
ok yes, I think I have narrowed it down to being a problem
Hi,
Would you mind adding our company to the Powered By Spark page?
organization name: atp
URL: https://atp.io
a list of which Spark components you are using: SparkSQL, MLLib, Databricks
Cloud
and a short description of your use case: Predictive models and
learning algorithms to improve the
Hi There,
I am using a Spark SQL left outer join query.
The SQL query is:
scala> val test = sqlContext.sql("SELECT e.departmentID FROM employee e LEFT
OUTER JOIN department d ON d.departmentId = e.departmentId").toDF()
In Spark 1.3.1 it works fine, but the latest pull gives the error below
Hello, everyone.
I am developing a streaming application working with window functions - each
window creates a table and performs some SQL operations on the extracted data.
I ran into this problem: when using window operations and checkpointing, the
application does not start the next time.
Here is the code:
Suppose I have something like the code below:
for idx in xrange(0, 10):
    train_test_split = training.randomSplit(weights=[0.75, 0.25])
    train_cv = train_test_split[0]
    test_cv = train_test_split[1]
    # scale train_cv and test_cv by scaling
Hello Xiangrui,
I am using this spark-submit command (as I do for all other jobs):
/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit
--class MLlib --master local[2] --jars $(echo
/home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',')
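If the suggestion is to move driver memory onto the spark-submit command line, I assume it would just be an extra flag on the same command, e.g. (the 4g figure is only an example):
/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit --class MLlib --master local[2] --driver-memory 4g --jars ...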
I have this Spark App and it fails when I run it with 6 executors but succeeds
with 8.
Any suggestions?
Command:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
Hi,
While connecting to Accumulo through Spark by making a Spark RDD, I am
getting the following error:
object not serializable (class: org.apache.accumulo.core.data.Key)
This is due to the Key class of Accumulo, which does not implement the
Serializable interface. How can it be solved, and accumulo
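One workaround that comes to mind (a sketch only; accumuloRdd is a placeholder for the usual (Key, Value) pair RDD produced from AccumuloInputFormat) is to map the Accumulo types to plain serializable values before any shuffle or collect; registering custom Kryo serializers for Key/Value would be another option.
// Convert the non-serializable Accumulo Key/Value into plain types right away.
val rows = accumuloRdd.map { case (key, value) =>
  (key.getRow.toString, value.get())   // row id as String, cell contents as Array[Byte]
}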
Can you post the code? Maybe the problem is in the foreachRDD body. :)
On Tuesday, April 28, 2015, CH.KMVPRASAD [via Apache Spark User List]
ml-node+s1001560n22681...@n3.nabble.com wrote:
When I run the Spark Streaming application, the print method is printing the result
fine, but I used foreachRDD on that
We will try to make them available in 1.4, which is coming soon. -Xiangrui
On Thu, Apr 23, 2015 at 10:18 PM, Pagliari, Roberto
rpagli...@appcomsci.com wrote:
I know grid search with cross validation is not supported. However, I was
wondering if there is something available for the time being.
So I installed Spark 1.3.1 built with Hadoop 2.6 on each of the slaves; I just
basically got the pre-built package from the Spark website…
I placed those Spark installs on each slave at /opt/spark.
My Spark properties seem to be getting picked up fine on my side…
Hi,
Could you suggest the best way to do "group by x order by y" in Spark?
When I try to perform it with Spark SQL I get the following error (Spark 1.3):
val results = sqlContext.sql("select * from sample group by id order by time")
org.apache.spark.sql.AnalysisException: expression 'time'
Hi,
that error seems to indicate the basic query is not properly expressed. If
you group by just ID, then that means it would need to aggregate all the
time values into one value per ID, so you can't sort by it. Thus it tries
to suggest an aggregate function for time so you can have 1 value per
Hi Richard,
There are several values of time per id. Is there a way to perform group by id
and sort by time in Spark?
Best regards, Alexander
From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Monday, April 27, 2015 12:20 PM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject:
Hi Everyone!
I'm trying to understand how Spark's cache works.
Here is my naive understanding; please let me know if I'm missing something:
val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.saveAsTextFile(...)