Can you try running the spark-shell in yarn-client mode?
./bin/spark-shell --master yarn-client
Read more over here http://spark.apache.org/docs/1.0.0/running-on-yarn.html
Thanks
Best Regards
On Sun, Sep 28, 2014 at 7:08 AM, jamborta jambo...@gmail.com wrote:
hi all,
I have a job that
Hi All,
After some hair pulling, I've reached the realisation that an operation I
am currently doing via:
myRDD.groupByKey.mapValues(func)
should be done more efficiently using aggregateByKey or combineByKey. Both
of these methods would do, and they seem very similar to me in terms of
their
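For illustration, a minimal sketch (toy data, illustrative names) of a per-key mean written both ways; the aggregateByKey version shuffles only one (sum, count) pair per key per partition instead of every value:

    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    // groupByKey version: every value crosses the network.
    val mean1 = pairs.groupByKey.mapValues(vs => vs.sum / vs.size)

    // aggregateByKey version: combines map-side before the shuffle.
    val mean2 = pairs
      .aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1),   // fold a value into the accumulator
        (a, b)   => (a._1 + b._1, a._2 + b._2)) // merge two partial accumulators
      .mapValues { case (sum, n) => sum / n }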
Hi Krishna,
Some planned features for MLlib 1.2 can be found via Spark JIRA:
http://bit.ly/1ywotkm , though this list is not fixed. The feature
freeze will happen by the end of Oct. Then we will cut branch-1.2 and
start QA. I don't recommend using branch-1.2 for a hands-on tutorial
around Oct 29th.
The test accuracy doesn't capture the total loss. Any threshold in (-1,
1) separates the points -1 and +1 and gives you 1.0 accuracy, but the
corresponding losses are different. -Xiangrui
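A small numeric illustration of this point (assumed toy data, not from the thread): both weights below classify x = -1 and x = +1 perfectly, i.e. accuracy 1.0, but the logistic (negative log-likelihood) loss differs:

    // Total logistic loss of weight w on two points: (x = -1, y = 0) and (x = +1, y = 1).
    def logisticLoss(w: Double): Double =
      Seq((-1.0, 0.0), (1.0, 1.0)).map { case (x, y) =>
        val p = 1.0 / (1.0 + math.exp(-w * x))  // predicted P(y = 1)
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
      }.sum

    logisticLoss(1.0)  // ~0.63  -- accuracy 1.0
    logisticLoss(5.0)  // ~0.013 -- same accuracy, much lower loss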
On Sun, Sep 28, 2014 at 2:48 AM, Yanbo Liang yanboha...@gmail.com wrote:
Hi
We have used LogisticRegression
Hi Dave,
You can replace groupByKey with reduceByKey to improve performance in some
cases. reduceByKey performs a map-side combine, which can reduce network IO
and shuffle size, whereas groupByKey does not perform a map-side combine.
combineByKey is more general than aggregateByKey. Actually, the
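A quick sketch of the map-side combine difference with a toy word count (illustrative data):

    val words = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1))

    val viaReduce = words.reduceByKey(_ + _)          // pre-aggregates within each partition
    val viaGroup  = words.groupByKey.mapValues(_.sum) // ships every single 1 across the network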
Can you check the loss of both the LBFGS and SGD implementations? One
reason may be that SGD doesn't converge well, and you can see that by
comparing the two log-likelihoods. Another potential reason may be that the
labels of your training data are totally separable, so you can always
increase the log-likelihood by
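A hedged sketch (Spark 1.x MLlib) of training with both optimizers on the same data and comparing log-loss rather than accuracy; data: RDD[LabeledPoint] and the iteration count are assumptions:

    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}

    val sgdModel   = LogisticRegressionWithSGD.train(data, 200)
    val lbfgsModel = new LogisticRegressionWithLBFGS().run(data)

    // Mean log-loss of a model; clearThreshold() makes predict return raw probabilities.
    def logLoss(m: LogisticRegressionModel): Double = {
      m.clearThreshold()
      data.map { lp =>
        val p = m.predict(lp.features)
        if (lp.label > 0.5) -math.log(p) else -math.log(1.0 - p)
      }.mean()
    }

    println(s"SGD log-loss: ${logLoss(sgdModel)}, LBFGS log-loss: ${logLoss(lbfgsModel)}")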
Thanks Liquan, that was really helpful.
On Mon, Sep 29, 2014 at 5:54 PM, Liquan Pei liquan...@gmail.com wrote:
Hi Dave,
You can replace groupByKey with reduceByKey to improve performance in some
cases. reduceByKey performs a map-side combine, which can reduce network IO
and shuffle size where
Hi all,
I had some of my queries run on 1.1.0-SNAPSHOT at commit b1b20301 (Aug 24), but
in the current master branch my queries do not work. I looked into the stderr
file on the executor and found the following lines:
14/09/26 16:52:46 ERROR nio.NioBlockTransferService: Exception handling buffer
Heya,
I started to port the scala-notebook to Spark some weeks ago (doing it
in my spare time and for my Spark talks ^^). It's a WIP but works quite
fine for the moment; you can check my fork and branch over here:
https://github.com/andypetrella/scala-notebook/tree/spark
Feel free to ask any
Hi,
There is a project called Zeppelin.
You can check it out here:
https://github.com/NFLabs/zeppelin
The homepage is here:
http://zeppelin-project.org/
It's a notebook-style tool (like the Databricks demo, scala-notebook) with a
nice UI and built-in Spark integration.
It's in active development, so don't
Cool!!! I'll give it a try ASAP!
aℕdy ℙetrella
about.me/noootsab
On Mon, Sep 29, 2014 at 10:48 AM, moon soo Lee leemoon...@gmail.com wrote:
Hi,
There is a project called Zeppelin.
You can check it out here
However (I must say ^^), it's funny that it has been built using
plain old Java stuff :-D.
aℕdy ℙetrella
about.me/noootsab
On Mon, Sep 29, 2014 at 10:51 AM, andy petrella andy.petre...@gmail.com
wrote:
Cool!!! I'll give it
There's a little history behind using Java.
Any feedback is warmly welcome.
On Mon, Sep 29, 2014 at 5:57 PM, andy petrella andy.petre...@gmail.com
wrote:
However (I must say ^^), it's funny that it has been built using
plain old Java stuff :-D.
aℕdy ℙetrella
about.me/noootsab
Thank you for all your patient responses.
I can conclude that if the data is totally separable or overfitting occurs,
the weights may differ.
This is also consistent with my experiments.
I evaluated two different datasets, with the results as follows:
Loss function: LogisticGradient
Regularizer: L2
Hi all,
Regarding a post here a few months ago
http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-tp6240.html
Is there an answer to this?
I saw workers still active and not reconnecting after they lost
connection to the
Hi:
I want to use SVD in my work. I tried some examples and have some
confusion. The input is the 4x3 matrix below:
2 0 0
0 3 2
0 3 1
2 0 3
My input file, in (row column value) coordinate form corresponding to the
matrix, is as follows:
0 0 2
1 1 3
1 2
The RDD you define has no particular ordering. So the order that you
encounter the elements (rows) with an operation like take or collect
isn't defined. You can try to sort the RDD by the row number before
that key is discarded.
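A hedged sketch of that suggestion, assuming the (row column value) input format from the earlier message (file name illustrative): keep the row index as the key and sort by it before it is discarded.

    val rows = sc.textFile("matrix_coords.txt")
      .map { line =>
        val Array(i, j, v) = line.trim.split("\\s+")
        (i.toLong, (j.toInt, v.toDouble))   // (row, (col, value))
      }
      .groupByKey()   // gather the entries of each row
      .sortByKey()    // restore row order before dropping the index
      .values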
On Mon, Sep 29, 2014 at 2:58 PM, buring qyqb...@gmail.com wrote:
Hello,
I'm exploring SparkSQL and I'm facing issue while using the queries. Any
help on this is appreciated.
I have the following schema once loaded as RDD.
root
|-- data: array (nullable = true)
||-- element: struct (containsNull = false)
|||-- age: integer (nullable = true)
|
In your case, the table has only one row, whose content is “data”,
which is an array. You need something like SELECT data[0].name FROM
json_table to access the name field.
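A minimal sketch (Spark 1.1 SQL) of that suggestion; the file name and table name are illustrative:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val people = sqlContext.jsonFile("people.json")   // one row holding a data array of structs
    people.registerTempTable("json_table")
    sqlContext.sql("SELECT data[0].name FROM json_table").collect().foreach(println)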
On 9/29/14 11:08 PM, vdiwakar.malladi wrote:
Hello,
I'm exploring Spark SQL and I'm facing an issue while using the
Thanks for the reply.
As I mentioned above, everything works in yarn-client mode; the problem
starts when I try to run it in yarn-cluster mode.
(It seems that spark-shell does not work in yarn-cluster mode, so I cannot
debug that way.)
On Mon, Sep 29, 2014 at 7:30 AM, Akhil Das ak...@sigmoidanalytics.com
Thank you so much guys for helping me, but I have some more questions
about it!
Do we have to presort the columns to get the benefits of run-length
encoding, or do I have to group the data first and wrap it into a case class?
I tried sorting the data first and writing it out, and I get different
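On the presort question, a hedged sketch (Spark 1.1 APIs; case class, column, and path are illustrative): sorting on a low-cardinality column before writing makes equal values adjacent, which is what run-length encoding can exploit.

    case class Record(category: String, value: Int)

    import sqlContext.createSchemaRDD   // sqlContext: an existing SQLContext (assumed)
    val records = sc.parallelize(Seq(Record("a", 1), Record("b", 2), Record("a", 3)))
    records.sortBy(_.category)          // presort so equal categories form runs
           .saveAsParquetFile("records.parquet")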
Hi,
I have an error when submitting a Spark SQL application to our Spark
cluster:
14/09/29 16:02:11 WARN scheduler.TaskSetManager: Loss was due to
java.lang.NoClassDefFoundError
java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobConf
at
Hi,
Do all Spark Streaming applications use the map operation? Or the majority
of them?
Thanks.
Hi Saiph,
map is used for transformations on your input RDD. If you don't need to
transform your input, you don't need to use map.
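For what it's worth, a tiny sketch (ssc, host, and port are placeholder assumptions):

    val lines = ssc.socketTextStream("localhost", 9999)  // ssc: an existing StreamingContext
    val lengths  = lines.map(_.length)       // map: transforms each record
    val nonEmpty = lines.filter(_.nonEmpty)  // no map needed: records pass through unchanged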
Thanks,
Liquan
On Mon, Sep 29, 2014 at 10:15 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
Hi,
Do all Spark Streaming applications use the map operation? Or
Hello,
I have written a standalone spark job which I run through Ooyala Job
Server. The program is working correctly; now I'm looking into how to
optimize it.
My program without optimization took 4 hours to run. The first optimization
of the KryoSerializer and compiling the regex pattern and reusing
Hi Nicholas,
Yes, out of the box PySpark works. My problem is that I am using the IPython
notebook and matplotlib is not found. It seems that out of the box the cluster
has an old version of Python and the IPython notebook. It was suggested I
upgrade IPython because the new version includes matplotlib. This
Hi,
I'm currently investigating whether it's possible in Spark Streaming to send
acks back to RabbitMQ after a message has gone through the processing
pipeline. The problem is that the Receiver is the one that has the RabbitMQ
channel open for receiving messages, but due to reliability concerns
Thanks for your prompt response.
On a further note, I'm getting an exception while executing the query:
SELECT data[0].name FROM people where data[0].age = 13
Exception in thread main java.lang.RuntimeException: [1.46] failure:
``UNION'' expected but identifier .age found
SELECT
You will also need to run 'ldconfig' on each host to read the ld.so.conf file
and make it active. You might also need to stop Spark (the JVM) on each
node to cause the loader to reload for those processes.
Need to know the feasibility of the task below. I am thinking of this as a
MapReduce/Spark effort.
I need to run a distributed sliding-window comparison for digital data
matching on top of Hadoop. The data (a Hive table) will be partitioned and
distributed across data nodes. Then the window
Hi Andy,
I built an Anaconda/Spark AMI a few months ago. I'm still iterating on it,
so if things break please report them. If you want to give it a whirl:
./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa -a ami-3ecd0c56
The nice thing about Anaconda is that it comes pre-baked with
ipython-notebook,
Hi Nicholas
I wrote some test code and found a way to get my matplotlib script to work
with the out-of-the-box cluster created by spark-ec2:
1. I commented out the Python inline magic:
#%matplotlib inline
2. I replaced clear_output(wait=True) with clear_output(True):
#clear_output(wait=True)
clear_output(True)
The instructions of
Thank you. HADOOP_CONF_DIR was missing.
On Wed, Sep 24, 2014 at 4:48 PM, Matt Narrell matt.narr...@gmail.com
wrote:
Yes, this works. Make sure you have HADOOP_CONF_DIR set on your Spark
machines
mn
On Sep 24, 2014, at 5:35 AM, Petr Novak oss.mli...@gmail.com wrote:
Hello,
if our
What version of Spark did you use? Can you try the master branch?
On Mon, Sep 29, 2014 at 1:52 PM, vdiwakar.malladi
vdiwakar.mall...@gmail.com wrote:
Thanks for your prompt response.
Still on further note, I'm getting the exception while executing the query.
SELECT data[0].name FROM people
Hi Daoyuan,
Do you mind applying this patch and looking at the exception again?
https://github.com/apache/spark/pull/2580
It has also been merged in master so if you pull from master, you should
have that.
On Mon, Sep 29, 2014 at 1:17 AM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:
Hi all,
This one explains it nicely
http://www.dofactory.com/topic/1816/spark-performing-a-join-and-getting-results-back-in-a-strongly-typed-collection.aspx
Thanks
Best Regards
On Tue, Sep 30, 2014 at 12:57 AM, Yin Huai huaiyin@gmail.com wrote:
What version of Spark did you use? Can you try the
Sorry, not that one; this one:
http://arjon.es/2014/07/01/processing-json-with-spark-sql/
Thanks
Best Regards
On Tue, Sep 30, 2014 at 1:43 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
This one explains it nicely
I am using the following releases:
Spark 1.1 (built using sbt/sbt -Dhadoop.version=2.2.0 -Phive assembly),
Apache HDFS 2.2
My job is able to create/add/read data in Hive Parquet-formatted tables
using HiveContext.
But after changing the schema, the job is not able to read existing data and
throws
Hi,
I've been using pyspark with my YARN cluster with success. The work I'm
doing involves using the RDD's pipe command to send data through a binary
I've made. I can do this easily in pyspark like so (assuming sc is
already defined):
sc.addFile("./dumb_prog")
t = sc.parallelize(range(10))
Hi,
I read the past posts about partition number, but am still a little confused
about partitioning strategy.
I have a cluster with 8 workers and 2 cores per worker. Is it true that the
optimal partition number should be 2-4 * total_coreNumber, or should it be
approximately equal to total_coreNumber?
A task is the work to be done on a partition for a given stage - you
should expect the number of tasks to be equal to the number of partitions
in each stage, though a task might need to be rerun (due to failure or need
to recompute some data).
2-4 times the cores in your cluster should be a good
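A quick sketch of the 2-4x guideline for the cluster described above (8 workers x 2 cores = 16 cores; the path is a placeholder):

    val totalCores = 16
    val data  = sc.textFile("hdfs:///input", 3 * totalCores)  // minPartitions hint at load time
    val tuned = data.repartition(3 * totalCores)              // or repartition an existing RDD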
-- Forwarded message --
From: Liquan Pei liquan...@gmail.com
Date: Mon, Sep 29, 2014 at 2:12 PM
Subject: Re: about partition number
To: anny9699 anny9...@gmail.com
The number of cores available in your cluster determines the number of
tasks that can be run concurrently. If your
We are also seeing this PARSING_ERROR(2) error due to
Caused by: java.io.IOException: failed to uncompress the chunk:
PARSING_ERROR(2)
at
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362)
at
Hi Anny,
Many more partitions is not recommended in general, as that creates a lot of
small tasks. All tasks need to be sent to worker nodes for execution, and
too many partitions increase the task-scheduling overhead.
Spark uses a synchronous execution model, which means that all tasks in a
stage need
I use Spark in a cluster shared with other applications. The number of
nodes (and cores) assigned to my job varies depending on how many unrelated
jobs are running in the same cluster.
Is there any way for me to determine at runtime how many cores have been
allocated to my job, so I can select an
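A hedged sketch (Spark 1.x) of inspecting at runtime what the job actually got; getExecutorStorageStatus counts the driver too, hence the -1, and defaultParallelism typically reflects total cores on the coarse-grained backends:

    val numExecutors = sc.getExecutorStorageStatus.length - 1
    val approxCores  = sc.defaultParallelism
    println(s"executors: $numExecutors, approx total cores: $approxCores")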
scala> user
res19: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
ParquetTableScan [id#0,name#1], (ParquetRelation /user/hive/warehouse/user), None
scala> order
res20: org.apache.spark.sql.SchemaRDD =
SchemaRDD[72] at RDD at SchemaRDD.scala:98
== Query
Apologies for asking yet again about Spark memory assumptions, but I can't
seem to keep it in my head.
If I use PairRDDFunctions.cogroup, it returns two iterables for every key. Do
the contents of these iterables have to fit in memory, or is the data
streamed?
Hi all,
I have a problem with my application when I increase the data size beyond 5GB
(the cluster has about 100GB memory to handle that). First I get this
warning:
WARN TaskSetManager: Lost task 10.1 in stage 4.1 (TID 408, backend-node1):
FetchFailed(BlockManagerId(3, backend-node0, 41484, 0),
Hi Romi,
I've observed this many times as well, so much so that on some clusters I
restart the workers every night in order to maintain these worker-master
connections.
I couldn't find an open SPARK ticket on it so filed
https://issues.apache.org/jira/browse/SPARK-3736 with you and Piotr
Hi Koert,
cogroup is a transformation on RDDs: it creates a CoGroupedRDD and then
performs some transformations on it. When an action is later called, the
compute() method of the CoGroupedRDD is invoked. Roughly speaking, each
element in the CoGroupedRDD is fetched one at a time. Thus the contents of
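A small sketch of what cogroup yields per key: one Iterable per input RDD (toy data):

    val left  = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))
    val right = sc.parallelize(Seq((1, 10), (3, 30)))
    val grouped = left.cogroup(right)  // RDD[(Int, (Iterable[String], Iterable[Int]))]
    grouped.mapValues { case (ls, rs) => (ls.toList, rs.toList) }.collect().foreach(println)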
Hello,
I am trying to build a system that does a very simple calculation on a
stream and displays the results in a graph, and I want to update the graph
every second or so. I think I have a fundamental misunderstanding of how
streams and rdd.pipe() work. I want to do the data visualization
I'll note that the DSL is pretty experimental. That said, you should be
able to do something like user.id.attr
On Mon, Sep 29, 2014 at 3:39 PM, Benyi Wang bewang.t...@gmail.com wrote:
scala> user
res19: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan
Hey Ameet,
Thanks for the info. I'm running into the same issue myself: my last
attempt crashed and my ulimit was 16834. I'm going to raise it and try again,
but yes, I would like to know the best practice for computing this. Can you
talk about the worker nodes, what are their specs? At least 45
I'm using the latest version i.e. Spark 1.1.0
Thanks.
I submitted a job in Yarn-Client mode, which simply reads from an HBase
table containing tens of millions of records and then does a count action.
The job runs for a much longer time than I expected, so I wonder whether it
was because the data to read was too much. Actually, there are 20 nodes in
I submitted the job in Yarn-Client mode using the following script:
export
SPARK_JAR=/usr/games/spark/xt/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar
export HADOOP_CLASSPATH=$(hbase classpath)
export
Can you look at your HBase UI to check whether your job is just reading from a
single region server?
Best,
--
Nan Zhu
On Monday, September 29, 2014 at 10:21 PM, Tao Xiao wrote:
I submitted a job in Yarn-Client mode, which simply reads from a HBase table
containing tens of millions of
Hi Reynold,
It seems I am getting an offset much larger than the file size.
reading
org.apache.spark.network.FileSegmentManagedBuffer(/mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data,
3154043, 588396) (actual file length 676025)
at
And the
/mnt/DP_disk2/animal/spark/spark-local-20140930102549-622d/11/shuffle_6_191_0.data
file is comparatively much smaller than other shuffle*.data files
From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Tuesday, September 30, 2014 10:54 AM
To: Reynold Xin
Cc: user@spark.apache.org
I took a look at HashOuterJoin, and it builds a hash table for both
sides.
This consumes quite a lot of memory when the partition is big. And it
doesn't reduce the iteration over the streamed relation, right?
Thanks!
Hi, Tao,
When I used newAPIHadoopRDD (Accumulo, not HBase) I found that I had to
specify executor-memory and num-executors explicitly on the command line or
else I didn't get any parallelism across the cluster.
I used --executor-memory 3G --num-executors 24 but obviously other
parameters will be
HBase TableInputFormat creates one input split per region. You cannot
achieve a high level of parallelism unless you have at least 5-10 regions per
region server. What does that mean? You probably have too few regions. You can
verify that in the HBase Web UI.
-Vladimir Rodionov
On Mon, Sep 29, 2014 at 7:21
Hi Haopu,
My understanding is that the hash tables on both the left and right sides are
used to include null values in the result in an efficient manner. If the hash
table is built on only one side, say the left side, and we perform a left outer
join, then for each row on the left side a scan over the right side is
Thanks Xiangrui. Appreciate the insights.
I have uploaded the initial version of my presentation at
http://goo.gl/1nBD8N
Cheers
k/
On Mon, Sep 29, 2014 at 12:17 AM, Xiangrui Meng men...@gmail.com wrote:
Hi Krishna,
Some planned features for MLlib 1.2 can be found via Spark JIRA:
Also some lines on another node:
14/09/30 10:22:31 ERROR nio.NioBlockTransferService: Exception handling buffer
message
java.io.IOException: Error in reading
org.apache.spark.network.FileSegmentManagedBuffer(/mnt/DP_disk10/animal/spark/spark-local-20140930101701-c9ee/38/shuffle_6_162_0.data,