From the chart you pasted, I guess you only have one receiver with storage
level two copies, so most of your tasks are located on two executors. You
could use repartition to redistribute the data more evenly across the
executors. Adding more receivers is another solution.
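A minimal Scala sketch of the repartition idea, assuming a receiver-based DStream named lines and a target parallelism of 12 (both are assumptions, not from the original post):
// Sketch only: spread the received blocks over more partitions before heavy processing.
val repartitioned = lines.repartition(12)
repartitioned.foreachRDD { rdd =>
  // downstream work now runs on partitions distributed across the executors
  println(s"partitions: ${rdd.partitions.length}")
}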
2015-04-30 14:38
This is the Spark mailing list :/
Yes, you can configure the following in mapred-site.xml for that:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
Thanks
Best Regards
On Tue, Apr 28, 2015 at 11:00 PM, Shushant Arora shushantaror...@gmail.com
wrote:
In
I am using HBase 0.94.8.
On Wed, Apr 29, 2015 at 11:56 PM, Ted Yu yuzhih...@gmail.com wrote:
Can you enable HBase DEBUG logging in log4j.properties so that we can have
more clues?
What hbase release are you using ?
Cheers
On Wed, Apr 29, 2015 at 4:27 AM, Saurabh Gupta
Did you have a directory layout like this?
base/
| data-file-2.parquet
| batch_id=1/
| | data-file-1.parquet
Cheng
On 4/28/15 11:20 AM, sranga wrote:
Hi
I am getting the following error when persisting an RDD in parquet format to
an S3 location. This is code that was working in the 1.2
Hi everyone,
Let's assume I have a complex workflow of more than 10 datasources as input
- 20 computations (some creating intermediary datasets and some merging
everything for the final computation) - some taking on average 1 minute to
complete and some taking more than 30 minutes.
What would be
Hello,
we send a lot of small jobs to Spark (up to 500 in a second). When profiling,
I see Throwable.getStackTrace() at the top of the memory profiler, which is
caused by SparkContext.getCallSite; this is memory consuming.
We use the Java API; I tried to call SparkContext.setCallSite("-") before
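For what it's worth, a minimal sketch of pinning the call site around a burst of small jobs; sc is an existing SparkContext, the label and the loop are assumptions for illustration, and whether this removes the getStackTrace cost is not confirmed here:
// Sketch: set a fixed call site before submitting many small jobs, then clear it.
sc.setCallSite("many-small-jobs")
try {
  (1 to 500).foreach { _ => sc.parallelize(1 to 10).count() }
} finally {
  sc.clearCallSite()
}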
If the data is huge and sits in S3, that will mean a lot of network traffic.
If instead the data is available in HDFS (with proper replication), it will be
faster, as most of the time the data will be available as
PROCESS_LOCAL/NODE_LOCAL to the executor.
Thanks
Best Regards
On Wed, Apr
It seems that the data size is only 2.9MB, far less than the default RDD
size. How about putting more data into Kafka? And what about the number of
topic partitions in Kafka?
Best regards,
Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com
My Flickr:
Hello,
I would like to add a StructType column to a DataFrame. What would be the best
way to do it? I'm not sure if it is possible using withColumn. A possible way is
to convert the DataFrame into an RDD[Row], add the struct and then convert it
back to a DataFrame, but that seems like overkill.
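A minimal sketch of the RDD[Row] round-trip described above, for Spark 1.3; df, the field name extra, and the nested values are assumptions for illustration:
// Sketch: append a nested struct column by rebuilding the schema and the rows.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val extraType = StructType(Seq(StructField("a", IntegerType), StructField("b", StringType)))
val newSchema = StructType(df.schema.fields :+ StructField("extra", extraType))
val newRows = df.rdd.map(row => Row.fromSeq(row.toSeq :+ Row(1, "x")))
val withStruct = sqlContext.createDataFrame(newRows, newSchema)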
I guess I may have
Hi Akhil,
I discovered the reason for this problem. There was an issue with my
deployment (Google Cloud Platform). My spark master was on a different
region than the slaves. This resulted in huge scheduler delays.
Thanks,
Anshul
On Thu, Apr 30, 2015 at 1:39 PM, Akhil Das
You could try increasing your heap space explicitly, e.g. export
_JAVA_OPTIONS=-Xmx10g. It's not the correct approach, but give it a try.
Thanks
Best Regards
On Tue, Apr 28, 2015 at 10:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have a SparkApp that completes in 45 mins for 5 files (5*750MB
Hi Madhvi,
If I only install spark on one node, and use spark-submit to run an
application, which are the Worker nodes? And where are the executors?
Thanks,
Xiaohe
On Thu, Apr 30, 2015 at 12:52 PM, madhvi madhvi.gu...@orkash.com wrote:
Hi,
Follow the instructions to install on the following
Does this speed up?
val rdd = sc.parallelize(1 to 100, 30)
rdd.count
Thanks
Best Regards
On Wed, Apr 29, 2015 at 1:47 AM, Anshul Singhle ans...@betaglide.com
wrote:
Hi,
I'm running the following code in my cluster (standalone mode) via spark
shell -
val rdd = sc.parallelize(1 to
Hi,
you have to specify the worker nodes of the Spark cluster at the time of
configuring the cluster.
Thanks
Madhvi
On Thursday 30 April 2015 01:30 PM, xiaohe lan wrote:
Hi Madhvi,
If I only install spark on one node, and use spark-submit to run an
application, which are the Worker
Hi all
My environment info
Hadoop release version: HDP 2.1
Kafka: 0.8.1.2.1.4.0
Spark: 1.1.0
My question:
I ran a Spark streaming program on YARN. My Spark streaming program will
read data from Kafka and do some processing. But I found there is
always only ONE executor doing the processing. As
Hello Lin Hao
Thanks for your reply. I will try to produce more data into Kafka.
I run three Kafka brokers. Following is my topic info.
Topic:kyle_test_topic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: kyle_test_topic Partition: 0 Leader: 3 Replicas: 3,4 Isr: 3,4
Topic: kyle_test_topic
Have a look at KafkaRDD
https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kafka/KafkaRDD.html
Thanks
Best Regards
On Wed, Apr 29, 2015 at 10:04 AM, dgoldenberg dgoldenberg...@gmail.com
wrote:
Hi,
I'm wondering about the use-case where you're not doing continuous,
1. Do a group by and get the max. In your example: select id, max(DT) from t
group by id. Name this j.
2. Join t and j on id and DT = mxdt.
This is how we used to query an RDBMS before window functions showed up.
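A minimal sketch of that pattern with SQLContext, using the table t, columns id and DT, and the alias mxdt from the description above (the table registration itself is assumed):
// Sketch: max-per-id without window functions, via an aggregate subquery plus a join.
val j = sqlContext.sql("select id, max(DT) as mxdt from t group by id")
j.registerTempTable("j")
val latest = sqlContext.sql("select t.* from t join j on t.id = j.id and t.DT = j.mxdt")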
As I understand from SQL, group by allows you to do sum(), average(), max(),
min(). But how do I
This is how I used to do it:
- Login to the ec2 cluster (master)
- Make changes to the spark, and build it.
- Stop the old installation of spark (sbin/stop-all.sh)
- Copy old installation conf/* to modified version's conf/
- Rsync modified version to all slaves
- do sbin/start-all.sh from the
Same here (using vanilla spark-ec2), I believe this started when I moved
from 1.3.0 to 1.3.1. Only thing unusual with the setup is a few extra
security parameters on EC2 but was the same as in 1.3.0. Did you solve this
problem?
On Thu, Apr 2, 2015 at 8:02 AM, Ganon Pierce ganon.pie...@me.com
You can replace your cluster's assembly jar (on master and workers) with your
custom-built assembly jar.
Thanks
Best Regards
On Tue, Apr 28, 2015 at 9:45 PM, Bo Fu b...@uchicago.edu wrote:
Hi all,
I have an issue. I added some timestamps in Spark source code and built it
using:
mvn package
One way you could try: inside the map, you can have a synchronized
thread and block the map until the thread finishes its processing.
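A minimal sketch of that idea; doWork is a hypothetical placeholder for the real per-record processing:
// Sketch: hand each record to a thread and block the map task until it finishes.
rdd.map { record =>
  val worker = new Thread(new Runnable {
    override def run(): Unit = doWork(record) // doWork is a placeholder
  })
  worker.start()
  worker.join() // block this task until the thread completes
  record
}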
Thanks
Best Regards
On Wed, Apr 29, 2015 at 9:38 AM, Nastooh Avessta (navesta)
nave...@cisco.com wrote:
Hi
In a multi-node setup, I am
I'm unclear why I'm getting this exception.
It seems to have realized that I want to enable event logging, but it is
ignoring where I want it to log to, i.e. file:/opt/cb/tmp/spark-events, which
does exist.
spark-default.conf
# Example:
spark.master spark://master1:7077,master2:7077
Yes, there is now such a profile, though it is essentially redundant and
doesn't configure things differently from 2.4, besides the Hadoop version of
course. That is why it hadn't existed before, since the 2.4 profile covers 2.4+.
People just kept filing bugs to add it, but the docs are correct: you don't
Hi, in a test on SparkSQL 1.3.0, multiple threads are doing select on the
same SQLContext instance, but the exception below is thrown, so it looks
like SQLContext is NOT thread safe? I think this is not the desired
behavior.
==
java.lang.RuntimeException: [1.1] failure: ``insert'' expected but
No, you should read:
if spark.local.dir is specified, spark.local.dir will be ignored.
This has been reworded (hopefully for the best) in 1.3.1:
https://spark.apache.org/docs/1.3.1/running-on-yarn.html
Christophe.
On 17/04/2015 18:18, shenyanls wrote:
According to the documentation:
The
Now able to solve the issue by setting
SparkConf sconf = new SparkConf().setAppName("App").setMaster("local")
and
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
Standalone hbase has a table 'test'
hbase(main):001:0> scan 'test'
ROW COLUMN+CELL
row1
Hi all
Producing more data into Kafka is not effective in my situation,
because the speed of reading Kafka is consistent. I will adopt Saisai's
suggestion to add more receivers.
Kyle
2015-04-30 14:49 GMT+08:00 Saisai Shao sai.sai.s...@gmail.com:
From the chart you pasted, I guess you only
Actually this is a SQL parse exception; are you sure your SQL is right?
Sent from my iPhone
On Apr 30, 2015, at 18:50, Haopu Wang hw...@qilinsoft.com wrote:
Hi, in a test on SparkSQL 1.3.0, multiple threads are doing select on a
same SQLContext instance, but below exception is thrown, so it looks
like
I am at a crossroads now, and expert advice would help me decide what the next
course of the project is going to be.
Background: at our company we process tons of data to help build an
experimentation platform. We fire more than 300 M/R jobs over petabytes of
data; it takes 24 hours and does lots of joins. Its
Thank you guys for the input.
Ayan, I am not sure how this can be done using reduceByKey. As far as I can
see (but I am not so advanced with Spark), this requires a groupByKey, which
can be very costly. What would be nice is to transform the dataset which
contains all the vectors like:
val
Did not work. Same problem.
On Thu, Apr 30, 2015 at 1:28 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You could try increasing your heap space explicitly, e.g. export
_JAVA_OPTIONS=-Xmx10g. It's not the correct approach, but give it a try.
Thanks
Best Regards
On Tue, Apr 28, 2015 at 10:35 PM,
bq. a single query on one filter criteria
Can you tell us more about your filter ? How selective is it ?
Which hbase release are you using ?
Cheers
On Thu, Apr 30, 2015 at 7:23 AM, Siddharth Ubale
siddharth.ub...@syncoms.com wrote:
Hi,
I want to use Spark as Query engine on HBase with
482 MB should be small enough to be distributed as a set of broadcast
variables. Then you can use local features of Spark to process it.
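A minimal sketch of that approach; the small lookup map, bigRdd, and the key field are assumptions for illustration:
// Sketch: broadcast the small dataset once, then read it locally on each executor.
val small: Map[String, Int] = Map("a" -> 1, "b" -> 2) // stand-in for the ~482 MB dataset
val bc = sc.broadcast(small)
val enriched = bigRdd.map { case (key, value) =>
  (key, value, bc.value.getOrElse(key, 0)) // local lookup, no shuffle
}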
-Original Message-
From: shahab shahab.mok...@gmail.com
Sent: 4/30/2015 9:42 AM
To: user@spark.apache.org user@spark.apache.org
Subject: is there
JavaDStream.saveAsTextFiles does not exist in the Spark Java API. If you want to
persist every RDD in your JavaDStream you should do something like this:
words.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
@Override
public Void call(JavaRDD<SparkFlumeEvent> rddWords, Time arg1) throws
The error indicates incompatible protobuf versions.
Please take a look at 4.1.1 under
http://hbase.apache.org/book.html#basic.prerequisites
Cheers
On Thu, Apr 30, 2015 at 3:49 AM, Saurabh Gupta saurabh.gu...@semusi.com
wrote:
Now able to solve the issue by setting
SparkConf sconf = *new*
After repartitioning a DataFrame in Spark 1.3.0 I get a .parquet exception
when saving to Amazon's S3. The data that I try to write is 10G.
logsForDate
.repartition(10)
.saveAsParquetFile(destination) // <-- Exception here
The exception I receive is:
java.io.IOException: The file
I think this will be slow because you have to do a group by and then a join (my
table has 7 million records). I am looking for something like reduceByKey(),
e.g.
rdd.reduceByKey((a, b) => if (a.timeStamp > b.timeStamp) a else b)
Does it have a similar thing in DataFrame? Of course I can convert a
Hi,
I load data from Cassandra into Spark. The entire data is around 482
MB, and it is cached as TempTable in 7 tables. How can I enforce Spark to
cache data in both worker nodes, not only in ONE worker (as in my case)?
I am using Spark 2.1.1 with spark-connector 1.2.0-rc3. I have small
You are right. After I moved from HBase 0.98.1 to 1.0.0 this problem got
solved. Thanks all!
Date: Wed, 29 Apr 2015 06:58:59 -0700
Subject: Re: HBase HTable constructor hangs
From: yuzhih...@gmail.com
To: tridib.sama...@live.com
CC: d...@ocirs.com; user@spark.apache.org
Can you verify whether
Unfortunately, I think the SQLParser is not threadsafe. I would recommend
using HiveQL.
On Thu, Apr 30, 2015 at 4:07 AM, Wangfei (X) wangf...@huawei.com wrote:
Actually this is a SQL parse exception; are you sure your SQL is right?
Sent from my iPhone
On Apr 30, 2015, at 18:50, Haopu Wang
Thought about it some more, and simplified the problem space for
discussion:
Given: JavaPairRDD<String, Integer> c1; // c1.count() == 8000.
Goal: JavaPairRDD<Tuple2<String,Integer>, Tuple2<String,Integer>> c2; // all
lexicographical pairs
Where: all lexicographic permutations on c1 ::
The data ingestion is in the outermost portion of the foreachRDD block. Although
I now close the JDBC statement, the same exception happened again. It
seems it is not related to the data ingestion part.
On Wed, Apr 29, 2015 at 8:35 PM, Cody Koeninger c...@koeninger.org wrote:
Use lsof to see what
I am facing same issue, do you have any solution ?
On Mon, Apr 27, 2015 at 9:43 PM, Deepak Gopalakrishnan dgk...@gmail.com
wrote:
Hello All,
I dug a little deeper and found this error :
15/04/27 16:05:39 WARN TransportChannelHandler: Exception in connection from
/10.1.0.90:40590
Full Exception
15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at
VISummaryDataProvider.scala:37) failed in 884.087 s
15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed: collectAsMap
at VISummaryDataProvider.scala:37, took 1093.418249 s
15/04/30 09:59:49 ERROR
Did you use lsof to see what files were opened during the job?
On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:
The data ingestion is in outermost portion in foreachRDD block. Although
now I close the statement of jdbc, the same exception happened again. It
seems it
Hi !
Can we load a HANA database table using Spark's JdbcRDD?
Thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/sap-hana-database-laod-using-jdbcRDD-tp22726.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
You fundamentally want (half of) the Cartesian product, so I don't think it
gets a lot faster to form this. You could implement this on cogroup
directly and maybe avoid forming the tuples you will filter out.
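A minimal sketch of the straightforward Cartesian version, with a toy RDD standing in for the real data; keeping only k1 < k2 retains each unordered pair once:
// Sketch: all unordered pairs (half the Cartesian product) via cartesian + filter.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val allPairs = pairs.cartesian(pairs).filter { case ((k1, _), (k2, _)) => k1 < k2 }
allPairs.collect().foreach(println)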
I'd think more about whether you really need to do this thing, or whether
there is
I terminated the old job and started a new one. Currently, the Spark
streaming job has been running for 2 hours, and when I use lsof, I do not
see many files related to the Spark job.
BTW, I am running Spark streaming using local[2] mode. The batch is 5
seconds and it has around 50 lines to read
Hi all,
I have an RDD with MANY columns (e.g., hundreds). How do I add one more
column at the end of this RDD?
For example, if my RDD is like below:
123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
..
29, 94,
And if I may ask, how long does it take in the HBase CLI? I would not expect
Spark to improve the performance of HBase. At best Spark will push down the
filter to HBase. So I would try to optimise any additional overhead, like
bringing data into Spark.
On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com wrote:
One optimization is to reduce the shuffle by first aggregating locally (only
keeping the max for each name), and then doing a reduceByKey.
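A minimal sketch of that idea with toy (name, value) pairs; the per-partition map keeps only one record per name before the shuffle:
// Sketch: pre-aggregate the max per name inside each partition, then reduce globally.
val records = sc.parallelize(Seq(("a", 3), ("b", 7), ("a", 9), ("b", 2)))
val localMax = records.mapPartitions { it =>
  val m = scala.collection.mutable.Map.empty[String, Int]
  it.foreach { case (k, v) => m(k) = math.max(v, m.getOrElse(k, Int.MinValue)) }
  m.iterator
}
val globalMax = localMax.reduceByKey((a, b) => math.max(a, b)) // small shuffle: at most one record per name per partition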
Thanks.
Zhan Zhang
On Apr 24, 2015, at 10:03 PM, ayan guha
guha.a...@gmail.commailto:guha.a...@gmail.com wrote:
Here you go
t =
Hi Zhan,
How would this be achieved? Should the data be partitioned by name in this
case?
Thank you!
Best,
Wenlei
On Thu, Apr 30, 2015 at 7:55 PM, Zhan Zhang zzh...@hortonworks.com wrote:
One optimization is to reduce the shuffle by first aggregate locally
(only keep the max for each
1) As I am limited to 12G and was doing a broadcast join (collect the data
and then publish), it was throwing OOM. The data size was 25G and the limit was
12G, hence I reverted back to a regular join.
2) I started using repartitioning; I started with 100 and am now trying 200.
At the beginning it looked
This looks like a bug. Mind opening a JIRA?
On Thu, Apr 30, 2015 at 3:49 PM, Justin Yip yipjus...@prediction.io wrote:
After some trial and error, using DataType solves the problem:
df.withColumn("millis", $"eventTime".cast(org.apache.spark.sql.types.LongType) * 1000)
Justin
On Thu, Apr
Is 'new' a reserved word in MySQL?
On Thu, Apr 30, 2015 at 2:41 PM, Francesco Bigarella
francesco.bigare...@gmail.com wrote:
Do you know how I can check that? I googled a bit but couldn't find a
clear explanation about it. I also tried to use explain() but it doesn't
really help.
I still
Really not an expert here, but try the following ideas:
1) I assume you are using YARN; this blog is very good about resource
tuning:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
2) If 12G is a hard limit in this case, then you have no option but to lower
Hi Deepak,
I wrote a couple posts with a bunch of different information about how to
tune Spark jobs. The second one might be helpful with how to think about
tuning the number of partitions and resources? What kind of OOMEs are you
hitting?
After some trial and error, using DataType solves the problem:
df.withColumn("millis", $"eventTime".cast(org.apache.spark.sql.types.LongType) * 1000)
Justin
On Thu, Apr 30, 2015 at 3:41 PM, Justin Yip yipjus...@prediction.io wrote:
Hello,
I was able to cast a timestamp into long using
You don't need to install Spark. Just download or build a Spark package
that matches your Yarn version. And ensure that HADOOP_CONF_DIR or
YARN_CONF_DIR points to the directory which contains the (client side)
configuration files for the Hadoop cluster.
See instructions here:
Hello,
I was able to cast a timestamp into long using
df.withColumn("millis", $"eventTime".cast("long") * 1000)
in spark 1.3.0.
However, this statement returns a failure with spark 1.3.1. I got the
following exception:
Exception in thread main org.apache.spark.sql.types.DataTypeException:
Unsupported
I’m very perplexed with the following. I have a set of AVRO generated
objects that are sent to a SparkStreaming job via Kafka. The SparkStreaming
job follows the receiver-based approach. I am encountering the below error
when I attempt to deserialize the payload:
15/04/30 17:49:25 INFO
spark.history.fs.logDirectory is for the history server. For Spark
applications, they should use spark.eventLog.dir. Since you commented out
spark.eventLog.dir, it will default to /tmp/spark-events, and this folder does
not exist.
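A minimal sketch of setting the application-side event log location explicitly; the path is the one mentioned earlier in this thread and must already exist:
// Sketch: enable event logging and point it at an existing directory instead of the /tmp default.
val conf = new org.apache.spark.SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:/opt/cb/tmp/spark-events")
val sc = new org.apache.spark.SparkContext(conf)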
Best Regards,
Shixiong Zhu
2015-04-29 23:22 GMT-07:00 James King
Do you know how I can check that? I googled a bit but couldn't find a clear
explanation about it. I also tried to use explain() but it doesn't really
help.
I still find it unusual that I have this issue only for the equality operator
but not for the others.
Thank you,
F
On Wed, Apr 29, 2015 at 3:03
A tad off topic, but could still be relevant.
Accumulo's design is a tad different in the realm of being able to shard
and perform set intersections/unions server-side (through seeks). I've got
an adapter for Spark SQL on top of a document store implementation in
Accumulo that accepts the
I think you need to specify new in single quotes. My guess is the query
showing up in the DB is like
...where status=new or
...where status="new"
In either case MySQL assumes new is a column.
What you need is the form below:
...where status='new'
You need to provide your quotes accordingly.
Easiest way
Hi,
We have a scenario as below and would like your suggestion.
We have an app.conf file with propX=A as the default, built into the fat jar file
that is provided to spark-submit.
We have an env.conf file with propX=B that we would like spark-submit to take as
input to overwrite the default and populate to
I am trying to generate all (N-1)(N)/2 lexicographical 2-tuples from a
glom() JavaPairRDD<List<Tuple2>>. The construction of these initial
Tuple2's JavaPairRDD<AQ,Integer> space is well formed from case classes I
provide it (AQ, AQV, AQQ, CT) and is performant; minimized code:
SparkConf conf = new
Hi,
We want to have Marathon start and monitor Chronos, so that when a
Chronos-based Spark job fails, Marathon automatically restarts it in the
scope of Chronos. Will this approach work if we start Spark jobs as shell
scripts from Chronos or Marathon?
Hi,
I want to use Spark as Query engine on HBase with sub second latency.
I am using Spark version 1.3, and followed the steps below on an HBase table
with around 3.5 lakh (350,000) rows:
1. Mapped the DataFrame to the HBase table. RDDCustomers maps to the HBase
table which is used to create the
Thanks. Sounds reasonable, perhaps blocking on local fifo or a socket would do.
Cheers,
Nastooh Avessta
ENGINEER.SOFTWARE ENGINEERING
nave...@cisco.com
Phone: +1 604 647 1527
Cisco Systems Limited
595 Burrard Street, Suite 2123
After repartitioning a DataFrame in Spark 1.3.0 I get a .parquet exception
when saving to Amazon's S3. The data that I try to write is 10G.
logsForDate
.repartition(10)
.saveAsParquetFile(destination) // <-- Exception here
The exception I receive is:
java.io.IOException: The file being
Thanks Alex, but 482 MB was just an example size, and I am looking for a
generic approach to doing this without broadcasting.
Any idea?
best,
/Shahab
On Thu, Apr 30, 2015 at 4:21 PM, Alex lxv...@gmail.com wrote:
482 MB should be small enough to be distributed as a set of broadcast
variables. Then
Looking at your classpath, it looks like you've compiled Spark yourself.
Depending on which version of Hadoop you've compiled against (looks like
it's Hadoop 2.2 in your case), Spark will have its own version of
protobuf. You should try making sure both your HBase and Spark are
compiled