Hi,
We use updateStateByKey, reduceByKeyAndWindow and checkpoint the data. We
store the offsets in Zookeeper. How can we make sure that the state of the
job is maintained when we redeploy the code?
Thanks!
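For reference, a minimal sketch of the checkpoint-recovery pattern this question is about, assuming a hypothetical checkpoint directory. Note that a Streaming checkpoint generally cannot be restored once the application code has been recompiled, so the keyed state from updateStateByKey does not survive a redeploy unless it is also persisted externally; the Zookeeper offsets only cover the input position.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulApp {
  // Hypothetical fault-tolerant checkpoint location (HDFS, S3, ...).
  val checkpointDir = "hdfs:///tmp/stateful-app-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("stateful-app")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // build input DStreams, updateStateByKey, reduceByKeyAndWindow here
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Restores the DStream graph and keyed state from the checkpoint if one
    // exists; otherwise builds a fresh context with createContext().
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}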
Hi,
I have checkpointing enabled in Spark Streaming, and I use updateStateByKey and
reduceByKeyAndWindow with inverse functions. How do I reduce the amount of
data that I am writing to the checkpoint, or clear out the data that I don't
care about?
Thanks!
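Two levers usually matter here, shown in a hedged sketch (the event stream, TTL and window lengths below are hypothetical): returning None from the updateStateByKey update function evicts a key from the state that gets checkpointed, and the reduceByKeyAndWindow overload that takes an inverse function also accepts a filter that prunes keys from the windowed state; dstream.checkpoint(interval) additionally controls how often the state is written.

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def shrinkState(events: DStream[(String, Long)]): Unit = {
  val ttlMs = 30 * 60 * 1000L  // hypothetical: evict keys idle for 30 minutes

  // State = (running count, last update time); returning None removes the key,
  // so it stops being written to the checkpoint.
  val counts = events.updateStateByKey[(Long, Long)] {
    (values: Seq[Long], state: Option[(Long, Long)]) =>
      val now = System.currentTimeMillis()
      val (count, lastSeen) = state.getOrElse((0L, now))
      if (values.nonEmpty) Some((count + values.sum, now))
      else if (now - lastSeen > ttlMs) None
      else Some((count, lastSeen))
  }
  counts.checkpoint(Seconds(100))  // checkpoint the state less frequently

  // With an inverse reduce function, filterFunc drops keys whose windowed
  // value is no longer interesting (here: empty counters).
  val windowed = events.reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,
    (a: Long, b: Long) => a - b,
    Seconds(300), Seconds(10),
    numPartitions = 4,
    filterFunc = { case (_, sum) => sum > 0L }
  )
  counts.print()
  windowed.print()
}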
Hi all,
I have a Spark standalone cluster. I am running a Spark Streaming
application on it and the deploy mode is client. I am looking for the best
way to monitor the cluster and application so that I will know when the
application/cluster is down. I cannot move to cluster deploy mode now.
I
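One building block, sketched under assumptions (the alerting call is a stand-in): a SparkListener registered in the client-mode driver can fire a notification when the application ends. It will not catch a driver that dies hard or a cluster outage; external checks against Spark's metrics sinks or the monitoring REST API are the usual complement for that.

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Fires when the driver shuts down; replace the println with a real alert
// (email, pager, HTTP call to your monitoring system).
class ShutdownAlertListener extends SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    System.err.println(s"Spark application ended at ${end.time}")
  }
}

def installAlerting(sc: SparkContext): Unit = {
  sc.addSparkListener(new ShutdownAlertListener)
}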
I'm assuming some things here, but hopefully I understand. So, basically
you have a big table of data distributed across a bunch of executors, and
you want an efficient way to call a native method for each row.
It sounds similar to a DataFrame writer to me, except that instead of writing
to disk or
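A hedged sketch of that idea (NativeLib and its process method are hypothetical stand-ins for the JNI wrapper): walk the rows with foreachPartition so per-partition setup, such as loading the native library, happens once per partition rather than once per row.

import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical native wrapper; a real one would load the JNI library once
// per executor JVM, e.g. in a lazy val or a static initializer.
object NativeLib {
  def process(row: Row): Unit = { /* JNI call goes here */ }
}

def callNativePerRow(df: DataFrame): Unit = {
  // Going through the underlying RDD keeps the per-partition iterator style;
  // nothing is collected back to the driver.
  df.rdd.foreachPartition { rows =>
    rows.foreach(NativeLib.process)
  }
}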
Depends on the need. For data exploration, I use notebooks whenever I can.
For development, any good text editor should work; I use Sublime. If you
want auto-completion and all, you can use Eclipse or PyCharm, I do not :)
On Wed, 28 Jun 2017 at 7:17 am, Xiaomeng Wan wrote:
Hi,
I recently switched from Scala to Python, and wondered which IDE people are
using for Python. I heard about PyCharm, Spyder, etc. How do they compare
with each other?
Thanks,
Shawn
I am using Spark via Java for a MySQL/ML (machine learning) project.
In the MySQL database, I have a column "status_change_type" of type enum =
{broke, fixed} in a table called "status_change" in a database called "test".
I have an object StatusChangeDB that constructs the needed structure for the
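A hedged sketch of reading that table over JDBC (host, credentials and driver class below are assumptions); the MySQL connector reports the enum column as a character type, so it should arrive in the DataFrame as a plain string.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("status-change").getOrCreate()

// Hypothetical connection details for the "test" database.
val statusChanges = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "status_change")
  .option("user", "test_user")
  .option("password", "test_password")
  .option("driver", "com.mysql.jdbc.Driver")
  .load()

statusChanges.printSchema()
statusChanges.filter("status_change_type = 'broke'").show()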
Hi,
I am using Apache Spark 2.0.2 RandomForest ML (standalone mode) for text
classification. A TF-IDF feature extractor is also used. The training part
runs without any issues and returns 100% accuracy. But when I try to do
prediction with the trained model and compute the test error, it fails with
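For context: 100% training accuracy usually means the model was scored on the same data it was fitted on, and a common prediction-time failure with RandomForestClassifier is a label column without class metadata. A hedged sketch of the whole flow (the column names "text" and "category" are assumptions): index the label with StringIndexer and fit one Pipeline, so the IDF model and label index learned on the training split are reused unchanged on the test split.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, Tokenizer}
import org.apache.spark.sql.DataFrame

def trainAndEvaluate(data: DataFrame): Unit = {
  val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

  val labelIndexer = new StringIndexer().setInputCol("category").setOutputCol("label")
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")

  // Fitting one Pipeline keeps the feature transformers and label index
  // consistent between the training and test splits.
  val model = new Pipeline().setStages(Array(labelIndexer, tokenizer, tf, idf, rf)).fit(train)

  val predictions = model.transform(test)
  val accuracy = new MulticlassClassificationEvaluator()
    .setLabelCol("label")
    .setPredictionCol("prediction")
    .setMetricName("accuracy")
    .evaluate(predictions)
  println(s"Test error = ${1.0 - accuracy}")
}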
Hi,
How do I find the time taken by each step in a stage of a Spark job? Also, how
do I find the bottleneck in each step, and whether a stage is skipped because
its RDDs are persisted in streaming?
I am trying to identify which step of a job is taking time in my streaming
job.
Thanks!
Thanks, I was able to get it up and running.
One thing I am not entirely sure about is whether Bahir provides Python bindings
for ZeroMQ. Looking at the code it does not seem like it, but I might be wrong.
thanks,
On Mon, Jun 26, 2017 at 5:13 PM Aashish Chaudhary <
aashish.chaudh...@kitware.com> wrote:
>
Thanks Bryan. This is one Spark application with one job. This job has 3
stages. The first 2 are basic reads from Cassandra tables and the 3rd is a
join between the two. I was expecting the first 2 stages to run in
parallel; however, they run serially. The job has enough resources.
On Tue, Jun 27,
Hi all,
I have code like below:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf

Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
//Logger.getLogger("org.apache.spark.streaming.dstream").setLevel(Level.DEBUG)
val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
//val sc =
Hello,
Reading around on the theory behind tree-based regression, I concluded
that there are various reasons to stop exploring the tree when a given
node has been reached. Among these, I have these two:
1. When starting to process a node, if its size (row count) is less than
X then consider
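For reference, both kinds of stopping criteria map onto tunable knobs in Spark ML's tree learners; a small sketch with hypothetical values (the same setters exist on RandomForestRegressor, and the "label"/"features" column names are assumptions).

import org.apache.spark.ml.regression.DecisionTreeRegressor

val dt = new DecisionTreeRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMinInstancesPerNode(20)   // criterion 1: do not split nodes smaller than 20 rows
  .setMinInfoGain(0.001)        // stop when a split no longer reduces impurity enough
  .setMaxDepth(10)              // hard cap on tree depth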
Hi all,
I am using Hadoop 2.6.5 and Spark 2.1.0, and I run a job using spark-submit
with master set to "yarn". When Spark starts, I can load the Spark UI page
on port 4040, but no job is shown in the page. After the following logs
(registering the application master on YARN), the Spark UI is not accessible
Hello Spark gurus,
Could you please shed some light on the purpose of having two identical
functions in RDD: RDD.context [1] and RDD.sparkContext [2]?
RDD.context seems to be used more frequently across the source code.
[1]
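As far as I can tell they are aliases for the same underlying SparkContext field, kept for source compatibility; a quick check, assuming an existing SparkContext named sc:

val rdd = sc.parallelize(1 to 10)
// Both accessors return the SparkContext the RDD was created on,
// so this prints true.
println(rdd.context eq rdd.sparkContext)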
Hello all, I am running PySpark 2.1.1 as the user jomernik. I am working
through some documentation here:
https://spark.apache.org/docs/latest/mllib-ensembles.html#random-forests
I was working on the Random Forest Classification example and found it to be
working. That said, when I try to save the
Satish,
Are these two separate applications submitted to the YARN scheduler? If so, then
you would expect to see the original case run in parallel.
However, if this is one application, your submission to YARN guarantees that
this application will fairly contend for resources
Thanks all. To reiterate - stages inside a job can run in parallel as long
as (a) there is no sequential dependency and (b) the job has sufficient
resources.
However, my code was launching 2 jobs, and they run sequentially, as you
rightly pointed out.
The issue which I was trying to highlight with that
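Since the two reads end up in separate jobs, they only overlap if the driver submits them concurrently; a hedged sketch using Scala Futures (the join key "id" and the caching choice are assumptions).

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.DataFrame

def runReadsInParallel(left: DataFrame, right: DataFrame): Unit = {
  // Each count() is an action, i.e. its own job; launching them from two
  // threads lets the scheduler run them concurrently if resources allow.
  val leftJob = Future { left.cache().count() }
  val rightJob = Future { right.cache().count() }
  Await.result(Future.sequence(Seq(leftJob, rightJob)), Duration.Inf)

  // Third job: the join, now reading both inputs from cache.
  left.join(right, "id").count()
}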
Thanks a lot.
Thanks & Regards
Saroj Kumar Choudhury
Tata Consultancy Services
(UNIT-I)- KALINGA PARK
IT/ITES SPECIAL ECONOMIC ZONE (SEZ),PLOT NO. 35,
CHANDAKA INDUSTRIAL ESTATE, PATIA,
Bhubaneswar - 751 024,Orissa
India
Ph:- +91 674 664 5154
Mailto: saro...@tcs.com
Website: http://www.tcs.com