Re: Job Fails on sortByKey

2015-02-18 Thread Saisai Shao
Would you mind explaining your problem a little more specifically, such as the exceptions you met, so someone who has experience with it could give advice? Thanks Jerry 2015-02-19 1:08 GMT+08:00 athing goingon athinggoin...@gmail.com: hi, I have a job that fails on a shuffle during a

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
Hi Tim, I think this code will still introduce shuffle even when you call repartition on each input stream. Actually this style of implementation will generate more jobs (one job per input stream) than unioning into one stream with DStream.union(), and union normally has no special overhead as
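A minimal Scala sketch of the union-then-repartition approach described above; the stream element type and partition count are illustrative, not taken from the original thread:

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    // Union all receiver input streams into one DStream first, then repartition once,
    // rather than calling repartition() on every input stream (one job per stream).
    def unionAndRepartition(ssc: StreamingContext,
                            inputStreams: Seq[DStream[(String, String)]],
                            numPartitions: Int): DStream[(String, String)] =
      ssc.union(inputStreams).repartition(numPartitions)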

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
partitions (some will probably sit idle). But do away with dStream partitioning altogether. Right? Thanks, - Tim On Thu, Feb 12, 2015 at 11:03 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Hi Tim, I think maybe you can try this way: create Receiver per executor and specify thread

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
at 9:45 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Hi Tim, I think this code will still introduce shuffle even when you call repartition on each input stream. Actually this style of implementation will generate more jobs (job per each input stream) than union into one stream as called

Re: Why spark master consumes 100% CPU when we kill a spark streaming app?

2015-03-10 Thread Saisai Shao
Probably the cleanup work, like cleaning shuffle files and tmp files, costs too much CPU; if we run Spark Streaming for a long time, lots of files will be generated, so cleaning up these files before the app exits could be time-consuming. Thanks Jerry 2015-03-11 10:43 GMT+08:00 Tathagata Das

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Saisai Shao
Looks like you have to build Spark against the related Hadoop version, otherwise you will meet the exception mentioned. You could follow this doc: http://spark.apache.org/docs/latest/building-spark.html 2015-03-25 15:22 GMT+08:00 sandeep vura sandeepv...@gmail.com: Hi Sparkers, I am trying to load

Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread Saisai Shao
Also you could use a Producer singleton to improve the performance; since now you have to create a Producer for each partition in each batch duration, you could create a singleton object and reuse it (Producer is thread safe as I know). -Jerry 2015-03-30 15:13 GMT+08:00 Saisai Shao sai.sai.s
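A minimal sketch of such a producer singleton, assuming the Kafka 0.8-era kafka.producer API that was common with Spark 1.x; the broker list, topic name and object name are placeholders:

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    object KafkaProducerSingleton {
      // Created lazily once per executor JVM and reused across batches and partitions,
      // instead of building a new Producer for every partition in every batch.
      lazy val producer: Producer[String, String] = {
        val props = new Properties()
        props.put("metadata.broker.list", "broker1:9092,broker2:9092")
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        new Producer[String, String](new ProducerConfig(props))
      }
    }

    // Usage inside the streaming job:
    // stream.foreachRDD { rdd =>
    //   rdd.foreachPartition { iter =>
    //     iter.foreach(msg =>
    //       KafkaProducerSingleton.producer.send(
    //         new KeyedMessage[String, String]("output-topic", msg)))
    //   }
    // }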

Re: Running Spark in Local Mode

2015-03-29 Thread Saisai Shao
Hi, I think for local mode, the number N (N number of threads) basically equals N available cores in ONE executor (worker), not N workers. You could imagine local[N] as having one worker with N cores. I'm not sure you could set the memory usage for each thread, for Spark the memory is

Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread Saisai Shao
() } } Thanks & Best regards! 罗辉 San.Luo - Original Message - From: Saisai Shao sai.sai.s...@gmail.com To: luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: How SparkStreaming output messages to Kafka? Date: 2015-03-30 14:03 Hi Hui, Did you try the direct Kafka

Re: Re: Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread Saisai Shao
? Thanks & Best regards! 罗辉 San.Luo - Original Message - From: luohui20...@sina.com To: Saisai Shao sai.sai.s...@gmail.com Cc: user user@spark.apache.org Subject: Re: Re: Re: How SparkStreaming output messages to Kafka? Date: 2015-03-30 16:46 Hi Saisai, following your advice, i

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-30 Thread Saisai Shao
Shuffle write will finally spill the data into file system as a bunch of files. If you want to avoid disk write, you can mount a ramdisk and configure spark.local.dir to this ram disk. So shuffle output will write to memory based FS, and will not introduce disk IO. Thanks Jerry 2015-03-30 17:15
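A minimal sketch of pointing spark.local.dir at a memory-backed filesystem; the mount point is an assumption (whatever ramdisk/tmpfs path you actually mounted), and on a cluster manager the cluster-side local-dir settings may override this:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("shuffle-on-ramdisk")
      // Scratch space (including shuffle output) now goes to the memory-backed FS,
      // trading RAM for the avoided disk IO.
      .set("spark.local.dir", "/mnt/ramdisk")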

Re: Spark + Kafka

2015-04-01 Thread Saisai Shao
Would you please share your code snippet, so we can identify whether there is anything wrong in your code. Besides, would you please grep your driver's debug log to see if there's any debug log like Stream xxx received block xxx; this means that Spark Streaming keeps receiving data from

Re: WordCount example

2015-03-26 Thread Saisai Shao
Hi, Did you run the word count example in Spark local mode or another mode? In local mode you have to set local[n], where n >= 2. For other modes, make sure the available cores are larger than 1, because the receiver inside Spark Streaming is wrapped as a long-running task, which will at least occupy one core.
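A minimal sketch of the point above: in local mode the receiver occupies one core, so at least local[2] is needed for the streaming word count to make progress:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setMaster("local[2]")   // one thread for the receiver, at least one for processing
      .setAppName("streaming-wordcount")
    val ssc = new StreamingContext(conf, Seconds(1))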

Re: Spark-sql query got exception.Help

2015-03-25 Thread Saisai Shao
Would you mind running again to see if this exception can be reproduced, since exceptions in MapOutputTracker seldom occur; maybe it is some other exception that leads to this error. Thanks Jerry 2015-03-26 10:55 GMT+08:00 李铖 lidali...@gmail.com: One more exception.How to fix it .Anybody help

Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread Saisai Shao
I think currently there's no API in Spark Streaming you can use to get the file names for file input streams. Actually it is not trivial to support this; maybe you could file a JIRA describing what you want the community to support, so anyone who is interested can take a crack at it. Thanks Jerry

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread Saisai Shao
in RDD to figure out the file information where the data in RDD is from -- bit1...@163.com *From:* Saisai Shao sai.sai.s...@gmail.com *Date:* 2015-04-29 10:10 *To:* lokeshkumar lok...@dataken.net *CC:* spark users user@spark.apache.org *Subject:* Re: Spark

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-29 Thread Saisai Shao
, Apr 29, 2015 at 8:33 AM, bit1...@163.com bit1...@163.com wrote: For the SparkContext#textFile, if a directory is given as the path parameter ,then it will pick up the files in the directory, so the same thing will occur. -- bit1...@163.com *From:* Saisai Shao

Re: The Processing loading of Spark streaming on YARN is not in balance

2015-04-30 Thread Saisai Shao
From the chart you pasted, I guess you only have one receiver with a two-copy storage level, so mostly your tasks are located on two executors. You could use repartition to redistribute the data more evenly across the executors. Also, adding more receivers is another solution. 2015-04-30 14:38
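A minimal sketch of the repartition suggestion; the stream name and the partition count (executors times cores per executor) are illustrative:

    import org.apache.spark.streaming.dstream.DStream

    // Spread data from a single receiver across all executors before the heavy stages.
    def rebalance[T](receiverStream: DStream[T],
                     numExecutors: Int,
                     coresPerExecutor: Int): DStream[T] =
      receiverStream.repartition(numExecutors * coresPerExecutor)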

Re: ReduceByKey and sorting within partitions

2015-04-27 Thread Saisai Shao
Hi Marco, As I know, the current combineByKey() does not expose the related argument where you could set keyOrdering on the ShuffledRDD. Since ShuffledRDD is package private, if you can get the ShuffledRDD through reflection or another way, the keyOrdering you set will be pushed down to the shuffle. If you

Re: Help with publishing to Kafka from Spark Streaming?

2015-05-02 Thread Saisai Shao
Here is the pull request, you may refer to this: https://github.com/apache/spark/pull/2994 Thanks Jerry 2015-05-01 14:38 GMT+08:00 Pavan Sudheendra pavan0...@gmail.com: Link to the question: http://stackoverflow.com/questions/29974017/spark-kafka-producer-not-serializable-exception

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
IMHO if your data or your algorithm is prone to data skew, I think you have to fix this at the application level; Spark itself cannot overcome this problem (if one key has a large amount of values). You may change your algorithm to choose another shuffle key, something like this to avoid shuffle on

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
talk more about shuffle key or point me to APIs that allow me to change shuffle key. I will try with different keys and see the performance. What is the shuffle key by default ? On Mon, May 4, 2015 at 2:37 PM, Saisai Shao sai.sai.s...@gmail.com wrote: IMHO If your data or your algorithm

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
, Saisai Shao sai.sai.s...@gmail.com wrote: The shuffle key depends on your implementation. I'm not sure if you are familiar with MapReduce: the mapper output is a key-value pair, where the key is the shuffle key for shuffling, and Spark is the same. 2015-05-04 17:31 GMT+08:00 ÐΞ€ρ@Ҝ

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-06 Thread Saisai Shao
Also Kafka has a Hadoop consumer API for doing such things, please refer to http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi 2015-05-06 12:22 GMT+08:00 MrAsanjar . afsan...@gmail.com: why not try https://github.com/linkedin/camus - camus is kafka to HDFS pipeline On

Re: No space left on device??

2015-05-06 Thread Saisai Shao
to make a *skew* data/executor distribution? Best, Yifan LI On 06 May 2015, at 15:13, Saisai Shao sai.sai.s...@gmail.com wrote: I think it depends on your workload and executor distribution, if your workload is evenly distributed without any big data skew, and executors are evenly

Re: No space left on device??

2015-05-06 Thread Saisai Shao
I think you could configure multiple disks through spark.local.dir; the default is /tmp. Anyway, if your intermediate data is larger than the available disk space, you will still meet this issue. spark.local.dir (default /tmp): Directory to use for scratch space in Spark, including map output files and RDDs that get

Re: No space left on device??

2015-05-06 Thread Saisai Shao
mentioned. 2015-05-06 21:09 GMT+08:00 Yifan LI iamyifa...@gmail.com: Thanks, Shao. :-) I am wondering if the spark will rebalance the storage overhead in runtime…since still there is some available space on other nodes. Best, Yifan LI On 06 May 2015, at 14:57, Saisai Shao sai.sai.s

Re: Partition number of Spark Streaming Kafka receiver-based approach

2015-05-18 Thread Saisai Shao
Hi Bill, You don't need to match the number of threads to the number of partitions in the specific topic. For example, if you have 3 partitions in topic1 but only set 2 threads, ideally 1 thread will receive 2 partitions and the other thread the remaining partition; it depends on the scheduling
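A minimal sketch of the receiver-based API being discussed, where the topic map value is the number of receiver threads rather than the number of Kafka partitions; the ZooKeeper quorum, group id and topic name are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-receiver").setMaster("local[4]"), Seconds(5))
    // topic1 may have 3 Kafka partitions; 2 receiver threads simply split them between
    // themselves (one thread gets 2 partitions, the other gets 1).
    val stream = KafkaUtils.createStream(
      ssc, "zk1:2181,zk2:2181", "my-consumer-group", Map("topic1" -> 2))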

Re: Spark Streaming to Kafka

2015-05-19 Thread Saisai Shao
I think here is the PR https://github.com/apache/spark/pull/2994 you could refer to. 2015-05-20 13:41 GMT+08:00 twinkle sachdeva twinkle.sachd...@gmail.com: Hi, As Spark streaming is being nicely integrated with consuming messages from Kafka, so I thought of asking the forum, that is there

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Saisai Shao
I think you could check the YARN NodeManager log or other Spark executor logs to see the details. The exception stacks you listed above are just the symptom, not the cause. Normally there are some situations which will lead to executor loss: 1. Killed by YARN because of memory

Re: Reading Really Big File Stream from HDFS

2015-06-12 Thread Saisai Shao
Using sc.textFile will also read the file from HDFS line by line through an iterator; it doesn't need to fit all into memory, so even if you have a small amount of memory it can still work. 2015-06-12 13:19 GMT+08:00 SLiZn Liu sliznmail...@gmail.com: Hmm, you have a good point. So should I load the file

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
. Spark does not support any state persistence across deployments so this is something we need to handle on our own. Hope that helps. Let me know if not. Thanks! Amit On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Hi, What is your meaning of getting

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Saisai Shao
a...@yelp.com: Thanks, Jerry. That's what I suspected based on the code I looked at. Any pointers on what is needed to build in this support would be great. This is critical to the project we are currently working on. Thanks! On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
Hi, What is your meaning of getting the offsets from the RDD? From my understanding, the offsetRange is a parameter you offered to KafkaRDD, so why do you still want to get back the one you previously set? Thanks Jerry 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com: Congratulations on the

Re: Spark Configuration of spark.worker.cleanup.appDataTtl

2015-06-16 Thread Saisai Shao
I think you have to use 604800 instead of 7 * 24 * 3600; obviously SparkConf will not do the multiplication for you. The exception is quite obvious: Caused by: java.lang.NumberFormatException: For input string: 3 * 24 * 3600 2015-06-16 14:52 GMT+08:00 luohui20...@sina.com: Hi guys: I
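A small sketch of the point above: the property value must already be a literal number of seconds, since Spark parses the string as-is. On a standalone cluster this property is normally passed to the worker JVM (e.g. through SPARK_WORKER_OPTS); the snippet only shows where 604800 comes from:

    // "7 * 24 * 3600" as a config string fails with NumberFormatException;
    // compute the value first and pass the literal result.
    val appDataTtlSeconds: Int = 7 * 24 * 3600   // 604800
    // e.g. -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800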

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread Saisai Shao
It depends on how you use Spark. If you use Spark with YARN and enable dynamic allocation, the number of executors is not fixed; it will change dynamically according to the load. Thanks Jerry 2015-05-27 14:44 GMT+08:00 canan chen ccn...@gmail.com: It seems the executor number is fixed for the

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread Saisai Shao
works ? I mean does it related with parallelism of my RDD and how does driver know how many executor it needs ? On Wed, May 27, 2015 at 2:49 PM, Saisai Shao sai.sai.s...@gmail.com wrote: It depends on how you use Spark, if you use Spark with Yarn and enable dynamic allocation, the number

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread Saisai Shao
a way to do it in Yarn. On Wednesday, May 27, 2015, Saisai Shao sai.sai.s...@gmail.com wrote: The drive has a heuristic mechanism to decide the number of executors in the run-time according the pending tasks. You could enable with configuration, you could refer to spark document to find

Re: Total delay per batch in a CSV file

2015-08-05 Thread Saisai Shao
Hi, Lots of streaming internal state is exposed through StreamingListener, the same as what you see from the web UI, so you could write your own StreamingListener and register it in StreamingContext to get the internal information of Spark Streaming and write it to a CSV file. You could check the source code
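A minimal sketch of such a StreamingListener that appends per-batch delays to a CSV file; the output path is a placeholder and error handling is omitted:

    import java.io.FileWriter
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class CsvBatchListener(path: String) extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        val line = Seq(
          info.batchTime.milliseconds,
          info.schedulingDelay.getOrElse(-1L),
          info.processingDelay.getOrElse(-1L),
          info.totalDelay.getOrElse(-1L)          // total delay per batch
        ).mkString(",")
        val writer = new FileWriter(path, true)   // append one row per completed batch
        try writer.write(line + "\n") finally writer.close()
      }
    }

    // Registration: ssc.addStreamingListener(new CsvBatchListener("/tmp/batch_delays.csv"))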

Re: Spark Streaming - CheckPointing issue

2015-08-05 Thread Saisai Shao
Hi, What Spark version do you use? It looks like a problem of configuration recovery; not sure if it is a Twitter streaming specific problem. I tried Kafka streaming with checkpoint enabled on my local machine and saw no such issue. Did you try to set these configurations somewhere? Thanks Saisai

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Saisai Shao
Hi Muler, Shuffle data will be written to disk, no matter how much memory you have; large memory could alleviate shuffle spill, where temporary files are generated when memory is not enough. Yes, each node writes shuffle data to files, and they are pulled from disk in the reduce stage through the network framework

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Saisai Shao
) then shuffle (in-memory shuffle) spill doesn't happen (per node) but still shuffle data has to be ultimately written to disk so that reduce stage pulls if across network? On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Hi Muler, Shuffle data will be written to disk

Re: How to restart a failed Spark Streaming Application automatically in client mode on YARN

2015-10-22 Thread Saisai Shao
Looks like currently there's no way for Spark Streaming to restart automatically in yarn-client mode, because in yarn-client mode the AM and driver are two processes; YARN only controls the restart of the AM, not the driver, so it is not supported in yarn-client mode. You can write some scripts to monitor

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Saisai Shao
.main(HistoryServer.scala) > > > I went to the lib folder and noticed that > "spark-assembly-1.5.1-hadoop2.6.0.jar" is missing that class. I was able to > get the spark history server started with 1.3.1 but not 1.5.1. Any inputs > on this? > > Really appreciate your help

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Saisai Shao
of time, do I have to manually go and copy > spark-1.5.1 tarbal to all the nodes or is there any alternative so that I > can get it upgraded through Ambari UI ? If possible can anyone point me to > a documentation online? Thank you. > > Regards, > Ajay > > > On Wednesday, October

Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Saisai Shao
http://www.meruvian.org > > "We grow because we share the same belief." > > > On Thu, Oct 22, 2015 at 8:56 AM, Saisai Shao <sai.sai.s...@gmail.com> > wrote: > > How you start history server, do you still use the history server of > 1.3.1, > > or y

Re: YARN Labels

2015-11-16 Thread Saisai Shao
Node labels for the AM are not yet supported in Spark; currently only executor node labels are supported. On Tue, Nov 17, 2015 at 7:57 AM, Ted Yu wrote: > Wangda, YARN committer, told me that support for selecting which nodes the > application master is running on is integrated to the

Re: How to enable MetricsServlet sink in Spark 1.5.0?

2015-11-16 Thread Saisai Shao
It should work. I tested in my local environment with "curl http://localhost:4040/metrics/json/", and there are metrics dumped. For cluster metrics, you have to change your base URL to point to the cluster manager. Thanks Jerry On Mon, Nov 16, 2015 at 5:42 PM, ihavethepotential <

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Saisai Shao
I think for receiver-less streaming connectors like the direct Kafka input stream or the HDFS connector, dynamic allocation could work, compared to other receiver-based streaming connectors, since for receiver-less connectors the behavior of the streaming app is more like a normal Spark app, so dynamic

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Saisai Shao
e micro-batch jobs are likely to use all the > executors all the time, and no executor will remain idle for long. That is > why the heuristic doesnt work that well. > > > On Wed, Nov 11, 2015 at 6:32 PM, Saisai Shao <sai.sai.s...@gmail.com> > wrote: > >> I think for r

Re: Python Kafka support?

2015-11-10 Thread Saisai Shao
Hi Darren, Functionality like messageHandler is missing in python API, still not included in version 1.5.1. Thanks Jerry On Wed, Nov 11, 2015 at 7:37 AM, Darren Govoni wrote: > Hi, > I read on this page >

Re: [Yarn] Executor cores isolation

2015-11-10 Thread Saisai Shao
From my understanding, it depends on whether you enabled CGroup isolation or not in YARN. By default it is not, which means you could allocate one core but spawn a lot of threads in your task to occupy the CPU resource; this is just a logical limitation. For YARN CPU isolation you may refer to this

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
What Spark version are you using, also a small code snippet of how you use Spark Streaming would be greatly helpful. On Fri, Oct 30, 2015 at 3:57 PM, Ramkumar V wrote: > I can able to read and print few lines. Afterthat i'm getting this > exception. Any idea for this ?

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
Set >); > > messages.foreachRDD(new Function<JavaPairRDD,Void> () { > public Void call(JavaPairRDD tuple) { > JavaRDDrdd = tuple.values(); > rdd.saveAsTextFile("hdfs://myuser:8020/user/hdfs/output"); > re

Re: Does the Standalone cluster and Applications need to be same Spark version?

2015-11-03 Thread Saisai Shao
I think it can work unless you use some new APIs that only exist in the 1.5.1 release (mostly this will not happen). You'd better take a try to see if it can run or not. On Tue, Nov 3, 2015 at 10:11 AM, pnpritchard < nicholas.pritch...@falkonry.com> wrote: > The title gives the gist of

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
unstable. On Fri, Oct 30, 2015 at 5:13 PM, Ramkumar V <ramkumar.c...@gmail.com> wrote: > No, i dont have any special settings. if i keep only reading line in my > code, it's throwing NPE. > > *Thanks*, > <https://in.linkedin.com/in/ramkumarcs31> > > > On Fri,

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
ing.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:622) > > > *Thanks*, > <https://in.linkedin.com/in/ramkumarcs31> > > > On Fri, Oct 30, 2015 at 3:25 PM, Saisai Shao <sai.sai.s...@gmail.com> w

Re: Exception while reading from kafka stream

2015-10-30 Thread Saisai Shao
ks*, > <https://in.linkedin.com/in/ramkumarcs31> > > > On Fri, Oct 30, 2015 at 2:50 PM, Saisai Shao <sai.sai.s...@gmail.com> > wrote: > >> I just did a local test with your code, seems everything is fine, the >> only difference is that I use the master branch, but I don'

Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread Saisai Shao
Hi Swetha, Would you mind elaborating your usage scenario of DStream unpersisting? From my understanding: 1. Spark Streaming will automatically unpersist outdated data (you already mentioned the configurations). 2. If the streaming job is started, I think you may lose control of the job,

Re: an problem about zippartition

2015-10-13 Thread Saisai Shao
Maybe you could try "localCheckpoint" instead. On Wednesday, October 14, 2015, 张仪yf1 <zhangyi...@hikvision.com> wrote: > Thank you for your reply. It helped a lot. But when the data became > bigger, the action cost more, is there any optimizer > > > > *From:* Saisai S

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Saisai Shao
You could check the code of KafkaRDD: the locality (host) is taken from Kafka's partition and set in KafkaRDD; this is a hint for Spark to schedule the task on the preferred location. override def getPreferredLocations(thePart: Partition): Seq[String] = { val part =

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Saisai Shao
This preferred locality is a hint to Spark to schedule Kafka tasks on the preferred nodes. If Kafka and Spark are two separate clusters, obviously this locality hint takes no effect, and Spark will schedule tasks following the node-local -> rack-local -> any pattern, like any other Spark tasks. On

Re: Spark_1.5.1_on_HortonWorks

2015-10-20 Thread Saisai Shao
Hi Frans, You could download Spark 1.5.1-hadoop 2.6 pre-built tarball and copy into HDP 2.3 sandbox or master node. Then copy all the conf files from /usr/hdp/current/spark-client/ to your /conf, or you could refer to this tech preview (

Re: localhost webui port

2015-10-13 Thread Saisai Shao
By configuring "spark.ui.port" to a port you can bind. On Tue, Oct 13, 2015 at 8:47 PM, Langston, Jim wrote: > Hi all, > > Is there anyway to change the default port 4040 for the localhost webUI, > unfortunately, that port is blocked and I have no control of
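A minimal sketch; the port number is arbitrary, pick any port that is not blocked:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("custom-ui-port")
      .set("spark.ui.port", "4050")   // driver web UI binds here instead of the default 4040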

Re: an problem about zippartition

2015-10-13 Thread Saisai Shao
You have to call checkpoint regularly on rdd0 to cut the dependency chain, otherwise you will meet the problem you mentioned, and eventually even a stack overflow. This is a classic problem for highly iterative jobs; you could google it for the fix. On Tue, Oct 13, 2015 at 7:09 PM, 张仪yf1
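A minimal sketch of periodic checkpointing in an iterative job, assuming sc.setCheckpointDir(...) has already been called; the per-iteration transformation and the checkpoint interval are illustrative, not the original poster's zipPartitions code:

    import org.apache.spark.rdd.RDD

    def iterate(start: RDD[(String, Int)], iterations: Int): RDD[(String, Int)] = {
      var current = start
      for (i <- 1 to iterations) {
        current = current.mapValues(_ + 1)   // stands in for the per-iteration update
        if (i % 10 == 0) {
          current.checkpoint()               // cut the lineage every few iterations
          current.count()                    // force materialization so the checkpoint happens
        }
      }
      current
    }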

Re: Where is the doc about the spark rest api ?

2015-08-31 Thread Saisai Shao
Here is the REST-related part in Spark ( https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/deploy/rest ). Currently I don't think there's a document addressing this part; also, this REST API is only used by SparkSubmit currently, not a public API as I know. Thanks Jerry

Re: Too many open files issue

2015-09-02 Thread Saisai Shao
Here is the code in which NewHadoopRDD registers the close handler, which is called when the task is completed ( https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L136 ). From my understanding, possibly the reason is that this `foreach` code in your

Re: repartition on direct kafka stream

2015-09-03 Thread Saisai Shao
Yes not the offset ranges, but the real data will be shuffled when you using repartition(). Thanks Saisai On Fri, Sep 4, 2015 at 12:42 PM, Shushant Arora wrote: > 1.Does repartitioning on direct kafka stream shuffles only the offsets or > exact kafka messages across

Re: about mr-style merge sort

2015-09-10 Thread Saisai Shao
Hi Qianhao, I think you could sort the data by yourself if you want to achieve the same result as MR, like rdd.reduceByKey(...).mapPartitions(// sort within each partition). Do not call sortByKey again since it will introduce another shuffle (that's the reason why it is slower than MR). The
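A minimal sketch of that suggestion on a toy RDD; the in-partition sort is done in mapPartitions so no extra shuffle is introduced:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("mr-style-sort").setMaster("local[2]"))
    val sorted = sc.parallelize(Seq(("b", 1), ("a", 2), ("b", 3), ("a", 4)))
      .reduceByKey(_ + _)
      // sort each partition locally; no second shuffle like sortByKey would cause
      .mapPartitions(iter => iter.toArray.sortBy(_._1).iterator, preservesPartitioning = true)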

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Saisai Shao
Is your "KafkaGenericEvent" serializable? Since you call rdd.collect() to fetch the data to the local driver, this KafkaGenericEvent needs to be serialized and deserialized through the Java or Kryo serializer (depending on your configuration); not sure if that is why you always get a default object.

Re: Spark w/YARN Scheduling Questions...

2015-09-17 Thread Saisai Shao
A task set is a set of tasks within one stage. An executor will be killed when it is idle for a period of time (default is 60s). The problem you mentioned is a bug: the scheduler should not allocate tasks to these to-be-killed executors. I think it is fixed in 1.5. Thanks Saisai On Thu, Sep 17, 2015 at

Re: Submitting with --deploy-mode cluster: uploading the jar

2015-09-30 Thread Saisai Shao
As I remembered you don't need to upload application jar manually, Spark will do it for you when you use Spark submit. Would you mind posting out your command of Spark submit? On Wed, Sep 30, 2015 at 3:13 PM, Christophe Schmitz wrote: > Hi there, > > I am trying to use the

Re: Submitting with --deploy-mode cluster: uploading the jar

2015-09-30 Thread Saisai Shao
> > Christophe > > > On Wed, Sep 30, 2015 at 5:19 PM, Saisai Shao <sai.sai.s...@gmail.com> > wrote: > >> As I remembered you don't need to upload application jar manually, Spark >> will do it for you when you use Spark submit. Would you mind posting out >> yo

Re: Spark Streaming checkpoint recovery throws Stack Overflow Error

2015-09-18 Thread Saisai Shao
Hi Swetha, The problem of stack overflow is that when recovering from checkpoint data, Java will use a recursive way to deserialize the call stack, if you have a large call stack, this recursive way can easily lead to stack overflow. This is caused by Java deserialization mechanism, you need to

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread Saisai Shao
k-streaming-kafka-assembly_2.10-1.5.0' > >>> > > > So I launched pyspark with --jars with the assembly jar. Now it is > working. > > THANK YOU for help. > > Curiosity: Why adding it to SPARK CLASSPATH did not work? > > Best > Ayan > > On Wed, Se

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread Saisai Shao
I think you're using the wrong version of the Kafka assembly jar. I think the Python API for the direct Kafka stream is not supported in Spark 1.3.0; you'd better change to version 1.5.0. It looks like you're using Spark 1.5.0, so why did you choose Kafka assembly 1.3.0?

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Saisai Shao
I think you need to increase the memory size of executor through command arguments "--executor-memory", or configuration "spark.executor.memory". Also yarn.scheduler.maximum-allocation-mb in Yarn side if necessary. Thanks Saisai On Mon, Sep 21, 2015 at 5:13 PM, Alexander Pivovarov

Re: set up spark 1.4.1 as default spark engine in HDP 2.2/2.3

2015-12-08 Thread Saisai Shao
Please make sure the spark shell script you're running points to /bin/spark-shell. Just following the instructions to correctly configure your Spark 1.4.1 and executing the correct script is enough. On Wed, Dec 9, 2015 at 11:28 AM, Divya Gehlot wrote: > Hi, > As per

Re: Spark Stream Monitoring with Kafka Direct API

2015-12-09 Thread Saisai Shao
I think this is the right JIRA to fix this issue ( https://issues.apache.org/jira/browse/SPARK-7111). It should be in Spark 1.4. On Thu, Dec 10, 2015 at 12:32 AM, Cody Koeninger wrote: > Looks like probably > > https://issues.apache.org/jira/browse/SPARK-8701 > > so 1.5.0 >

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-20 Thread Saisai Shao
Normally there will be one RDD in each batch. You could refer to the implementation of DStream#getOrCompute. On Mon, Dec 21, 2015 at 11:04 AM, Arun Patel wrote: > It may be simple question...But, I am struggling to understand this > > DStream is a sequence of RDDs

Re: Job Error:Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@130.1.10.108:23600/)

2015-12-25 Thread Saisai Shao
I think SparkContext is thread-safe; you could concurrently submit jobs from different threads, so the problem you hit might not relate to this. Can you reproduce this issue each time you concurrently submit jobs, or does it happen occasionally? BTW, I guess you're using an old version of
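A minimal sketch of submitting jobs concurrently from several threads against one SparkContext; the jobs themselves are trivial placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val sc = new SparkContext(new SparkConf().setAppName("concurrent-jobs").setMaster("local[4]"))
    // Each Future triggers an independent job on the shared, thread-safe SparkContext.
    val jobs = (1 to 4).map { i =>
      Future { sc.parallelize(1 to 1000000).map(_ * i).sum() }
    }
    val results = Await.result(Future.sequence(jobs), 5.minutes)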

Re: Job Error:Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@130.1.10.108:23600/)

2015-12-25 Thread Saisai Shao
might be one potential cause, you'd better increase the vm resource to try again, just to verify your assumption. On Fri, Dec 25, 2015 at 4:28 PM, donhoff_h <165612...@qq.com> wrote: > Hi, Saisai Shao > > Many thanks for your reply. I used spark v1.3. Unfortunately I can not > chang

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread Saisai Shao
ark-1.6.0 on one yarn > cluster? > > > > *From:* Saisai Shao [mailto:sai.sai.s...@gmail.com] > *Sent:* Monday, December 28, 2015 2:29 PM > *To:* Jeff Zhang > *Cc:* 顾亮亮; user@spark.apache.org; 刘骋昺 > *Subject:* Re: Opening Dynamic Scaling Executors on Yarn > > &g

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread Saisai Shao
Replacing all the shuffle jars and restarting the NodeManager is enough; no need to restart the NN. On Mon, Dec 28, 2015 at 2:05 PM, Jeff Zhang wrote: > See > http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation > > > > On Mon, Dec 28, 2015 at 2:00 PM,

Re: Problem About Worker System.out

2015-12-28 Thread Saisai Shao
Stdout will not be sent back to the driver, no matter whether you use Scala or Java. You must have done something wrong that makes you think it is expected behavior. On Mon, Dec 28, 2015 at 5:33 PM, David John wrote: > I have used Spark *1.4* for 6 months. Thanks all the

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Saisai Shao
Yes, basically from the current implementation it should be. On Mon, Dec 21, 2015 at 6:39 PM, Arun Patel <arunp.bigd...@gmail.com> wrote: > So, Does that mean only one RDD is created by all receivers? > > > > On Sun, Dec 20, 2015 at 10:23 PM, Saisai Shao <sai.sai

Re: spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Saisai Shao
Hi Siva, How did you know that --executor-cores is ignored and where did you see that only 1 Vcore is allocated? Thanks Saisai On Tue, Dec 22, 2015 at 9:08 AM, Siva wrote: > Hi Everyone, > > Observing a strange problem while submitting spark streaming job in >

Re: spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Saisai Shao
on web UI. > > Thanks, > Sivakumar Bhavanari. > > On Mon, Dec 21, 2015 at 5:21 PM, Saisai Shao <sai.sai.s...@gmail.com> > wrote: > >> Hi Siva, >> >> How did you know that --executor-cores is ignored and where did you see >> that only 1 Vcore is alloc

Re: Can't run spark on yarn

2015-12-17 Thread Saisai Shao
Please check the YARN AM log to see why the AM failed to start. That's the reason why using `sc` gets such a complaint. On Fri, Dec 18, 2015 at 4:25 AM, Eran Witkon wrote: > Hi, > I am trying to install spark 1.5.2 on Apache hadoop 2.6 and Hive and yarn > > spark-env.sh >

Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-23 Thread Saisai Shao
Hi Tingwen, Would you mind sharing your changes in ExecutorAllocationManager#addExecutors()? From my understanding and test, dynamic allocation can work when you set the min and max number of executors to the same number. Please check your Spark and YARN logs to make sure the executors
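A minimal sketch of a dynamic-allocation configuration on YARN consistent with the reply above; the executor bounds are illustrative, and the external shuffle service must also be set up on the NodeManagers:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("dynamic-allocation-test")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")       // needs the external shuffle service
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "50")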

Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-23 Thread Saisai Shao
at the Spark application did > not requested resource from it. > > Is this a bug? Should I create a JIRA for this problem? > > 2015-11-24 12:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>: > >> OK, so this looks like your Yarn cluster does not allocate containers >>

Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-23 Thread Saisai Shao
r: Not adding executors > because our current target total is already 50 (limit 50)". > Thanks > Weber > > 2015-11-23 21:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>: > >> Hi Tingwen, >> >> Would you minding sharing your changes in >> ExecutorAllocatio

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-24 Thread Saisai Shao
oadcast 0 from >>> broadcast at DAGScheduler.scala:861 >>> >>> 15/11/24 16:16:30 INFO scheduler.DAGScheduler: Submitting 200 missing tasks >>> from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:32) >>> >>> 15/11/24 16:16:30 INFO cl

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-24 Thread Saisai Shao
" equals 50, but in > http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation > it says that > > "spark.dynamicAllocation.initialExecutors" equals " > spark.dynamicAllocation.minExecutors". So, I think something was wrong, > did it? > > Tha

Re: Map tuple to case class in Dataset

2016-05-31 Thread Saisai Shao
It works fine in my local test, I'm using latest master, maybe this bug is already fixed. On Wed, Jun 1, 2016 at 7:29 AM, Michael Armbrust wrote: > Version of Spark? What is the exception? > > On Tue, May 31, 2016 at 4:17 PM, Tim Gautier > wrote:

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Saisai Shao
spark.yarn.jar (none) The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it

YARN Application Timeline service with Spark 2.0.0 issue

2016-06-17 Thread Saisai Shao
Hi Community, In Spark 2.0.0 we upgrade to use jersey2 ( https://issues.apache.org/jira/browse/SPARK-12154) instead of jersey 1.9, while for the whole Hadoop we still stick on the old version. This will bring in some issues when yarn timeline service is enabled (

Re: problem running spark with yarn-client not using spark-submit

2016-06-26 Thread Saisai Shao
It means several jars are missing in the YARN container environment. If you want to submit your application through some way other than spark-submit, you have to take care of all the environment things yourself. Since we don't know your implementation of the Java web service, it is hard to provide

Re: streaming textFileStream problem - got only ONE line

2016-01-26 Thread Saisai Shao
Any possibility that this file is still being written by another application, so that what Spark Streaming processed is an incomplete file? On Tue, Jan 26, 2016 at 5:30 AM, Shixiong(Ryan) Zhu wrote: > Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or > write

Re: How data locality is honored when spark is running on yarn

2016-01-27 Thread Saisai Shao
Hi Todd, There're two levels of locality-based scheduling when you run Spark on YARN with dynamic allocation enabled: 1. Container allocation is based on the locality ratio of pending tasks; this is YARN specific and only works with dynamic allocation enabled. 2. Task scheduling is locality
