Re: How to give multiple directories as input ?

2015-05-27 Thread ayan guha
What about /blah/*/blah/out*.avro? On 27 May 2015 18:08, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am doing that now. Is there no other way ? On Wed, May 27, 2015 at 12:40 PM, Akhil Das ak...@sigmoidanalytics.com wrote: How about creating two and union [ sc.union(first, second) ] them?

Re: How to give multiple directories as input ?

2015-05-27 Thread Eugen Cepoi
Try something like that: def readGenericRecords(sc: SparkContext, inputDir: String, startDate: Date, endDate: Date) = { // assuming a list of paths val paths: Seq[String] = getInputPaths(inputDir, startDate, endDate) val job = Job.getInstance(new
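
Rounding out that truncated snippet: a minimal sketch, assuming getInputPaths (from the thread) returns the list of directories and that the Avro classes come from avro-mapred/avro-mapreduce. Each path is added to a Hadoop Job, and the Job's configuration is handed to newAPIHadoopRDD.

    import java.util.Date
    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.spark.SparkContext

    def readGenericRecords(sc: SparkContext, inputDir: String, startDate: Date, endDate: Date) = {
      // assumed helper from the thread that expands the date range into directories
      val paths: Seq[String] = getInputPaths(inputDir, startDate, endDate)
      val job = Job.getInstance(sc.hadoopConfiguration)
      paths.foreach(p => FileInputFormat.addInputPath(job, new Path(p))) // one call per input directory
      sc.newAPIHadoopRDD(
        job.getConfiguration,
        classOf[AvroKeyInputFormat[GenericRecord]],
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable])
    }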

Re: How many executors can I acquire in standalone mode ?

2015-05-27 Thread ayan guha
You can request number of cores and amount of memory for each executor. On 27 May 2015 18:25, canan chen ccn...@gmail.com wrote: Thanks Arush. My scenario is that In standalone mode, if I have one worker, when I start spark-shell, there will be one executor launched. But if I have 2 workers,
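
For reference, a sketch of the two standalone-mode knobs being described (the values are made up): spark.executor.memory sets the memory per executor and spark.cores.max caps the total cores the application takes across the cluster.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("standalone-resource-demo")
      .set("spark.executor.memory", "4g") // memory given to each executor
      .set("spark.cores.max", "8")        // total cores this application may use across all workers
    val sc = new SparkContext(conf)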

Re: How to give multiple directories as input ?

2015-05-27 Thread ๏̯͡๏
def readGenericRecords(sc: SparkContext, inputDir: String, startDate: Date, endDate: Date) = { val path = getInputPaths(inputDir, startDate, endDate) sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]]("/A/B/C/D/D/2015/05/22/out-r-*.avro") }

Re: How many executors can I acquire in standalone mode ?

2015-05-27 Thread canan chen
Thanks Arush. My scenario is that In standalone mode, if I have one worker, when I start spark-shell, there will be one executor launched. But if I have 2 workers, there will be 2 executors launched, so I am wondering the mechanism of executor allocation. Is it possible to specify how many

Re: DataFrame. Conditional aggregation

2015-05-27 Thread ayan guha
Till 1.3, you have to prepare the DF appropriately def setupCondition(t): if t[1] > 100: v = 1 else: v = 0 return Row(col1=t[0],col2=t[1],col3=t[2],col4=v) d1=[[1001,100,50],[1001,200,100],[1002,100,99]] d1RDD = sc.parallelize(d1).map(setupCondition) d1DF =

Avro CombineInputFormat ?

2015-05-27 Thread ๏̯͡๏
Can someone share me some code to use CombineInputFormat to read avro data. Today I use def readGenericRecords(sc: SparkContext, inputDir: String, startDate: Date, endDate: Date) = { val path = getInputPaths(inputDir, startDate, endDate) sc.newAPIHadoopFile[AvroKey[GenericRecord],

Re: Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-27 Thread luohui20001
Hi Akhil and all, My previous code has some problems: all the executors are looping and running the same command. That's not what I am expecting. Previous code: val shellcompare = List("run", "sort.sh") val shellcompareRDD = sc.makeRDD(shellcompare) val result = List("aggregate", "result")

Re: How to give multiple directories as input ?

2015-05-27 Thread Eugen Cepoi
You can do that using FileInputFormat.addInputPath 2015-05-27 10:41 GMT+02:00 ayan guha guha.a...@gmail.com: What about /blah/*/blah/out*.avro? On 27 May 2015 18:08, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am doing that now. Is there no other way ? On Wed, May 27, 2015 at 12:40 PM,

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
No one using History server? :) Am I the only one who needs to see all users' logs? Jianshi On Thu, May 21, 2015 at 1:29 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I'm using Spark 1.4.0-rc1 and I'm using default settings for history server. But I can only see my own logs. Is it

[POWERED BY] Please add our organization

2015-05-27 Thread Antonio Giambanco
Name: Objectway.com URL: www.objectway.com Description: We're building a Big Data solution on Spark. We use Apache Flume for the parallel message queuing infrastructure and Apache Spark Streaming for near real-time data stream processing, combined with a rule engine for catching complex events.

RE: How does spark manage the memory of executor with multiple tasks

2015-05-27 Thread java8964
Like you, there are lots of people coming from the MapReduce world trying to understand the internals of Spark. Hope the below can help you in some way. For the end users, they only have the concept of a Job: I want to run a word count job on this one big file, that is the job I want to run. How many

Re: Spark and Stanford CoreNLP

2015-05-27 Thread mathewvinoj
Evan, could you please look into this post? Below is the link. Any thoughts or suggestions are really appreciated. http://apache-spark-user-list.1001560.n3.nabble.com/Spark-partition-issue-with-Stanford-NLP-td23048.html -- View this message in context:

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Stephen Boesch
Thanks Yana, My current experience here is after running some small spark-submit based tests the Master once again stopped being reachable. No change in the test setup. I restarted Master/Worker and still not reachable. What might be the variables here in which association with the

Invoking Hive UDF programmatically

2015-05-27 Thread Punyashloka Biswal
Dear Spark users, Given a DataFrame df with a column named "foo bar", I can call a Spark SQL built-in function on it like so: df.select(functions.max(df("foo bar"))) However, if I want to apply a Hive UDF named myCustomFunction, I need to write df.selectExpr("myCustomFunction(`foo bar`)") which

Re: Question about Serialization in Storage Level

2015-05-27 Thread Imran Rashid
Hi Zhipeng, yes, your understanding is correct. The SER portion just refers to how it's stored in memory. On disk, the data always has to be serialized. On Fri, May 22, 2015 at 10:40 PM, Jiang, Zhipeng zhipeng.ji...@intel.com wrote: Hi Todd, Howard, Thanks for your reply, I might not

Re: View all user's application logs in history server

2015-05-27 Thread Marcelo Vanzin
You may be the only one not seeing all the logs. Are you sure all the users are writing to the same log directory? The HS can only read from a single log directory. On Wed, May 27, 2015 at 5:33 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: No one using History server? :) Am I the only one

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
Yes, all written to the same directory on HDFS. Jianshi On Wed, May 27, 2015 at 11:57 PM, Marcelo Vanzin van...@cloudera.com wrote: You may be the only one not seeing all the logs. Are you sure all the users are writing to the same log directory? The HS can only read from a single log

Re: View all user's application logs in history server

2015-05-27 Thread Marcelo Vanzin
Then: - Are all files readable by the user running the history server? - Did all applications call sc.stop() correctly (i.e. files do not have the .inprogress suffix)? Other than that, always look at the logs first, looking for any errors that may be thrown. On Wed, May 27, 2015 at 9:10 AM,

Re: hive external metastore connection timeout

2015-05-27 Thread Yana Kadiyska
I have not run into this particular issue but I'm not using latest bits in production. However, testing your theory should be easy -- MySQL is just a database, so you should be able to use a regular mysql client and see how many connections are active. You can then compare to the maximum allowed

Multilabel classification using logistic regression

2015-05-27 Thread peterg
Hi all I believe I have created a multi-label classifier using LogisticRegression but there is one snag. No matter what features I use to get the prediction, it will always return the label. I feel like I need to set a threshold but can't seem to figure out how to do that. I attached the code

Re: Adding columns to DataFrame

2015-05-27 Thread Masf
Hi. I think that it's possible to do: df.select($"*", lit(null).as("col17"), lit(null).as("col18"), lit(null).as("col19"), ..., lit(null).as("col26")) Any other advice? Miguel. On Wed, May 27, 2015 at 5:02 PM, Masf masfwo...@gmail.com wrote: Hi. I have a DataFrame with 16 columns (df1) and
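
A sketch of the same idea without typing every column by hand, assuming the extra columns of df2 can simply be null-padded in df1 (StringType is a placeholder; use each column's real type):

    import org.apache.spark.sql.functions.{col, lit}
    import org.apache.spark.sql.types.StringType

    // columns present in df2 (26 cols) but missing from df1 (16 cols)
    val missing = df2.columns.filterNot(df1.columns.contains)
    // pad df1 with nulls for those columns, align the column order to df2, then unionAll
    val padded  = df1.select(df1.columns.map(col) ++ missing.map(c => lit(null).cast(StringType).as(c)): _*)
    val unioned = padded.select(df2.columns.map(col): _*).unionAll(df2)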

Spark and logging

2015-05-27 Thread dgoldenberg
I'm wondering how logging works in Spark. I see that there's the log4j.properties.template file in the conf directory. Safe to assume Spark is using log4j 1? What's the approach if we're using log4j 2? I've got a log4j2.xml file in the job jar which seems to be working for my log statements

How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread SparknewUser
I'm new to Spark and I'm getting bad performance with classification methods on Spark MLlib (worse than R in terms of AUC). I am trying to put my own parameters rather than the default parameters. Here is the method I want to use: train(RDD<LabeledPoint> input, int numIterations,

Adding columns to DataFrame

2015-05-27 Thread Masf
Hi. I have a DataFrame with 16 columns (df1) and another with 26 columns (df2). I want to do a UnionAll. So, I want to add 10 columns to df1 in order to have the same number of columns in both dataframes. Is there some alternative to withColumn? Thanks -- Regards. Miguel Ángel

Re: Spark and logging

2015-05-27 Thread Imran Rashid
only an answer to one of your questions: What about log statements in the partition processing functions? Will their log statements get logged into a file residing on a given 'slave' machine, or will Spark capture this log output and divert it into the log file of the driver's machine?
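
A small illustration of that point, assuming plain log4j 1.x and some RDD rdd: the logger below is created inside the closure, so it is instantiated on each executor and its output ends up in that executor's logs, not the driver's.

    import org.apache.log4j.Logger

    rdd.foreachPartition { partition =>
      // created on the executor; also avoids having to serialize a logger from the driver
      val log = Logger.getLogger("partition-processor")
      partition.foreach(record => log.info(s"processing record: $record"))
    }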

Where does partitioning and data loading happen?

2015-05-27 Thread Stephen Carman
A colleague and I were having a discussion and we were disagreeing about something in Spark/Mesos that perhaps someone can shed some light on. We have a Mesos cluster that runs Spark via a sparkHome, rather than downloading an executable and such. My colleague says that say we have parquet

hive external metastore connection timeout

2015-05-27 Thread jamborta
Hi all, I am setting up a Spark standalone server with an external Hive metastore (using MySQL). There is an issue that after 5 minutes of inactivity, if I try to reconnect to the metastore (i.e. by executing a new query), it hangs for about 10 mins then times out. My guess is that datanucleus does

Re: Spark partition issue with Stanford NLP

2015-05-27 Thread vishalvibhandik
I am facing a similar issue. When the job runs with partitions > num of cores, sometimes the executors are getting lost and the job doesn't complete. Is there any additional logging that can be turned on to see the exact cause of this issue? -- View this message in context:

Re: Spark partition issue with Stanford NLP

2015-05-27 Thread mathewvinoj
Hi There, I am trying to process millions of records with Spark/Scala integrated with Stanford NLP (3.4.1). Since I am using social media data I have to use NLP for the theme generation (POS tagging) and sentiment calculation. I have to deal with Twitter data and non-Twitter data separately. So I

RDD boundaries and triggering processing using tags in the data

2015-05-27 Thread David Webber
Hi All, I'm new to Spark and I'd like some help understanding if a particular use case would be a good fit for Spark Streaming. I have an imaginary stream of sensor data consisting of integers 1-10. Every time the sensor reads 10 I'd like to average all the numbers that were received since the

Re: Multilabel classification using logistic regression

2015-05-27 Thread Joseph Bradley
It looks like you are training each model i (for label i) by only using data with label i. You need to use all of your data to train each model so the models can compare each label i with the other labels (roughly speaking). However, what you're doing is multiclass (not multilabel)
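
To make the "use all of your data for each label" point concrete, a rough one-vs-rest sketch (not code from the thread; the data layout and names are assumptions):

    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // `data` pairs each feature vector with the set of labels attached to that example
    def trainOneVsRest(data: RDD[(Vector, Set[Int])], labels: Seq[Int]): Map[Int, LogisticRegressionModel] =
      labels.map { label =>
        // every example participates; the target is 1.0 when it carries this label, 0.0 otherwise
        val binary = data.map { case (features, ls) =>
          LabeledPoint(if (ls.contains(label)) 1.0 else 0.0, features)
        }
        val model = new LogisticRegressionWithLBFGS().run(binary)
        model.clearThreshold() // keep raw scores so a per-label threshold can be tuned afterwards
        label -> model
      }.toMap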

Re: debug jsonRDD problem?

2015-05-27 Thread Ted Yu
Can you tell us a bit more about (the schema of) your JSON? You can find sample JSON in sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala Cheers On Wed, May 27, 2015 at 12:33 PM, Michael Stone mst...@mathom.us wrote: Can anyone provide some suggestions on how to debug this?

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Stephen Boesch
Here is an example after git cloning the latest 1.4.0-SNAPSHOT. The first 3 runs (FINISHED) were successful and connected quickly. The fourth run (ALIVE) is failing on connection/association. URL: spark://mellyrn.local:7077 REST URL: spark://mellyrn.local:6066 (cluster mode) Workers: 1 Cores: 8 Total,

debug jsonRDD problem?

2015-05-27 Thread Michael Stone
Can anyone provide some suggestions on how to debug this? Using Spark 1.3.1. The JSON itself seems to be valid (other programs can parse it) and the problem seems to lie in jsonRDD trying to infer a schema to use. scala> sqlContext.jsonRDD(rdd).count() java.util.NoSuchElementException:

Re: debug jsonRDD problem?

2015-05-27 Thread Ted Yu
Looks like the exception was caused by resolved.get(prefix ++ a) returning None: a => StructField(a.head, resolved.get(prefix ++ a).get, nullable = true) There are three occurrences of resolved.get() in createSchema() - None should be better handled in these places. My two cents. On

Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread Joseph Bradley
The model is learned using an iterative convex optimization algorithm. numIterations, stepSize and miniBatchFraction are for those; you can see details here: http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
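
For reference, the overload under discussion can be called roughly like this; the numbers are placeholders and normally need a grid search rather than guessing:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def fit(training: RDD[LabeledPoint]) = {
      val numIterations     = 200
      val stepSize          = 0.1
      val miniBatchFraction = 1.0
      val model = LogisticRegressionWithSGD.train(training, numIterations, stepSize, miniBatchFraction)
      model.clearThreshold() // emit probabilities instead of hard 0/1 labels, which is what AUC needs
      model
    }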

Re: debug jsonRDD problem?

2015-05-27 Thread Michael Stone
On Wed, May 27, 2015 at 01:13:43PM -0700, Ted Yu wrote: Can you tell us a bit more about (schema of) your JSON ? It's fairly simple, consisting of 22 fields with values that are mostly strings or integers, except that some of the fields are objects with http header/value pairs. I'd guess it's
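
One crude way to narrow down which record breaks schema inference, assuming the input is an RDD[String] named rdd and an existing sqlContext: index the records once and test jsonRDD on half-ranges until the offending record is isolated. Just a bisection idea, not a known fix.

    import scala.util.Try

    val indexed = rdd.zipWithIndex().map(_.swap).cache() // (index, jsonString)

    // returns true if schema inference succeeds on records [lo, hi)
    def inferenceSucceeds(lo: Long, hi: Long): Boolean =
      Try(sqlContext.jsonRDD(indexed.filter { case (i, _) => i >= lo && i < hi }.values).count()).isSuccess

    // e.g. start by testing each half of the data and keep halving the failing side
    val n = indexed.count()
    println((inferenceSucceeds(0, n / 2), inferenceSucceeds(n / 2, n)))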

SF / East Bay Area Stream Processing Meetup next Thursday (6/4)

2015-05-27 Thread Siva Jagadeesan
http://www.meetup.com/Bay-Area-Stream-Processing/events/219086133/ Thursday, June 4, 2015 6:45 PM TubeMogul http://maps.google.com/maps?f=qhl=enq=1250+53rd%2C+Emeryville%2C+CA%2C+94608%2C+us 1250 53rd St #1 Emeryville, CA 6:45PM to 7:00PM - Socializing 7:00PM to 8:00PM - Talks 8:00PM to

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Akhil Das
After submitting the job, if you do a ps aux | grep spark-submit then you can see all the JVM params. Are you using the high-level consumer (receiver-based) for receiving data from Kafka? In that case, if your throughput is high and the processing delay exceeds the batch interval then you will hit this

Re: How to give multiple directories as input ?

2015-05-27 Thread Akhil Das
How about creating two and union [ sc.union(first, second) ] them? Thanks Best Regards On Wed, May 27, 2015 at 11:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have this piece sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](
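
Spelled out for the Avro case in this thread, the union suggestion looks roughly like this (paths are placeholders, sc is an existing SparkContext):

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    def readDir(path: String) =
      sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](path)

    val first  = readDir("/data/2015/05/21/out-r-*.avro") // placeholder paths
    val second = readDir("/data/2015/05/22/out-r-*.avro")
    val both   = sc.union(first, second)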

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread Saisai Shao
It depends on how you use Spark. If you use Spark with YARN and enable dynamic allocation, the number of executors is not fixed; it will change dynamically according to the load. Thanks Jerry 2015-05-27 14:44 GMT+08:00 canan chen ccn...@gmail.com: It seems the executor number is fixed for the

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread canan chen
How does the dynamic allocation work? I mean, is it related to the parallelism of my RDD, and how does the driver know how many executors it needs? On Wed, May 27, 2015 at 2:49 PM, Saisai Shao sai.sai.s...@gmail.com wrote: It depends on how you use Spark, if you use Spark with Yarn and enable

Re: DataFrame. Conditional aggregation

2015-05-27 Thread Masf
Yes. I think that your SQL solution will work but I was looking for a solution with the DataFrame API. I had thought to use a UDF such as: val myFunc = udf { (input: Int) => if (input > 100) 1 else 0 } Although I'd like to know if it's possible to do it directly in the aggregation, inserting a lambda
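
A sketch of that idea against the DataFrame API, with made-up column names: flag the rows above the threshold with a UDF and sum the flag per group.

    import org.apache.spark.sql.functions.{sum, udf}

    val overThreshold = udf { (v: Int) => if (v > 100) 1 else 0 }
    // per col1, count how many rows have col2 > 100
    val agg = df.groupBy("col1").agg(sum(overThreshold(df("col2"))).as("cnt_over_100"))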

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Tathagata Das
You can throttle the no receiver direct Kafka stream using spark.streaming.kafka.maxRatePerPartition http://spark.apache.org/docs/latest/configuration.html#spark-streaming On Wed, May 27, 2015 at 4:34 PM, Ted Yu yuzhih...@gmail.com wrote: Have you seen
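
For completeness, the property goes into the Spark conf like any other setting; the value below is arbitrary.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-direct-throttled")
      // upper bound, in records per second, read from each Kafka partition by the direct stream
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")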

Re: How does spark manage the memory of executor with multiple tasks

2015-05-27 Thread canan chen
Thanks Yong, this is very helpful. And I found ShuffleMemoryManager, which is used to allocate memory across tasks in one executor. These 2 tasks have to share the 2G heap memory. I don't think specifying the memory per task is a good idea, as tasks run at the thread level, and memory only

Re: SparkR Jobs Hanging in collectPartitions

2015-05-27 Thread Shivaram Venkataraman
Could you try to see which phase is causing the hang ? i.e. If you do a count() after flatMap does that work correctly ? My guess is that the hang is somehow related to data not fitting in the R process memory but its hard to say without more diagnostic information. Thanks Shivaram On Tue, May

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Dmitry Goldenberg
Got it, thank you, Tathagata and Ted. Could you comment on my other question http://apache-spark-user-list.1001560.n3.nabble.com/Autoscaling-Spark-cluster-based-on-topic-sizes-rate-of-growth-in-Kafka-or-Spark-s-metrics-tt23062.html as well? Basically, I'm trying to get a handle on a good

Value for SPARK_EXECUTOR_CORES

2015-05-27 Thread Mulugeta Mammo
My executor has the following spec (lscpu): CPU(s): 16 Core(s) per socket: 4 Socket(s): 2 Thread(s) per core: 2 The CPU count is obviously 4*2*2 = 16. My question is what value is Spark expecting in SPARK_EXECUTOR_CORES? The CPU count (16) or total # of cores (2 * 2 = 4)? Thanks

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Yana Kadiyska
What does your master log say -- normally the master should NEVER shut down...you should be able to spark-submit to infinity with no issues...So the question about high variance on upstart is one issue, but the other thing that's puzzling to me is why your master is ever down to begin

Re: Model weights of linear regression becomes abnormal values

2015-05-27 Thread Maheshakya Wijewardena
Thanks for the information. I'll try that out with Spark 1.4. On Thu, May 28, 2015 at 9:54 AM, DB Tsai dbt...@dbtsai.com wrote: LinearRegressionWithSGD requires to tune the step size and # of iteration very carefully. Please try Linear Regression with elastic net implementation in Spark 1.4

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Ted Yu
Have you seen http://stackoverflow.com/questions/29051579/pausing-throttling-spark-spark-streaming-application ? Cheers On Wed, May 27, 2015 at 4:11 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, With the no receivers approach to streaming from Kafka, is there a way to set something

Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread ayan guha
Yes, you are on the right path. The only thing to remember is to place hive-site.xml in the correct path so Spark can talk to the Hive metastore. Best Ayan On 28 May 2015 10:53, Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote: hey guys On the Hive/Hadoop ecosystem we have using Cloudera
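
With hive-site.xml on the classpath, a minimal sketch looks like this (database and table names are invented; sc is an existing SparkContext):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // tables already registered in the external metastore can be joined directly in SQL
    val joined = hiveContext.sql(
      """SELECT a.id, b.name
        |FROM mydb.table_a a
        |JOIN mydb.table_b b ON a.id = b.id""".stripMargin)
    joined.show()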

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread Ted Yu
bq. detect the presence of a new node and start utilizing it My understanding is that Spark is concerned with managing executors. Whether request for an executor is fulfilled on an existing node or a new node is up to the underlying cluster manager (YARN e.g.). Assuming the cluster is single

Re: Model weights of linear regression becomes abnormal values

2015-05-27 Thread 吴明瑜
Sorry. I mean the parameter step. 2015-05-28 12:21 GMT+08:00 Maheshakya Wijewardena mahesha...@wso2.com: What is the parameter for the learning rate alpha? LinearRegressionWithSGD has only following parameters. @param data: The training data. @param iterations:The

RE: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Cheng, Hao
Yes, but be sure you put the hive-site.xml under your class path. Did you run into any problem? Cheng Hao From: Sanjay Subramanian [mailto:sanjaysubraman...@yahoo.com.INVALID] Sent: Thursday, May 28, 2015 8:53 AM To: user Subject: Pointing SparkSQL to existing Hive Metadata with data file locations in

Fwd: Model weights of linear regression becomes abnormal values

2015-05-27 Thread Maheshakya Wijewardena
Hi, I'm trying to use Spark's *LinearRegressionWithSGD* in PySpark with the attached dataset. The code is attached. When I check the model weights vector after training, it contains `nan` values. [nan,nan,nan,nan,nan,nan,nan,nan] But for some data sets, this problem does not occur. What might

Re: Model weights of linear regression becomes abnormal values

2015-05-27 Thread DB Tsai
LinearRegressionWithSGD requires tuning the step size and # of iterations very carefully. Please try the Linear Regression with elastic net implementation in Spark 1.4 in the ML framework, which uses a quasi-Newton method, and the step size will be determined automatically. That implementation also matches the
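
A rough sketch of the Spark 1.4 spark.ml path being suggested, assuming a DataFrame named training with the usual "label" and "features" columns:

    import org.apache.spark.ml.regression.LinearRegression

    val lr = new LinearRegression()
      .setMaxIter(100)
      .setRegParam(0.1)        // overall regularization strength
      .setElasticNetParam(0.5) // 0.0 = pure L2, 1.0 = pure L1
    val model = lr.fit(training)
    println(model.weights)     // step size is chosen internally by the optimizer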

Re: Multilabel classification using logistic regression

2015-05-27 Thread Peter Garbers
Hi Joseph, I looked at that, but it seems that LogisticRegressionWithLBFGS's run method takes RDD[LabeledPoint] objects, so I'm not sure it's exactly how one would use it in the way I think you're describing. On Wed, May 27, 2015 at 4:04 PM, Joseph Bradley jos...@databricks.com wrote: It looks

Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Sanjay Subramanian
hey guys On the Hive/Hadoop ecosystem we have (using the Cloudera distribution CDH 5.2.x), there are about 300+ Hive tables. The data is stored as text (moving slowly to Parquet) on HDFS. I want to use SparkSQL, point to the Hive metadata, and be able to define JOINs etc. using a programming

RE: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread Rajesh_Kalluri
Did you check https://drive.google.com/file/d/0B7tmGAdbfMI2OXl6azYySk5iTGM/edit and http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation Not sure if the spark kafka receiver emits metrics on the lag, check this link out

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Ji ZHANG
Hi, Yes, I'm using createStream, but the storageLevel param is by default MEMORY_AND_DISK_SER_2. Besides, the driver's memory is also growing. I don't think Kafka messages will be cached in driver. On Thu, May 28, 2015 at 12:24 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Are you using the

Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread dgoldenberg
Hi, With the no receivers approach to streaming from Kafka, is there a way to set something like spark.streaming.receiver.maxRate so as not to overwhelm the Spark consumers? What would be some of the ways to throttle the streamed messages so that the consumers don't run out of memory? --

Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread dgoldenberg
Hi, I'm trying to understand if there are design patterns for autoscaling Spark (add/remove slave machines to the cluster) based on the throughput. Assuming we can throttle Spark consumers, the respective Kafka topics we stream data from would start growing. What are some of the ways to

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread Saisai Shao
The driver has a heuristic mechanism to decide the number of executors at run time according to the pending tasks. You can enable it with configuration; you can refer to the Spark documentation to find the details. 2015-05-27 15:00 GMT+08:00 canan chen ccn...@gmail.com: How does the dynamic allocation
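
The relevant settings, roughly (values are examples; the external shuffle service also has to be running on the nodes):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")      // so shuffle data outlives removed executors
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")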

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread DB Tsai
With Mesos, how do we control the number of executors? In our cluster, each node only has one executor with a very big JVM. Sometimes, if the executor dies, all the concurrently running tasks will be gone. We would like to have multiple executors in one node but cannot figure out a way to do it in

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread DB Tsai
Typo. We cannot figure out a way to increase the number of executors in one node in Mesos. On Wednesday, May 27, 2015, DB Tsai dbt...@dbtsai.com wrote: If with mesos, how do we control the number of executors? In our cluster, each node only has one executor with very big JVM. Sometimes, if the

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Ji ZHANG
Hi Akhil, Thanks for your reply. According to the Streaming tab of the Web UI, the Processing Time is around 400ms, and there's no Scheduling Delay, so I suppose it's not the Kafka messages that eat up the off-heap memory. Or maybe it is, but how to tell? I googled how to check the off-heap

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread Saisai Shao
I'm not sure about Mesos, maybe someone has the Mesos experience can help answer this. Thanks Jerry 2015-05-27 15:21 GMT+08:00 DB Tsai dbt...@dbtsai.com: Typo. We can not figure a way to increase the number of executor in one node in mesos. On Wednesday, May 27, 2015, DB Tsai

Re: How does spark manage the memory of executor with multiple tasks

2015-05-27 Thread canan chen
Can anyone answer my question? I am curious to know, if there are multiple reducer tasks in one executor, how memory is allocated between these reducer tasks, since each shuffle will consume a lot of memory. On Tue, May 26, 2015 at 7:27 PM, Evo Eftimov evo.efti...@isecc.com wrote: the link

Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-05-27 Thread Justin Yip
Hello, I am trying out 1.4.0 and notice there are some differences in behavior with Timestamp between 1.3.1 and 1.4.0. In 1.3.1, I can compare a Timestamp with a string. scala> val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf("2015-01-01 00:00:00")), (2, Timestamp.valueOf("2014-01-01
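
One workaround worth trying on both versions is to make the comparison explicit instead of relying on string-to-timestamp coercion; the column name below is a guess, and I'm assuming lit accepts a java.sql.Timestamp here.

    import java.sql.Timestamp
    import org.apache.spark.sql.functions.lit

    // compare against an actual Timestamp value rather than a bare string literal
    val recent = df.filter(df("ts") >= lit(Timestamp.valueOf("2015-01-01 00:00:00")))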

Re: How to give multiple directories as input ?

2015-05-27 Thread ๏̯͡๏
I am doing that now. Is there no other way ? On Wed, May 27, 2015 at 12:40 PM, Akhil Das ak...@sigmoidanalytics.com wrote: How about creating two and union [ sc.union(first, second) ] them? Thanks Best Regards On Wed, May 27, 2015 at 11:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I