Re: Executor metrics in spark application

2014-07-22 Thread Denes
I'm also pretty interested in how to create custom Sinks in Spark. I'm using it with Ganglia and the normal metrics from the JVM source do show up. I tried to create my own metric based on Issac's code, but it does not show up in Ganglia. Does anyone know where the problem is? Here's the code snippet:

Re: Executor metrics in spark application

2014-07-22 Thread Denes
I meant custom Sources, sorry.

RE: Executor metrics in spark application

2014-07-22 Thread Shao, Saisai
Hi Denes, I think you can register your customized metrics source with the metrics system through metrics.properties; you can take metrics.properties.template as a reference. Basically you can do as follows if you want to monitor on the executor:
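A minimal sketch of the kind of entry meant here, following the [instance].source.[name].[options] syntax from metrics.properties.template (the custom class name below is a hypothetical placeholder):

    # enable the bundled JVM source for every instance
    *.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    # register a user-defined source on executors only (class name is a placeholder)
    executor.source.myapp.class=com.example.metrics.MyAppSource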

number of Cached Partitions v.s. Total Partitions

2014-07-22 Thread Haopu Wang
Hi, I'm using local mode and read a text file as RDD using JavaSparkContext.textFile() API. And then call cache() method on the result RDD. I look at the Storage information and find the RDD has 3 partitions but 2 of them have been cached. Is this a normal behavior? I assume all of

RE: number of Cached Partitions v.s. Total Partitions

2014-07-22 Thread Shao, Saisai
Yes, it's normal when there is not enough memory to hold the third partition, as you can see in your attached picture. Thanks, Jerry. From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Tuesday, July 22, 2014 3:09 PM To: user@spark.apache.org Subject: number of Cached Partitions v.s. Total Partitions
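If all partitions need to survive, a small sketch (not from the thread) of persisting with a storage level that spills to disk instead of dropping blocks:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("input.txt")             // path is a placeholder
    rdd.persist(StorageLevel.MEMORY_AND_DISK)      // partitions that do not fit in memory spill to disk
    rdd.count()                                    // force evaluation so the Storage tab reflects all partitions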

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-22 Thread Victor Sheng
Hi Yin Huai, I tested again with your snippet code. It works well in spark-1.0.1. Here is my code: val sqlContext = new org.apache.spark.sql.SQLContext(sc) case class Record(data_date: String, mobile: String, create_time: String) val mobile = Record("2014-07-20", "1234567", "2014-07-19")

Re: Job aborted due to stage failure: TID x failed for unknown reasons

2014-07-22 Thread Alessandro Lulli
Hi All, Can someone help on this? I'm encountering exactly the same issue in a very similar scenario with the same spark version. Thanks Alessandro On Fri, Jul 18, 2014 at 8:30 PM, Shannon Quinn squ...@gatech.edu wrote: Hi all, I'm dealing with some strange error messages that I *think*

Re: Why spark-submit command hangs?

2014-07-22 Thread Earthson
I've just had the same problem. I'm using: $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client $JOBJAR --class $JOBCLASS It's really strange, because the log shows: 14/07/22 16:16:58 INFO ui.SparkUI: Started SparkUI at http://k1227.mzhen.cn:4040 14/07/22 16:16:58

Re: gain access to persisted rdd

2014-07-22 Thread mrm
Ok, thanks for the answers. Unfortunately, there is no sc.getPersistentRDDs for pyspark.

Re: Why spark-submit command hangs?

2014-07-22 Thread Earthson
That's what my problem is :)

Re: saveAsSequenceFile for DStream

2014-07-22 Thread Sean Owen
What about simply: dstream.foreachRDD(_.saveAsSequenceFile(...)) ? On Tue, Jul 22, 2014 at 2:06 AM, Barnaby bfa...@outlook.com wrote: First of all, I do not know Scala, but learning. I'm doing a proof of concept by streaming content from a socket, counting the words and write it to a
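A sketch of this suggestion with a per-batch output path, using the (rdd, time) variant of foreachRDD; 'wordCounts' and the output prefix are placeholders:

    import org.apache.spark.SparkContext._   // sequence-file implicits for pair RDDs in Spark 1.x

    wordCounts.foreachRDD { (rdd, time) =>
      // give each batch its own directory so later batches do not overwrite earlier ones
      rdd.saveAsSequenceFile(s"hdfs:///output/counts-${time.milliseconds}")
    }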

RE: Executor metrics in spark application

2014-07-22 Thread Denes
Hi Jerry, I know that way of registering a metrics source, but it seems to defeat the whole purpose. I'd like to define a source that is set within the application, for example the number of parsed messages. If I register it in metrics.properties, how can I obtain the instance? (or instances?) How can I

Spark over graphviz (SPARK-1015, SPARK-975)

2014-07-22 Thread jay vyas
Hi spark. I see there has been some work around graphviz visualization for spark jobs. 1) I'm wondering if anyone is actively maintaining this stuff, and if so what the best docs are for it - or else, whether there is interest in an upstream JIRA for updating the graphviz APIs. 2) Also, I am curious

RE: Executor metrics in spark application

2014-07-22 Thread Shao, Saisai
Yeah, I'm starting to understand your purpose. The original design of customized metrics sources focused on self-contained sources; it seems you need to rely on an outer variable, so the way you mentioned may be the only way to register. Besides, as you cannot see the source in Ganglia, I think you can

collect() on small group of Avro files causes plain NullPointerException

2014-07-22 Thread Sparky
Running a simple collect method on a group of Avro objects causes a plain NullPointerException. Does anyone know what may be wrong? files.collect() Press ENTER or type command to continue Exception in thread Executor task launch worker-0 java.lang.NullPointerException at

Re: collect() on small list causes NullPointerException

2014-07-22 Thread Sparky
For those curious: I was using a KryoRegistrator and it was causing the null pointer exception. I removed the code and the problem went away.

Re: collect() on small group of Avro files causes plain NullPointerException

2014-07-22 Thread Eugen Cepoi
Do you have a list/array in your avro record? If yes, this could cause the problem. I experienced this kind of problem and solved it by providing custom kryo ser/de for avro lists. Also be careful: spark reuses records, so if you just read them and then don't copy/transform them you would end up with
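A sketch of the copy-before-collect idea, under the assumption that 'files' holds Avro SpecificRecord objects:

    import org.apache.avro.specific.SpecificData

    // deep-copy every record so reuse of the underlying Avro object does not corrupt collected results
    val copied = files.map(r => SpecificData.get().deepCopy(r.getSchema, r))
    copied.collect()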

hadoop version

2014-07-22 Thread mrm
Hi, Where can I find the version of Hadoop my cluster is using? I launched my ec2 cluster using the spark-ec2 script with the --hadoop-major-version=2 option. However, the folder hadoop-native/lib in the master node only contains files that end in 1.0.0. Does that mean that I have Hadoop version

the implications of some items in webUI

2014-07-22 Thread Yifan LI
Hi, I am analysing my application's processing on Spark (GraphX), but I am a little confused about some items in the web UI. 1) What is the difference between Duration (Stages - Completed Stages) and Task Time (Executors)? For instance, 43 s vs. 5.6 min. Task Time is approximately Duration multiplied

Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
Using a case class as a key doesn't seem to work properly. [Spark 1.0.0] A minimal example: case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect [Spark shell local mode] res: Array[(P, Int)] =

Re: Is there anyone who use streaming join to filter spam as guide mentioned?

2014-07-22 Thread hawkwang
Hi TD, Eventually I found that I made a mistake - the RDD I used for the join did not contain any content. Now it works. Thanks, Hawk On July 21, 2014 at 17:58, Tathagata Das wrote: Could you share your code snippet so that we can take a look? TD On Mon, Jul 21, 2014 at 7:23 AM, hawkwang

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
Just to narrow down the issue, it looks like the issue is in 'reduceByKey' and derivatives like 'distinct'. groupByKey() seems to work: sc.parallelize(ps).map(x => (x.name, 1)).groupByKey().collect res: Array[(String, Iterable[Int])] = Array((charly,ArrayBuffer(1)), (abe,ArrayBuffer(1)),

Spark Streaming - How to save all items in batchs from beginning to a single stream rdd?

2014-07-22 Thread hawkwang
Hi guys, Is it possible to generate a single stream RDD which can be updated with new batch RDD content? I know that we can use updateStateByKey to do aggregation, but here I just want to keep tracking all historical original content. I also noticed that we can save to redis or other storage
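One hedged sketch (the stream name and element type are assumptions) is to keep a driver-side RDD and union each batch into it; note the lineage grows with every batch, so in practice this would need periodic checkpointing or writing out:

    var allData = ssc.sparkContext.parallelize(Seq.empty[String])  // String element type is an assumption

    stream.foreachRDD { rdd =>
      allData = allData.union(rdd)   // accumulate this batch's content
      allData.cache()
    }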

Re: Problem running Spark shell (1.0.0) on EMR

2014-07-22 Thread Martin Goodson
I am also having exactly the same problem, calling it using pyspark. Has anyone managed to get this script to work? -- Martin Goodson | VP Data Science (0)20 3397 1240 On Wed, Jul 16, 2014 at 2:10 PM, Ian Wilkinson ia...@me.com wrote: Hi, I'm trying to run the Spark

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Daniel Siegmann
I can confirm this bug. The behavior for groupByKey is the same as reduceByKey - your example is actually grouping on just the name. Try this: sc.parallelize(ps).map(x => (x, 1)).groupByKey().collect res1: Array[(P, Iterable[Int])] = Array((P(bob),ArrayBuffer(1)), (P(bob),ArrayBuffer(1)),

Tranforming flume events using Spark transformation functions

2014-07-22 Thread Sundaram, Muthu X.
Hi All, I am getting events from flume using the following line: JavaDStream<SparkFlumeEvent> flumeStream = FlumeUtils.createStream(ssc, host, port); Each event is a delimited record. I would like to use some of the transformation functions like map and reduce on this. Do I need to convert the

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
Yes, right: 'sc.parallelize(ps).map(x => (x.name, 1)).groupByKey().collect' - an oversight from my side. Thanks!, Gerard. On Tue, Jul 22, 2014 at 5:24 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I can confirm this bug. The behavior for groupByKey is the same as reduceByKey - your

Re: saveAsSequenceFile for DStream

2014-07-22 Thread Barnaby Falls
Thanks Sean! I got that working last night similar to how you solved it. Any ideas about how to monitor that same folder in another script by creating a stream? I can use sc.sequenceFile() to read in the RDD, but how do I get the name of the file that got added since there is no
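For watching that output directory from another streaming job, a hedged sketch using StreamingContext.fileStream with the new-API sequence-file input format (the key/value types and the path are assumptions based on the word-count use case):

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

    // pick up any sequence file newly written into the directory
    val counts = ssc.fileStream[Text, IntWritable, SequenceFileInputFormat[Text, IntWritable]]("hdfs:///output")
    counts.map { case (word, n) => (word.toString, n.get) }.print()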

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
I created https://issues.apache.org/jira/browse/SPARK-2620 to track this. Maybe useful to know, this is a regression on Spark 1.0.0. I tested the same sample code on 0.9.1 and it worked (we have several jobs using case classes as key aggregators, so it better does) -kr, Gerard. On Tue, Jul 22,

Re: Spark Streaming source from Amazon Kinesis

2014-07-22 Thread Chris Fregly
i took this over from parviz. i recently submitted a new PR for Kinesis Spark Streaming support: https://github.com/apache/spark/pull/1434 others have tested it with good success, so give it a whirl! waiting for it to be reviewed/merged. please put any feedback into the PR directly. thanks!

Re: data locality

2014-07-22 Thread Sandy Ryza
On standalone there is still special handling for assigning tasks within executors. There just isn't special handling for where to place executors, because standalone generally places an executor on every node. On Mon, Jul 21, 2014 at 7:42 PM, Haopu Wang hw...@qilinsoft.com wrote: Sandy,

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-22 Thread Andre Schumacher
Hi, I don't think anybody has been testing importing of Impala tables directly. Is there any chance to export these first, say as unpartitioned Hive tables and import these? Just an idea.. Andre On 07/21/2014 11:46 PM, chutium wrote: no, something like this 14/07/20 00:19:29 ERROR

Spark sql with hive table running on Yarn-cluster mode

2014-07-22 Thread Jenny Zhao
Hi, For running spark sql, the datanucleus*.jar files are automatically added to the classpath. This works fine for spark standalone mode and yarn-client mode; however, for yarn-cluster mode, I have to explicitly pass these jars using the --jars option when submitting the job, otherwise the job will fail. Why

Spark app vs SparkSQL app

2014-07-22 Thread buntu
I could possibly use the Spark API and write a batch app to provide some per-web-page stats such as views, uniques etc. The same can be achieved using SparkSQL, so I wanted to check: * what are the best practices and pros/cons of either approach? * Does SparkSQL require registerAsTable for

Re: Why spark-submit command hangs?

2014-07-22 Thread Andrew Or
Hi Earthson, Is your problem resolved? The way you submit your application looks alright to me; spark-submit should be able to parse the combination of --master and --deploy-mode correctly. I suspect you might have hard-coded yarn-cluster or something in your application. Andrew 2014-07-22

Re: hadoop version

2014-07-22 Thread Andrew Or
Hi Maria, Having files that end with 1.0.0 means you're on Spark 1.0.0, not Hadoop 1.0. You can check your hadoop version by running $HADOOP_HOME/bin/hadoop version, where HADOOP_HOME is set to your installation of hadoop. On the clusters started by the Spark ec2 scripts, this should be

Re: spark streaming rate limiting from kafka

2014-07-22 Thread Bill Jay
Hi Tobias, I tried to use 10 as numPartitions. The number of executors allocated is the number of DStreams. Therefore, it seems the parameter does not spread data into many partitions. In order to do that, it seems we have to do a repartition. If numPartitions will distribute the data to multiple
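A hedged sketch of the repartition approach being discussed (the ZooKeeper address, group id and topic map are illustrative):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaStream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("events" -> 1))

    // a single createStream call uses one receiver; repartition spreads each batch over more tasks
    val spread = kafkaStream.repartition(10)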

Re: Spark job tracker.

2014-07-22 Thread Marcelo Vanzin
I don't understand what you're trying to do. The code will use log4j under the covers. The default configuration means writing log messages to stderr. In yarn-client mode that is your terminal screen, in yarn-cluster mode that is redirected to a file by Yarn. For the executors, that will always

combineByKey at ShuffledDStream.scala

2014-07-22 Thread Bill Jay
Hi all, I am currently running a Spark Streaming program which consumes data from Kafka and does a group-by operation on the data. I am trying to optimize the running time of the program because it looks slow to me. It seems the stage named 'combineByKey at ShuffledDStream.scala:42' always

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-22 Thread Sandy Ryza
I haven't had a chance to look at the details of this issue, but we have seen Spark successfully read Parquet tables created by Impala. On Tue, Jul 22, 2014 at 10:10 AM, Andre Schumacher andre.sc...@gmail.com wrote: Hi, I don't think anybody has been testing importing of Impala tables

Need info on log4j.properties for apache spark.

2014-07-22 Thread abhiguruvayya
Hello All, Basically I need to edit log4j.properties to filter out some of the unnecessary logs in spark on yarn-client mode. I am not sure where I can find the log4j.properties file (location). Can anyone help me with this?

Re: Spark job tracker.

2014-07-22 Thread abhiguruvayya
I fixed the error with the yarn-client mode issue which I mentioned in my earlier post. Now I want to edit log4j.properties to filter out some of the unnecessary logs. Can you let me know where I can find this properties file?

Spark Streaming: no job has started yet

2014-07-22 Thread Bill Jay
Hi all, I am running a spark streaming job. The job hangs on one stage, which shows as follows: Details for Stage 4 - Summary Metrics: No tasks have started yet; Tasks: No tasks have started yet. Does anyone have an idea on this? Thanks! Bill

What if there are large, read-only variables shared by all map functions?

2014-07-22 Thread Parthus
Hi there, I was wondering if anybody could help me find an efficient way to write a MapReduce program like this: 1) Each map function needs access to some huge files, around 6GB in total. 2) These files are READ-ONLY. Actually they are like a huge look-up table, which will not change
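The standard Spark tools for this are broadcast variables (when the table fits in executor memory) or loading the table once per executor; a hedged sketch, where loadLookupTable() and the key type are hypothetical placeholders:

    // ship the read-only table to every executor once, instead of once per task
    val lookup = sc.broadcast(loadLookupTable())   // loadLookupTable(): Map[String, String] is hypothetical

    val resolved = records.map { key =>
      lookup.value.getOrElse(key, "unknown")
    }

For a table as large as 6 GB a broadcast may still be impractical; loading it lazily once per executor (for example from a singleton object used inside mapPartitions) is a common alternative.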

Re: Very wierd behavior

2014-07-22 Thread Matei Zaharia
Is the first() being computed locally on the driver program? Maybe it's too hard to compute with the memory, etc. available there. Take a look at the driver's log and see whether it has the message "Computing the requested partition locally". Matei On Jul 22, 2014, at 12:04 PM, Nathan Kronenfeld

RE: Hive From Spark

2014-07-22 Thread Andrew Lee
Hi Sean, Thanks for clarifying. I re-read SPARK-2420 and now have a better understanding. From a user perspective, what would you recommend for building Spark with Hive 0.12 / 0.13+ libraries moving forward and deploying to a production cluster that runs on an older version of Hadoop (e.g. 2.2 or 2.4)?

Re: Spark job tracker.

2014-07-22 Thread Marcelo Vanzin
You can upload your own log4j.properties using spark-submit's --files argument. On Tue, Jul 22, 2014 at 12:45 PM, abhiguruvayya sharath.abhis...@gmail.com wrote: I fixed the error with the yarn-client mode issue which i mentioned in my earlier post. Now i want to edit the log4j.properties to

RE: Tranforming flume events using Spark transformation functions

2014-07-22 Thread Sundaram, Muthu X.
I tried to map SparkFlumeEvents to RDDs of Strings like below, but that map and call are not executed at all. I might be doing this in the wrong way. Any help would be appreciated. flumeStream.foreach(new Function<JavaRDD<SparkFlumeEvent>, Void>() { @Override public Void

Re: Spark job tracker.

2014-07-22 Thread abhiguruvayya
Thanks, I am able to load the file now. Can I turn off specific logs using log4j.properties? I don't want to see the logs below. How can I do this? 14/07/22 14:01:24 INFO scheduler.TaskSetManager: Starting task 2.0:129 as TID 129 on executor 3: ** (NODE_LOCAL) 14/07/22 14:01:24 INFO

Re: Apache kafka + spark + Parquet

2014-07-22 Thread buntu
Now we are storing data directly from Kafka to Parquet. We are currently using Camus and wanted to know how you went about storing to Parquet?

Re: Spark job tracker.

2014-07-22 Thread Marcelo Vanzin
The spark log classes are based on the actual class names. So if you want to filter out a package's logs you need to specify the full package name (e.g. org.apache.spark.storage instead of just spark.storage). On Tue, Jul 22, 2014 at 2:07 PM, abhiguruvayya sharath.abhis...@gmail.com wrote:
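Putting the two replies together, a hedged example of log4j.properties entries that would hide the TaskSetManager INFO lines quoted earlier; the file can then be shipped with spark-submit's --files as suggested above:

    # keep only WARN and above from the noisy scheduler class
    log4j.logger.org.apache.spark.scheduler.TaskSetManager=WARN
    # or raise the level for the whole package
    log4j.logger.org.apache.spark.scheduler=WARN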

How to do an interactive Spark SQL

2014-07-22 Thread hsy...@gmail.com
Hi guys, I'm able to run some Spark SQL examples but the SQL is static in the code. I would like to know if there is a way to read the SQL from somewhere else (a shell, for example). I could read the SQL statement from kafka/zookeeper, but I cannot share the SQL with all the workers. broadcast seems not to work for

Re: How to do an interactive Spark SQL

2014-07-22 Thread Zongheng Yang
Do you mean that the texts of the SQL queries are hardcoded in the code? What do you mean by cannot share the SQL with all workers? On Tue, Jul 22, 2014 at 4:03 PM, hsy...@gmail.com hsy...@gmail.com wrote: Hi guys, I'm able to run some Spark SQL example but the sql is static in the code. I

Spark clustered client

2014-07-22 Thread Asaf Lahav
Hi Folks, I have been trying to dig up some information regarding the possibilities for deploying more than one client process that consumes Spark. Let's say I have a Spark cluster of 10 servers, and would like to set up 2 additional servers which send requests to it

Re: How to do an interactive Spark SQL

2014-07-22 Thread hsy...@gmail.com
Sorry, typo. What I meant is sharing. If the SQL is changing at runtime, how do I broadcast the SQL to all the workers that are doing SQL analysis? Best, Siyuan On Tue, Jul 22, 2014 at 4:15 PM, Zongheng Yang zonghen...@gmail.com wrote: Do you mean that the texts of the SQL queries being hardcoded

Re: How to do an interactive Spark SQL

2014-07-22 Thread Zongheng Yang
Can you paste a small code example to illustrate your questions? On Tue, Jul 22, 2014 at 5:05 PM, hsy...@gmail.com hsy...@gmail.com wrote: Sorry, typo. What I mean is sharing. If the sql is changing at runtime, how do I broadcast the sql to all workers that is doing sql analysis. Best,

Re: the implications of some items in webUI

2014-07-22 Thread Ankur Dave
On Tue, Jul 22, 2014 at 7:08 AM, Yifan LI iamyifa...@gmail.com wrote: 1) what is the difference between Duration(Stages - Completed Stages) and Task Time(Executors) ? Stages are composed of tasks that run on executors. Tasks within a stage may run concurrently, since there are multiple

RE: Joining by timestamp.

2014-07-22 Thread durga
Thanks Chen

How could I start new spark cluster with hadoop2.0.2

2014-07-22 Thread durga
Hi, I am trying to create a spark cluster using the spark-ec2 file under the spark1.0.1 directory. 1) I noticed that it is always creating hadoop version 1.0.4. Is there a way I can override that? I would like to have hadoop2.0.2. 2) I also want to install Oozie along with it. Are there any scripts available along

Re: How to do an interactive Spark SQL

2014-07-22 Thread hsy...@gmail.com
For example, this is what I tested and it works in local mode. What it does is get both the data and the sql query from kafka, run the sql on each RDD, and output the result back to kafka again. I defined a var called sqlS. In the streaming part, as you can see, I change the sql statement if it consumes a sql

Re: How to do an interactive Spark SQL

2014-07-22 Thread Tobias Pfeiffer
Hi, as far as I know, after the Streaming Context has started, the processing pipeline (e.g., filter.map.join.filter) cannot be changed. As your SQL statement is transformed into RDD operations when the Streaming Context starts, I think there is no way to change the statement that is executed on

streaming window not behaving as advertised (v1.0.1)

2014-07-22 Thread Alan Ngai
I have a sample application pumping out records 1 per second. The batch interval is set to 5 seconds. Here’s a list of “observed window intervals” vs what was actually set window=25, slide=25 : observed-window=25, overlapped-batches=0 window=25, slide=20 : observed-window=20,

Re: How to do an interactive Spark SQL

2014-07-22 Thread hsy...@gmail.com
But how do they do the interactive sql in the demo? https://www.youtube.com/watch?v=dJQ5lV5Tldw And if it can work in local mode, I think it should be able to work in cluster mode, correct? On Tue, Jul 22, 2014 at 5:58 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, as far as I know,

Where is the PowerGraph abstraction

2014-07-22 Thread shijiaxin
I downloaded spark 1.0.1, but I cannot find the PowerGraph abstraction mentioned in the GraphX paper. What I can find is the pregel abstraction.

Re: Spark Streaming source from Amazon Kinesis

2014-07-22 Thread Tathagata Das
I will take a look at it tomorrow! TD On Tue, Jul 22, 2014 at 9:30 AM, Chris Fregly ch...@fregly.com wrote: i took this over from parviz. i recently submitted a new PR for Kinesis Spark Streaming support: https://github.com/apache/spark/pull/1434 others have tested it with good success,

Re: combineByKey at ShuffledDStream.scala

2014-07-22 Thread Tathagata Das
Can you give an idea of the streaming program? What are the rest of the transformations you are doing on the input streams? On Tue, Jul 22, 2014 at 11:05 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am currently running a Spark Streaming program, which consumes data from Kakfa and does the

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-22 Thread Yin Huai
It is caused by a bug in the Spark REPL. I still do not know which part of the REPL code causes it... I think people working on the REPL may have a better idea. Regarding how I found it: based on the exception, it seems we pulled in some irrelevant stuff, and that import was pretty suspicious. Thanks, Yin On

Re: streaming window not behaving as advertised (v1.0.1)

2014-07-22 Thread Tathagata Das
It could be related to this bug that is currently open: https://issues.apache.org/jira/browse/SPARK-1312 Here is a workaround. Can you put an inputStream.foreachRDD(rdd => { }) and try these combos again? TD On Tue, Jul 22, 2014 at 6:01 PM, Alan Ngai a...@opsclarity.com wrote: I have a sample

Re: Tranforming flume events using Spark transformation functions

2014-07-22 Thread Tathagata Das
This is because of the RDD's lazy evaluation! Unless you force a transformed (mapped/filtered/etc.) RDD to give you back some data (like RDD.count) or output the data (like RDD.saveAsTextFile()), Spark will not do anything. So after the eventData.map(...), if you do take(10) and then print the
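A hedged Scala sketch of the point being made, with the parsing of the Flume event body as an assumption: the map alone is lazy, and only the action at the end triggers it.

    flumeStream.foreachRDD { rdd =>
      val lines = rdd.map(event => new String(event.event.getBody.array()))  // body decoding is an assumption
      lines.take(10).foreach(println)   // take(...) is an action; without it the map above never runs
    }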

Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

2014-07-22 Thread rindra
Hello Andrew, Thank you very much for your great tips. Your solution worked perfectly. In fact, I was not aware that the right option for local mode is --driver-memory 1g. Cheers, Rindra On Mon, Jul 21, 2014 at 11:23 AM, Andrew Or-2 [via Apache Spark User List]

RE: Executor metrics in spark application

2014-07-22 Thread Denes
As far as I understand, even if I could register the custom source, there is no way to have a cluster-wide variable to pass to it; i.e. an accumulator can be modified by tasks, but only the driver can read it, and a broadcast value is constant. So it seems this custom metrics/sinks functionality

Re: akka disassociated on GC

2014-07-22 Thread Makoto Yui
Hi Xiangrui, By using your treeAggregate and broadcast patch, the evaluation has been processed successfully. I expect that these patches are merged in the next major release (v1.1?). Without them, it would be hard to use mllib for a large dataset. Thanks, Makoto (2014/07/16 15:05),