Re: Dynamic visualizations from Spark Streaming output?

2014-10-01 Thread Akhil Das
You can start with the Zeppelin project http://zeppelin-project.org/; it currently has support for SparkContext (not streaming), and since the code is open you can customize it for your use case. Here's a video of it: https://www.youtube.com/watch?v=_PQbVH_aO5E&feature=youtu.be Thanks Best Regards On

Re: Multiple exceptions in Spark Streaming

2014-10-01 Thread Akhil Das
Looks like a configuration issue, can you paste your spark-env.sh on the worker? Thanks Best Regards On Wed, Oct 1, 2014 at 8:27 AM, Tathagata Das tathagata.das1...@gmail.com wrote: It would help to turn on debug level logging in log4j and see the logs. Just looking at the error logs is not

Re: Multiple exceptions in Spark Streaming

2014-10-01 Thread Shaikh Riyaz
Hi Akhil, Thanks for your reply. We are using CDH 5.1.3 and the Spark configuration is taken care of by the Cloudera configuration. Please let me know if you would like to review the configuration. Regards, Riyaz On Wed, Oct 1, 2014 at 10:10 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Looks like a

Re: Multiple exceptions in Spark Streaming

2014-10-01 Thread Akhil Das
In that case, fire up a spark-shell and try the following: scala> import org.apache.spark.streaming.{Seconds, StreamingContext} scala> import org.apache.spark.streaming.StreamingContext._ scala> val ssc = new StreamingContext("spark://YOUR-SPARK-MASTER-URI", "Streaming
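
Completed, that spark-shell smoke test might look roughly like the following; the master URI, app name, batch interval and socket source are placeholders:

    scala> import org.apache.spark.streaming.{Seconds, StreamingContext}
    scala> import org.apache.spark.streaming.StreamingContext._
    scala> val ssc = new StreamingContext("spark://YOUR-SPARK-MASTER-URI", "StreamingTest", Seconds(10))
    scala> val lines = ssc.socketTextStream("localhost", 9999)   // any simple source is enough to exercise the workers
    scala> lines.count().print()
    scala> ssc.start()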

persistent state for spark streaming

2014-10-01 Thread Chia-Chun Shih
Hi, My application is to digest user logs and deduct user quotas. I need to maintain the latest states of user quotas persistently, so that the latest user quotas will not be lost. I have tried *updateStateByKey* to generate a DStream for user quotas and called

any code examples demonstrating spark streaming applications which depend on states?

2014-10-01 Thread Chia-Chun Shih
Hi, Are there any code examples demonstrating spark streaming applications which depend on states? That is, last-run *updateStateByKey* results are used as inputs. Thanks.
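
For reference, a minimal sketch of a stateful computation with updateStateByKey (a running count per key); the checkpoint directory and socket source are placeholders, and sc is assumed to be an existing SparkContext:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // required for stateful operations

    val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

    // The previous state (the running count) comes back in as `state` on every batch.
    val updateFunc = (values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0))
    val runningCounts = pairs.updateStateByKey[Int](updateFunc)
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()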

Re: how to get actual count from as long from JavaDStream ?

2014-10-01 Thread Sean Owen
It's much easier than all this. Spark Streaming gives you a DStream of RDDs. You want the count for each RDD. DStream.count() gives you exactly that: a DStream of Longs which are the counts of events in each mini batch. On Tue, Sep 30, 2014 at 8:42 PM, Andy Davidson a...@santacruzintegration.com

Spark AccumulatorParam generic

2014-10-01 Thread Johan Stenberg
Hi, I have a problem with using accumulators in Spark. As seen on the Spark website, if you want custom accumulators you can simply extend (with an object) the AccumulatorParam trait. The problem is that I need to make that object generic, such as this: object SeqAccumulatorParam[B] extends

Re: Spark AccumulatorParam generic

2014-10-01 Thread Johan Stenberg
Just realized that, of course, objects can't be generic, but how do I create a generic AccumulatorParam? 2014-10-01 12:33 GMT+02:00 Johan Stenberg johanstenber...@gmail.com: Hi, I have a problem with using accumulators in Spark. As seen on the Spark website, if you want custom accumulators

Re: Installation question

2014-10-01 Thread Sean Owen
If you want a single-machine 'cluster' to try all of these things, you don't strictly need a distribution, but, it will probably save you a great deal of time and trouble compared to setting all of this up by hand. Naturally I would promote CDH, as it contains Spark and Mahout and supports them

Re: Spark SQL + Hive + JobConf NoClassDefFoundError

2014-10-01 Thread Patrick McGloin
FYI, in case anybody else has this problem, we switched to Spark 1.1 (outside CDH) and the same Spark application worked first time (once recompiled with Spark 1.1 libs of course). I assume this is because Spark 1.1 is compiled with Hive. On 29 September 2014 17:41, Patrick McGloin

Re: MLLib ALS question

2014-10-01 Thread Alex T
Hi, thanks for the reply. I added the ALS.setIntermediateRDDStorageLevel and it worked well (a little slow, but it still did the job and I've done the MF and got all the features). But even if I persist with DISK_ONLY, the system monitor shows on the memory and swap history that Apache Spark is using
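
For reference, the setting being discussed can be applied like this (a sketch against Spark 1.1's MLlib; the input path, parsing and ALS parameters are placeholders):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.storage.StorageLevel

    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>   // hypothetical input format: user,product,rating
      val Array(user, product, rating) = line.split(",")
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    val als = new ALS()
      .setRank(20)
      .setIterations(10)
      .setLambda(0.01)
      .setIntermediateRDDStorageLevel(StorageLevel.DISK_ONLY)   // spill the intermediate in/out blocks to disk
    val model = als.run(ratings)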

Re: Spark Streaming for time consuming job

2014-10-01 Thread Mayur Rustagi
Calling collect on anything is almost always a bad idea. The only exception is if you are looking to pass that data on to some other system and never see it again :) . I would say you need to implement outlier detection on the RDD and process it in Spark itself rather than calling collect on it.

Solution for small files in HDFS

2014-10-01 Thread rzykov
We encountered a problem of loading a huge number of small files (hundreds of thousands of files) from HDFS in Spark. Our jobs failed over time. This forced us to write our own loader that combines files by means of Hadoop CombineFileInputFormat. It significantly reduced the number of mappers from 10
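
An alternative to a hand-written loader, assuming Hadoop 2.x (which ships CombineTextInputFormat), is to let Spark read through a combining input format; the path and split size below are placeholders:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    // Pack many small files into a few large splits instead of one task per file.
    sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)
    val lines = sc.newAPIHadoopFile(
        "hdfs:///data/small-files/*",
        classOf[CombineTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      .map(_._2.toString)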

KryoSerializer exception in Spark Streaming JAVA

2014-10-01 Thread Mudassar Sarwar
Hi, I'm implementing KryoSerializer for my custom class. Here is the class: public class ImpressionFactsValue implements KryoSerializable { private int hits; public ImpressionFactsValue() { } public int getHits() {

Re: persistent state for spark streaming

2014-10-01 Thread Yana Kadiyska
I don't think persist is meant for end-user usage. You might want to call saveAsTextFiles, for example, if you're saving to the file system as strings. You can also dump the DStream to a DB -- there are samples on this list (you'd have to do a combo of foreachRDD and mapPartitions, likely) On
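
A rough sketch of the foreachRDD/foreachPartition pattern being referred to, writing each batch to a JDBC sink; the table, column and URL are hypothetical:

    import java.sql.DriverManager
    import org.apache.spark.streaming.dstream.DStream

    def saveToDb(stream: DStream[String], jdbcUrl: String): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val conn = DriverManager.getConnection(jdbcUrl)   // one connection per partition, not per record
          try {
            val stmt = conn.prepareStatement("INSERT INTO quotas (value) VALUES (?)")   // hypothetical table
            records.foreach { r => stmt.setString(1, r); stmt.executeUpdate() }
          } finally {
            conn.close()
          }
        }
      }
    }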

Re: How to get SparckContext inside mapPartitions?

2014-10-01 Thread Daniel Siegmann
I don't think you can get a SparkContext inside an RDD function (such as mapPartitions), but you shouldn't need to. Have you considered returning the data read from the database from mapPartitions to create a new RDD and then just save it to a file like normal? For example:

Problem with very slow behaviour of TorrentBroadcast vs. HttpBroadcast

2014-10-01 Thread Guillaume Pitel
Hi, We've had some performance issues since switching to 1.1.0, and we finally found the origin: TorrentBroadcast seems to be very slow in our setting (and it became the default with 1.1.0). The logs of a 4MB variable with TB: (15s) 14/10/01 15:47:13 INFO storage.MemoryStore: Block

GraphX: Types for the Nodes and Edges

2014-10-01 Thread Oshi
Hi, Sorry this question may be trivial. I'm new to Spark and GraphX. I need to create a graph that has different types of nodes (3 types) and edges (4 types). Each type of node and edge has a different list of attributes. 1) How should I build the graph? Should I specify all types of nodes (or

Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread andy petrella
I'll try my best ;-). 1/ you could create an abstract type for the types (one on top of the vertex types, another on top of the edge types), then use the subclasses as payload in your VertexRDD or in your Edge. Regarding storage and files, it doesn't really matter (unless you want to use the OOTB loading method, thus
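
A sketch of that abstract-supertype approach; the concrete node and edge case classes below are invented for illustration:

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    sealed trait NodeAttr                                  // one supertype covering the 3 node types
    case class Author(name: String) extends NodeAttr
    case class Paper(title: String, year: Int) extends NodeAttr
    case class Venue(name: String) extends NodeAttr

    sealed trait EdgeAttr                                  // one supertype covering the 4 edge types
    case class Wrote(position: Int) extends EdgeAttr
    case class PublishedIn() extends EdgeAttr
    case class Cites() extends EdgeAttr
    case class Reviews() extends EdgeAttr

    // The graph is then typed on the supertypes, and pattern matching recovers the concrete payloads.
    def build(vertices: RDD[(VertexId, NodeAttr)], edges: RDD[Edge[EdgeAttr]]): Graph[NodeAttr, EdgeAttr] =
      Graph(vertices, edges)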

Re: Spark AccumulatorParam generic

2014-10-01 Thread Debasish Das
Can't you extend a class in place of an object, which can be generic? class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] { } On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg johanstenber...@gmail.com wrote: Just realized that, of course, objects can't be generic, but how do I create a
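
Fleshed out against the Spark 1.x accumulator API, that suggestion might look like the following sketch:

    import org.apache.spark.{AccumulatorParam, SparkContext}

    // A class (unlike an object) can carry the type parameter.
    class SeqAccumulatorParam[B] extends AccumulatorParam[Seq[B]] {
      def zero(initialValue: Seq[B]): Seq[B] = Seq.empty[B]
      def addInPlace(s1: Seq[B], s2: Seq[B]): Seq[B] = s1 ++ s2
    }

    def makeSeqAccumulator[B](sc: SparkContext) =
      sc.accumulator(Seq.empty[B])(new SeqAccumulatorParam[B])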

MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Tomer Benyamini
Hi, I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with MultipleTextOutputFormat: outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class, MultipleTextOutputFormat.class); but I'm getting this compilation error: Bound mismatch: The generic method

Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
Hi, What's the relationship between Spark worker and executor memory settings in standalone mode? Do they work independently or does the worker cap executor memory? Also, is the number of concurrent executors per worker capped by the number of CPU cores configured for the worker?

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
Are you trying to do something along the lines of what's described here? https://issues.apache.org/jira/browse/SPARK-3533 On Wed, Oct 1, 2014 at 10:53 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with

org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-01 Thread tonsat
We have a configuration of CDH 5.0, Spark 1.1.0 (standalone) and Hive 0.12. We are trying to run some realtime analytics from Tableau 8.1 (BI tool) and one of the dashboards was failing with the below error. Is it something to do with an aggregate function not supported by spark-sql? Any help appreciated.

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Tomer Benyamini
Yes exactly.. so I guess this is still an open request. Any workaround? On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Are you trying to do something along the lines of what's described here? https://issues.apache.org/jira/browse/SPARK-3533 On Wed, Oct 1,

Re: Reading from HBase is too slow

2014-10-01 Thread Tao Xiao
I can submit a MapReduce job reading that table, although its processing rate is also a little slower than I expected, but not as slow as Spark. 2014-10-01 12:04 GMT+08:00 Ted Yu yuzhih...@gmail.com: Can you launch a job which exercises TableInputFormat on the same table without using Spark

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
Not that I'm aware of. I'm looking for a work-around myself! On Wed, Oct 1, 2014 at 11:15 AM, Tomer Benyamini tomer@gmail.com wrote: Yes exactly.. so I guess this is still an open request. Any workaround? On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
There is this thread on Stack Overflow http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job about the same topic, which you may find helpful. On Wed, Oct 1, 2014 at 11:17 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Not that I'm aware of.
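
The workaround usually suggested in that Stack Overflow thread goes through the old mapred API and saveAsHadoopFile instead; a rough Scala sketch, with the key-to-directory mapping purely illustrative:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.rdd.RDD

    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Route each record into a directory named after its key.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
      // Drop the key from the written lines.
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
    }

    def writeByKey(pairs: RDD[(String, String)], path: String): Unit =
      pairs.saveAsHadoopFile(path, classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])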

Re: MLLib: Missing value imputation

2014-10-01 Thread Debasish Das
If the missing values are 0, then you can also look into implicit formulation... On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng men...@gmail.com wrote: We don't handle missing value imputation in the current version of MLlib. In future releases, we can store feature information in the

spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
Hi all, I cannot figure out why this command is not setting the driver memory (it is setting the executor memory): conf = (SparkConf() .setMaster("yarn-client") .setAppName("test") .set("spark.driver.memory", "1G")

Re: Reading from HBase is too slow

2014-10-01 Thread Vladimir Rodionov
Using TableInputFormat is not the fastest way of reading data from HBase. Do not expect 100s of Mb per sec. You probably should take a look at M/R over HBase snapshots. https://issues.apache.org/jira/browse/HBASE-8369 -Vladimir Rodionov On Wed, Oct 1, 2014 at 8:17 AM, Tao Xiao

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
You can't set up the driver memory programmatically in client mode. In that mode, the same JVM is running the driver, so you can't modify command line options anymore when initializing the SparkContext. (And you can't really start cluster mode apps that way, so the only way to set this is through

Re: Reading from HBase is too slow

2014-10-01 Thread Ted Yu
As far as I know, that feature is not in CDH 5.0.0 FYI On Wed, Oct 1, 2014 at 9:34 AM, Vladimir Rodionov vrodio...@splicemachine.com wrote: Using TableInputFormat is not the fastest way of reading data from HBase. Do not expect 100s of Mb per sec. You probably should take a look at M/R over

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Tamas Jambor
thanks Marcelo. What's the reason it is not possible in cluster mode, either? On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin van...@cloudera.com wrote: You can't set up the driver memory programatically in client mode. In that mode, the same JVM is running the driver, so you can't modify

Re: Spark 1.1.0 hbase_inputformat.py not work

2014-10-01 Thread Kan Zhang
CC user@ for indexing. Glad you fixed it. All source code for these examples are under SPARK_HOME/examples. For example, the converters used here are in examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala Btw, you may find our blog post useful.

Re: MLLib ALS question

2014-10-01 Thread Xiangrui Meng
ALS still needs to load and deserialize the in/out blocks (one by one) from disk and then construct least squares subproblems. All happen in RAM. The final model is also stored in memory. -Xiangrui On Wed, Oct 1, 2014 at 4:36 AM, Alex T chiorts...@gmail.com wrote: Hi, thanks for the reply. I

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
Because that's not how you launch apps in cluster mode; you have to do it through the command line, or by calling directly the respective backend code to launch it. (That being said, it would be nice to have a programmatic way of launching apps that handled all this - this has been brought up in

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Tamas Jambor
when you say respective backend code to launch it, I thought this is the way to do that. thanks, Tamas On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin van...@cloudera.com wrote: Because that's not how you launch apps in cluster mode; you have to do it through the command line, or by calling

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
No, you can't instantiate a SparkContext to start apps in cluster mode. For Yarn, for example, you'd have to call directly into org.apache.spark.deploy.yarn.Client; that class will tell the Yarn cluster to launch the driver for you and then instantiate the SparkContext. On Wed, Oct 1, 2014 at

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Andrew Or
Hi Tamas, Yes, Marcelo is right. The reason why it doesn't make sense to set spark.driver.memory in your SparkConf is because your application code, by definition, *is* the driver. This means by the time you get to the code that initializes your SparkConf, your driver JVM has already started with
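
In other words, the driver memory has to be fixed before the driver JVM starts, e.g. on the command line or in spark-defaults.conf; the values below are illustrative:

    # on the command line (pyspark and spark-submit both accept the flag)
    pyspark --master yarn-client --driver-memory 2g
    spark-submit --master yarn-client --driver-memory 2g my_app.py

    # or persistently, in conf/spark-defaults.conf
    spark.driver.memory  2g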

Re: how to get actual count from as long from JavaDStream ?

2014-10-01 Thread Andy Davidson
Hi Sean I guess I am missing something. JavaDStream<String> foo = … JavaDStream<Long> c = foo.count() This is circular. I need to get the count as an actual scalar value, not a JavaDStream. Someone else posted pseudo code that used foreachRDD(). This seems to work for me. Thanks Andy From:

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
Thank you for the replies. It makes sense for scala/java, but in python the JVM is launched when the spark context is initialised, so it should be able to set it, I assume. On 1 Oct 2014 18:25, Andrew Or-2 [via Apache Spark User List] ml-node+s1001560n15510...@n3.nabble.com wrote: Hi Tamas,

Re: IPython Notebook Debug Spam

2014-10-01 Thread Davies Liu
On Tue, Sep 30, 2014 at 10:14 PM, Rick Richardson rick.richard...@gmail.com wrote: I am experiencing significant logging spam when running PySpark in IPython Notebok Exhibit A: http://i.imgur.com/BDP0R2U.png I have taken into consideration advice from:

can I think of JavaDStream foreachRDD() as being 'for each mini batch' ?

2014-10-01 Thread Andy Davidson
Hi I am new to Spark Streaming. Can I think of JavaDStream foreachRDD() as being 'for each mini batch'? The java doc does not say much about this function. Here is the background. I am writing a little test program to figure out how to use streams. At some point I wanted to calculate an

Re: Handling tree reduction algorithm with Spark in parallel

2014-10-01 Thread Andy Twigg
Yes, that makes sense. It's similar to the all reduce pattern in vw. On Wednesday, 1 October 2014, Matei Zaharia matei.zaha...@gmail.com wrote: Some of the MLlib algorithms do tree reduction in 1.1: http://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html. You can

Re: can I think of JavaDStream foreachRDD() as being 'for each mini batch' ?

2014-10-01 Thread Sean Owen
Yes, foreachRDD will do your something for each RDD, which is what you get for each mini-batch of input. The operations you express on a DStream (or JavaDStream) are all, really, for each RDD, including print(). Logging is a little harder to reason about since the logging will happen on a
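
To turn that per-batch count into a plain number, the foreachRDD approach mentioned earlier looks roughly like this (Scala shown for brevity):

    import org.apache.spark.streaming.dstream.DStream

    // DStream.count() emits one Long per mini-batch; foreachRDD unwraps it.
    def logBatchCounts(stream: DStream[String]): Unit = {
      stream.count().foreachRDD { rdd =>
        val n: Long = rdd.take(1).headOption.getOrElse(0L)
        println(s"events in this batch: $n")
      }
    }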

Re: Poor performance writing to S3

2014-10-01 Thread Gustavo Arjones
Hi, I found the answer to my problem, and I'm just writing to keep it as KB. Turns out the problem wasn't related to S3 performance; it was because my SOURCE was not fast enough. Due to the lazy nature of Spark, what I saw on the dashboard was saveAsTextFile at FacebookProcessor.scala:46 instead of the

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
Thanks for your reply. Unfortunately changing the log4j.properties within SPARK_HOME/conf has no effect on pyspark for me. When I change it in the master or workers the log changes have the desired effect, but pyspark seems to ignore them. I have changed the levels to WARN, changed the appender

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya aara...@gmail.com wrote: On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas vcsub...@gmail.com wrote: 1. worker memory caps executor. 2. With default config, every job gets one executor per worker. This executor runs with all cores available to

RE: MLLib: Missing value imputation

2014-10-01 Thread Sameer Tilak
Thanks, Xiangrui and Debashish for your input. Date: Wed, 1 Oct 2014 08:35:51 -0700 Subject: Re: MLLib: Missing value imputation From: debasish.da...@gmail.com To: men...@gmail.com CC: ssti...@live.com; user@spark.apache.org If the missing values are 0, then you can also look into implicit

Re: IPython Notebook Debug Spam

2014-10-01 Thread Davies Liu
How do you setup IPython to access pyspark in notebook? I did as following, it worked for me: $ export SPARK_HOME=/opt/spark-1.1.0/ $ export PYTHONPATH=/opt/spark-1.1.0/python:/opt/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip $ ipython notebook All the logging will go into console (not in

Re: can I think of JavaDStream foreachRDD() as being 'for each mini batch' ?

2014-10-01 Thread Andy Davidson
Hi Sean Many many thanks. This really clears a lot up for me Andy From: Sean Owen so...@cloudera.com Date: Wednesday, October 1, 2014 at 11:27 AM To: Andrew Davidson a...@santacruzintegration.com Cc: user@spark.apache.org user@spark.apache.org Subject: Re: can I think of JavaDStream

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
I was starting PySpark as a profile within IPython Notebook as per: http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/ My setup looks like: import os import sys spark_home = os.environ.get('SPARK_HOME', None) if not spark_home: raise ValueError('SPARK_HOME

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
Here is the other relevant bit of my set-up: MASTER=spark://sparkmaster:7077 IPYTHON_OPTS=notebook --pylab inline --ip=0.0.0.0 CASSANDRA_NODES=cassandra1|cassandra2|cassandra3 PYSPARK_SUBMIT_ARGS=--master $MASTER --deploy-mode client --num-executors 6 --executor-memory 1g --executor-cores 1

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Liquan Pei
One indirect way to control the number of cores used in an executor is to set spark.cores.max and set spark.deploy.spreadOut to be true. The scheduler in the standalone cluster then assigns roughly the same number of cores (spark.cores.max/number of worker nodes) to each executor for an
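
Put together, a standalone setup along those lines might use settings like these (values illustrative):

    # conf/spark-env.sh on each worker: resources the worker may hand out
    SPARK_WORKER_MEMORY=16g
    SPARK_WORKER_CORES=8

    # conf/spark-defaults.conf: what each application asks for
    spark.executor.memory   4g     # per-executor heap, capped by SPARK_WORKER_MEMORY
    spark.cores.max         16     # total cores across the cluster for this app
    spark.deploy.spreadOut  true   # read by the standalone master; spread cores across workers (default true)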

Spark Monitoring with Ganglia

2014-10-01 Thread danilopds
Hi, I need monitoring some aspects about my cluster like network and resources. Ganglia looks like a good option for what I need. Then, I found out that Spark has support to Ganglia. On the Spark monitoring webpage there is this information: To install the GangliaSink you’ll need to perform a

Re: Question About Submit Application

2014-10-01 Thread danilopds
I'll do this test and reply with the result afterwards. Thank you Marcelo. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-About-Submit-Application-tp15072p15539.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

still GC overhead limit exceeded after increasing heap space

2014-10-01 Thread anny9699
Hi, After reading some previous posts about this issue, I have increased the java heap space to -Xms64g -Xmx64g, but still met the java.lang.OutOfMemoryError: GC overhead limit exceeded error. Does anyone have other suggestions? I am reading 200 GB of data and my total memory is 120 GB, so I

Re: still GC overhead limit exceeded after increasing heap space

2014-10-01 Thread Liquan Pei
Hi, How many nodes are in your cluster? It seems to me 64g does not help if each of your nodes doesn't have that much memory. Liquan On Wed, Oct 1, 2014 at 1:37 PM, anny9699 anny9...@gmail.com wrote: Hi, After reading some previous posts about this issue, I have increased the java heap space to

run scalding on spark

2014-10-01 Thread Koert Kuipers
well, sort of! we make input/output formats (cascading taps, scalding sources) available in spark, and we ported the scalding fields api to spark. so it's for those of us that have a serious investment in cascading/scalding and want to leverage that in spark. blog is here:

Re: still GC overhead limit exceeded after increasing heap space

2014-10-01 Thread anny9699
Hi Liquan, I have 8 workers, each with 15.7GB memory. What you said makes sense, but if I don't increase heap space, it keeps telling me GC overhead limit exceeded. Thanks! Anny On Wed, Oct 1, 2014 at 1:41 PM, Liquan Pei [via Apache Spark User List] ml-node+s1001560n1554...@n3.nabble.com

Re: run scalding on spark

2014-10-01 Thread Matei Zaharia
Pretty cool, thanks for sharing this! I've added a link to it on the wiki: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects. Matei On Oct 1, 2014, at 1:41 PM, Koert Kuipers ko...@tresata.com wrote: well, sort of! we make input/output formats (cascading taps,

Re: Spark 1.1.0 hbase_inputformat.py not work

2014-10-01 Thread Gilberto Lira
Thank you Zhang! I am grateful for your help! 2014-10-01 14:05 GMT-03:00 Kan Zhang kzh...@apache.org: CC user@ for indexing. Glad you fixed it. All source code for these examples are under SPARK_HOME/examples. For example, the converters used here are in

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
I found the problem. I was manually constructing the CLASSPATH and SPARK_CLASSPATH because I needed jars for running the cassandra lib. For some reason that I cannot explain, it was this that was causing the issue. Maybe one of the jars had a log4j.properties rolled up in it? I removed almost

Creating a feature vector from text before using with MLLib

2014-10-01 Thread Soumya Simanta
I'm trying to understand the intuition behind the featurize method that Aaron used in one of his demos. I believe this feature will just work for detecting the character set (i.e., language used). Can someone help? def featurize(s: String): Vector = { val n = 1000 val result = new

Re: still GC overhead limit exceeded after increasing heap space

2014-10-01 Thread 陈韵竹
Thanks Sean. This is how I set this memory. I set it when I start to run the job java -Xms64g -Xmx64g -cp /root/spark/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/root/scala/lib/scala-library.jar:./target/MyProject.jar MyClass Is there some problem with it? On Wed, Oct 1, 2014 at 2:03 PM, Sean

Re: still GC overhead limit exceeded after increasing heap space

2014-10-01 Thread Liquan Pei
Can you use spark-submit to set the executor memory? Take a look at https://spark.apache.org/docs/latest/submitting-applications.html. Liquan On Wed, Oct 1, 2014 at 2:21 PM, 陈韵竹 anny9...@gmail.com wrote: Thanks Sean. This is how I set this memory. I set it when I start to run the job
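
For example, instead of passing -Xmx to a bare java command, the memory settings would normally be given to spark-submit; the class, master and jar below are placeholders:

    spark-submit \
      --class MyClass \
      --master spark://master-host:7077 \
      --driver-memory 8g \
      --executor-memory 12g \
      target/MyProject.jar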

Determining number of executors within RDD

2014-10-01 Thread Akshat Aranya
Hi, I want implement an RDD wherein the decision of number of partitions is based on the number of executors that have been set up. Is there some way I can determine the number of executors within the getPartitions() call?

Re: run scalding on spark

2014-10-01 Thread Koert Kuipers
thanks On Wed, Oct 1, 2014 at 4:56 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Pretty cool, thanks for sharing this! I've added a link to it on the wiki: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects . Matei On Oct 1, 2014, at 1:41 PM, Koert Kuipers

Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Liquan Pei
The program computes hashed bi-gram frequencies normalized by the total number of bi-grams, then filters out zero values. Hashing is an effective trick for vectorizing features. Take a look at http://en.wikipedia.org/wiki/Feature_hashing Liquan On Wed, Oct 1, 2014 at 2:18 PM, Soumya Simanta
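
A rough reconstruction of what such a featurize method typically does; the hash-space size of 1000 comes from the snippet quoted above, the rest is an assumption about the demo:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    def featurize(s: String): Vector = {
      val n = 1000                                        // fixed hashing space
      val counts = new Array[Double](n)
      s.sliding(2).foreach { bigram =>                    // character bi-grams
        val h = bigram.hashCode % n
        counts(if (h < 0) h + n else h) += 1.0
      }
      val total = math.max(1.0, (s.length - 1).toDouble)  // number of bi-grams in the string
      val nonZero = counts.zipWithIndex.collect { case (c, i) if c > 0.0 => (i, c / total) }
      Vectors.sparse(n, nonZero)                          // keep only the non-zero entries
    }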

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Burak Yavuz
Hi, It appears that the step size is so high that the model is diverging with the added noise. Could you try setting the step size to 0.1 or 0.01? Best, Burak - Original Message - From: Krishna Sankar ksanka...@gmail.com To: user@spark.apache.org Sent: Wednesday, October 1,
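
For reference, the step size can be passed straight to the SGD-based trainer; the input path and parsing below are placeholders:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val points = sc.textFile("hdfs:///data/points.csv").map { line =>   // hypothetical format: label,f1,f2,...
      val parts = line.split(",").map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }.cache()

    // A smaller step size keeps SGD from diverging on noisy data.
    val model = LinearRegressionWithSGD.train(points, numIterations = 100, stepSize = 0.01)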

Spark And Mapr

2014-10-01 Thread Addanki, Santosh Kumar
Hi We were using Horton 2.4.1 as our Hadoop distribution and now switched to MapR. Previously, to read a text file we would use: test = sc.textFile("hdfs://10.48.101.111:8020/user/hdfs/test") What would be the equivalent for MapR? Best Regards Santosh

Re: Task deserialization problem using 1.1.0 for Hadoop 2.4

2014-10-01 Thread Timothy Potter
Forgot to mention that I've tested that SerIntWritable and PipelineDocumentWritable are serializable by serializing / deserializing to/from a byte array in memory. On Wed, Oct 1, 2014 at 1:43 PM, Timothy Potter thelabd...@gmail.com wrote: I'm running into the following deserialization issue when

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
Thanks Burak. Step size 0.01 worked for b) and step=0.0001 for c) ! Cheers k/ On Wed, Oct 1, 2014 at 3:00 PM, Burak Yavuz bya...@stanford.edu wrote: Hi, It appears that the step size is too high that the model is diverging with the added noise. Could you try by setting the step size to

Re: Spark And Mapr

2014-10-01 Thread Vladimir Rodionov
There is doc on MapR: http://doc.mapr.com/display/MapR/Accessing+MapR-FS+in+Java+Applications -Vladimir Rodionov On Wed, Oct 1, 2014 at 3:00 PM, Addanki, Santosh Kumar santosh.kumar.adda...@sap.com wrote: Hi We were using Horton 2.4.1 as our Hadoop distribution and now switched to MapR

RE: Spark And Mapr

2014-10-01 Thread Addanki, Santosh Kumar
Hi We would like to do this in the PySpark environment, i.e. something like test = sc.textFile("maprfs:///user/root/test") or test = sc.textFile("hdfs:///user/root/test"). Currently when we try test = sc.textFile("maprfs:///user/root/test") it throws the error No File-System for scheme: maprfs Best

Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread Oshi
Excellent! Thanks Andy. I will give it a go. On Thu, Oct 2, 2014 at 12:42 AM, andy petrella [via Apache Spark User List] ml-node+s1001560n15487...@n3.nabble.com wrote: I'll try my best ;-). 1/ you could create a abstract type for the types (1 on top of Vs, 1 other on top of Es types) than

Re: Spark And Mapr

2014-10-01 Thread Sungwook Yoon
Yes.. you should use maprfs:// I personally haven't used pyspark, I just used the scala shell or standalone with MapR. I think you need to set the classpath right, adding a jar like /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar to the classpath. Sungwook On Wed, Oct 1,

Re: Spark And Mapr

2014-10-01 Thread Matei Zaharia
It should just work in PySpark, the same way it does in Java / Scala apps. Matei On Oct 1, 2014, at 4:12 PM, Sungwook Yoon sy...@maprtech.com wrote: Yes.. you should use maprfs:// I personally haven't used pyspark, I just used scala shell or standalone with MapR. I think you need to

Multiple spark shell sessions

2014-10-01 Thread Sanjay Subramanian
hey guys I am using spark 1.0.0+cdh5.1.0+41. When two users try to run spark-shell, the first user's spark-shell shows active in the 18080 Web UI but the second user's shows WAITING, and the shell has a bunch of errors but does get to the spark-shell prompt and sc.master seems to point to the correct master.

Spark inside Eclipse

2014-10-01 Thread Sanjay Subramanian
hey guys Is there a way to run Spark in local mode from within Eclipse? I am running Eclipse Kepler on a MacBook Pro with Mavericks. Like one can run hadoop map/reduce applications from within Eclipse and debug and learn. thanks sanjay

Re: Spark inside Eclipse

2014-10-01 Thread Ted Yu
Cycling bits: http://search-hadoop.com/m/JW1q5wxkXH/spark+eclipse&subj=Buidling+spark+in+Eclipse+Kepler On Wed, Oct 1, 2014 at 4:35 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote: hey guys Is there a way to run Spark in local mode from within Eclipse. I am running Eclipse

Re: Spark inside Eclipse

2014-10-01 Thread Ashish Jain
Hello Sanjay, This can be done, and is a very effective way to debug. 1) Compile and package your project to get a fat jar 2) In your SparkConf use setJars and give location of this jar. Also set your master here as local in SparkConf 3) Use this SparkConf when creating JavaSparkContext 4) Debug
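
A minimal sketch of steps 2 and 3 in Scala (the Java version with JavaSparkContext is analogous); the app name and jar path are placeholders, and setJars is only needed once the master points at a real cluster:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("eclipse-debug")
      .setMaster("local[*]")                               // run inside the IDE; breakpoints work normally
      .setJars(Seq("target/myproject-assembly-0.1.jar"))   // the fat jar from step 1

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())                // quick sanity check
    sc.stop()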

Re: timestamp not implemented yet

2014-10-01 Thread barge.nilesh
The Parquet format seems to be comparatively better for analytic loads; it has performance and compression benefits for large analytic workloads. A workaround could be to use a long datatype to store the epoch timestamp value. If you already have existing parquet files (impala tables) then you may need to

Re: Spark And Mapr

2014-10-01 Thread Surendranauth Hiraman
As Sungwook said, the classpath pointing to the mapr jar is the key for that error. MapR has a Spark install that hopefully makes it easier. I don't have the instructions handy but you can asking their support about it. -Suren On Wed, Oct 1, 2014 at 7:18 PM, Matei Zaharia

Re: Multiple spark shell sessions

2014-10-01 Thread Matei Zaharia
You need to set --total-executor-cores to limit how many total cores it grabs on the cluster. --executor-cores is just for each individual executor, but it will try to launch many of them. Matei On Oct 1, 2014, at 4:29 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey
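
Concretely, each user can cap the cores their shell grabs with something like (values illustrative):

    spark-shell \
      --master spark://master-host:7077 \
      --total-executor-cores 4 \
      --executor-memory 2g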

spark.cleaner.ttl

2014-10-01 Thread SK
Hi, I am using spark v 1.1.0. The default value of spark.cleaner.ttl is infinite as per the online docs. Since a lot of shuffle files are generated in /tmp/spark-local* and the disk is running out of space, we tested with a smaller value of ttl. However, even when the job has completed and the timer

Re: Multiple spark shell sessions

2014-10-01 Thread Sanjay Subramanian
Awesome, thanks a TON. It works. There is a clash in the UI port initially but it looks like it creates a second UI port at 4041 for the second user wanting to use the spark-shell: 14/10/01 17:34:38 INFO JettyUtils: Failed to create UI at port, 4040. Trying again. 14/10/01 17:34:38 INFO JettyUtils:

What can be done if a FlatMapFunctions generated more data that can be held in memory

2014-10-01 Thread Steve Lewis
A number of the problems I want to work with generate datasets which are too large to hold in memory. This becomes an issue when building a FlatMapFunction and also when the data used in combineByKey cannot be held in memory. The following is a simple, if a little silly, example of a

Help Troubleshooting Naive Bayes

2014-10-01 Thread Mike Bernico
Hi Everyone, I'm working on training mllib's Naive Bayes to classify TF/IDF vectoried docs using Spark 1.1.0. I've gotten this to work fine on a smaller set of data, but when I increase the number of vectorized documents I get hung up on training. The only messages I'm seeing are below. I'm

Re: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-01 Thread Michael Armbrust
You are likely running into SPARK-3708 https://issues.apache.org/jira/browse/SPARK-3708, which was fixed by #2594 https://github.com/apache/spark/pull/2594 this morning. On Wed, Oct 1, 2014 at 8:09 AM, tonsat ton...@gmail.com wrote: We have a configuration CDH5.0,Spark1.1.0(stand alone) and

Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Xiangrui Meng
Yes, the bigram in that demo only has two characters, which could separate different character sets. -Xiangrui On Wed, Oct 1, 2014 at 2:54 PM, Liquan Pei liquan...@gmail.com wrote: The program computes hashing bi-gram frequency normalized by total number of bigrams then filter out zero values.

Re: Help Troubleshooting Naive Bayes

2014-10-01 Thread Xiangrui Meng
The cost depends on the feature dimension, number of instances, number of classes, and number of partitions. Do you mind sharing those numbers? -Xiangrui On Wed, Oct 1, 2014 at 6:31 PM, Mike Bernico mike.bern...@gmail.com wrote: Hi Everyone, I'm working on training mllib's Naive Bayes to

Re: Print Decision Tree Models

2014-10-01 Thread Xiangrui Meng
Which Spark version are you using? It works in 1.1.0 but not in 1.0.0. -Xiangrui On Wed, Oct 1, 2014 at 2:13 PM, Jimmy McErlain ji...@sellpoints.com wrote: So I am trying to print the model output from MLlib however I am only getting things like the following:

Re: Print Decision Tree Models

2014-10-01 Thread Jimmy
Yeah I'm using 1.0.0 and thanks for taking the time to check! Sent from my iPhone On Oct 1, 2014, at 8:48 PM, Xiangrui Meng men...@gmail.com wrote: Which Spark version are you using? It works in 1.1.0 but not in 1.0.0. -Xiangrui On Wed, Oct 1, 2014 at 2:13 PM, Jimmy McErlain

Issue with Partitioning

2014-10-01 Thread Ankur Srivastava
Hi, I am using custom partitioner to partition my JavaPairRDD where key is a String. I use hashCode of the sub-string of the key to derive the partition index but I have noticed that my partition contains keys which have a different partitionIndex returned by the partitioner. Another issue I am
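
If the substring's hashCode is negative, a plain modulo yields a negative index, which is one way keys can end up where you don't expect them. A minimal partitioner sketch that guards against that; the 3-character prefix is only an example:

    import org.apache.spark.Partitioner

    class PrefixPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = {
        val h = key.toString.take(3).hashCode % numPartitions   // hash of the key's sub-string
        if (h < 0) h + numPartitions else h                     // hashCode can be negative; keep the index in range
      }
      override def equals(other: Any): Boolean = other match {
        case p: PrefixPartitioner => p.numPartitions == numPartitions
        case _ => false
      }
    }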