Re: Fwd: Model weights of linear regression become abnormal values

2015-05-29 Thread Petar Zecevic
You probably need to scale the values in the data set so that they are all of comparable ranges and translate them so that their means get to 0. You can use pyspark.mllib.feature.StandardScaler(True, True) object for that. On 28.5.2015. 6:08, Maheshakya Wijewardena wrote: Hi, I'm trying
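The suggestion above is to standardize each feature. As a plain-Python illustration of what `StandardScaler(True, True)` (withMean, withStd) does to each column — subtract the column mean, divide by the sample standard deviation — here is a small sketch; in Spark you would instead call `pyspark.mllib.feature.StandardScaler(True, True).fit(rdd)` on an RDD of vectors:

```python
import math

def standardize(rows):
    """Standardize each column: subtract its mean, divide by its sample std.
    Mirrors the effect of StandardScaler(withMean=True, withStd=True)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in rows]

# Two features with very different ranges end up on a comparable scale:
scaled = standardize([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
```

After scaling, both columns have mean 0 and unit variance, which keeps the regression weights in a sane range.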

Re: Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-29 Thread trackissue121
I had already tested query in Hive CLI and it works fine. Same query shows error in Spark SQL. On May 29, 2015 4:14 AM, ayan guha guha.a...@gmail.com wrote: Probably a naive question: can you try the same in hive CLI and see if your SQL is working? Looks like hive thing to me as spark is

Re: Adding an indexed column

2015-05-29 Thread Wesley Miao
One way I can see is to - 1. get rdd from your df 2. call rdd.zipWithIndex to get a new rdd 3. turn your new rdd to a new df On Fri, May 29, 2015 at 5:43 AM, Cesar Flores ces...@gmail.com wrote: Assuming that I have the next data frame: flag | price -- 1

Re: Spark Streaming and Drools

2015-05-29 Thread Antonio Giambanco
Hi all, I wrote a simple rule (Drools) and I'm trying to fire it, but when I call fireAllRules nothing happens and no exceptions are thrown. . . do I need to set up any configuration? Thanks A G 2015-05-22 12:22 GMT+02:00 Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com: Hi, Sometime back I played with

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-29 Thread mélanie gallois
When will Spark 1.4 be available exactly? To answer "Model selection can be achieved through a high lambda, resulting in lots of zeros in the coefficients": do you mean that passing a high lambda as a parameter of the logistic regression keeps only a few significant variables and drops the others
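The effect being discussed — a larger lambda zeroing out more coefficients — assumes an L1 (lasso-style) penalty. A minimal sketch of why that happens is the coordinate-wise soft-thresholding step used by L1 solvers (illustration only, not MLlib's actual optimizer):

```python
def soft_threshold(w, lam):
    """Proximal step for the L1 penalty: shrink each weight toward 0 by lam,
    and clip to exactly 0 if its magnitude is below lam."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

# Hypothetical weights; a bigger lambda zeroes more of them.
weights = [0.05, -0.4, 1.3, 0.02]
small_lam = [soft_threshold(w, 0.1) for w in weights]  # fewer zeros
large_lam = [soft_threshold(w, 0.5) for w in weights]  # more zeros
```

This is the mechanism behind "high lambda keeps only a few significant variables": weights whose contribution doesn't outweigh the penalty collapse to exactly zero.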

Spark Executor Memory Usage

2015-05-29 Thread Valerii Moisieienko
Hello! My name is Valerii. I have noticed strange memory behaviour of Spark's executor on my cluster. The cluster works in standalone mode with 3 workers. The application runs in cluster mode. From the topology configuration: spark.executor.memory 1536m. I checked heap usage via JVisualVM:

Re: Exception writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-05-29 Thread Yana Kadiyska
are you able to connect to your cassandra installation via cassandra_home$ ./bin/cqlsh This exception generally means that your cassandra instance is not reachable/accessible On Fri, May 29, 2015 at 6:11 AM, Antonio Giambanco antogia...@gmail.com wrote: Hi all, I have in a single server

Re: Batch aggregation by sliding window + join

2015-05-29 Thread Igor Berman
Hi Ayan, thanks for the response. I'm using 1.3.1. I'll check window queries (I don't use Spark SQL... only core; maybe I should?). What do you mean by materialized? I can repartitionAndSort the daily aggregation by key, however I don't quite understand how it will help with yesterday's block which needs

dataframe cumulative sum

2015-05-29 Thread Cesar Flores
What would be the most appropriate method to add a cumulative sum column to a data frame? For example, assuming that I have the next data frame: flag | price -- 1|47.808764653746 1|47.808764653746 1|31.9869279512204 How can I create a data frame with an extra

Re: [Streaming] Configure executor logging on Mesos

2015-05-29 Thread Gerard Maas
Hi Tim, Thanks for the info. We (Andy Petrella and myself) have been diving a bit deeper into this log config: The log line I was referring to is this one (sorry, I provided the others just for context) *Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties* That

Re: Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-29 Thread Chen Song
Regarding the build itself, hadoop-2.6 is not even a valid profile. I got the following WARNING for my build. [WARNING] The requested profile hadoop-2.6 could not be activated because it does not exist. Chen On Fri, May 29, 2015 at 2:38 AM, trackissue121 trackissue...@gmail.com wrote: I had

Python implementation of RDD interface

2015-05-29 Thread Sven Kreiss
I wanted to share a Python implementation of RDDs: pysparkling. http://trivial.io/post/120179819751/pysparkling-is-a-native-implementation-of-the The benefit is that you can apply the same code that you use in PySpark on large datasets in pysparkling on small datasets or single documents. When

Re: Spark SQL v MemSQL/Voltdb

2015-05-29 Thread Conor Doherty
Hi Ashish, Transactions are a big difference between Spark SQL and MemSQL/VoltDB, but there are other differences as well. I'm not an expert on Volt, but another difference between Spark SQL and MemSQL is that DataFrames do not support indexes and MemSQL tables do. This will have implications for

Re: Spark Executor Memory Usage

2015-05-29 Thread Ted Yu
For #2, see http://unix.stackexchange.com/questions/65835/htop-reporting-much-higher-memory-usage-than-free-or-top Cheers On Fri, May 29, 2015 at 6:56 AM, Valerii Moisieienko valeramoisee...@gmail.com wrote: Hello! My name is Valerii. I have noticed strange memory behaviour of Spark's

Re: Batch aggregation by sliding window + join

2015-05-29 Thread ayan guha
My point is that if you keep the daily aggregates already computed, then you do not reprocess raw data. But yes, you may decide to recompute the last 3 days every day. On 29 May 2015 23:52, Igor Berman igor.ber...@gmail.com wrote: Hi Ayan, thanks for the response I'm using 1.3.1. I'll check window queries(I

Re: SparkR Jobs Hanging in collectPartitions

2015-05-29 Thread Eskilson,Aleksander
Sure. Looking more closely at the code, I thought I might have had an error in the flow of data structures in the R code. The line that extracts the words from the corpus is now: words <- distinct(SparkR:::flatMap(corpus, function(line) { strsplit(gsub("^\\s+|[[:punct:]]", "", tolower(line)),

Re: spark java.io.FileNotFoundException: /user/spark/applicationHistory/application

2015-05-29 Thread igor.berman
in yarn your executors might run on any node in your cluster, so you need to configure the Spark history log directory to be on HDFS (so it will be accessible to every executor). Probably you've switched from local to yarn mode when submitting.
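A hedged example of the relevant settings in conf/spark-defaults.conf — the namenode host, port, and HDFS path below are placeholders to adapt to your cluster:

```properties
# Write application event logs to a location every node can reach.
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8020/user/spark/applicationHistory
# Point the history server at the same shared directory.
spark.history.fs.logDirectory    hdfs://namenode:8020/user/spark/applicationHistory
```

With a local filesystem path instead, each executor writes to its own node's disk and the history server on the driver host never sees those files, which matches the FileNotFoundException in the subject.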

Re: Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-29 Thread Alex Robbins
I've gotten that error when something is trying to use a different version of protobuf than you want. Maybe check out a `mvn dependency:tree` to see if someone is trying to use something other than libproto 2.5.0. (At least, 2.5.0 was current when I was having the problem) On Fri, May 29, 2015 at
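A sketch of the check being suggested — restricting the dependency tree to protobuf artifacts so version conflicts stand out (the groupId/artifactId filter is the usual one for protobuf, but verify it matches your build):

```shell
# Show only protobuf entries in the dependency tree; any version other
# than 2.5.0 in the output is a candidate for the getUnknownFields clash.
mvn dependency:tree -Dincludes=com.google.protobuf:protobuf-java
```

If a different version shows up, an `<exclusion>` on the offending dependency (or a dependencyManagement pin to 2.5.0) is the usual fix.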

Format RDD/SchemaRDD contents to screen?

2015-05-29 Thread Minnow Noir
I'm trying to debug query results inside spark-shell, but I find it cumbersome to save to file and then use file system utils to explore the results, and .foreach(print) tends to interleave the results among the myriad log messages. take() and collect() truncate. Is there a simple way to present

Re: Is anyone using Amazon EC2? (second attempt!)

2015-05-29 Thread roni
Hi , Any update on this? I am not sure if the issue I am seeing is related .. I have 8 slaves and when I created the cluster I specified ebs volume with 100G. I see on Ec2 8 volumes created and each attached to the corresponding slave. But when I try to copy data on it , it complains that

spark-sql errors

2015-05-29 Thread Sanjay Subramanian
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/6SqGuYemnbc

Re: Python implementation of RDD interface

2015-05-29 Thread Davies Liu
DPark can also work on localhost without a Mesos cluster (single thread or multiple processes). I also think that running PySpark without the JVM in local mode will help development, so pysparkling and DPark are both useful. On Fri, May 29, 2015 at 1:36 PM, Sven Kreiss s...@svenkreiss.com wrote: I

Re: Is anyone using Amazon EC2? (second attempt!)

2015-05-29 Thread Sanjay Subramanian
I use Spark on EC2 but it's a CDH 5.3.3 distribution (starving developer version) installed through Cloudera Manager. Spark is configured to run on Yarn. Regards Sanjay Sent from my iPhone On May 29, 2015, at 6:16 PM, roni roni.epi...@gmail.com wrote: Hi , Any update on this? I am not

Re: Format RDD/SchemaRDD contents to screen?

2015-05-29 Thread ayan guha
Depending on your Spark version, you can convert the SchemaRDD to a DataFrame and then use .show() On 30 May 2015 10:33, Minnow Noir minnown...@gmail.com wrote: Im trying to debug query results inside spark-shell, but finding it cumbersome to save to file and then use file system utils to explore
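For readers without a recent Spark at hand, the kind of aligned-table output that .show() produces can be sketched in plain Python (illustration only; the function name and layout are made up, not Spark's implementation):

```python
def show(rows, columns):
    """Render row tuples as an aligned text table, roughly like DataFrame.show()."""
    table = [columns] + [[str(v) for v in row] for row in rows]
    widths = [max(len(r[i]) for r in table) for i in range(len(columns))]
    sep = "+" + "+".join("-" * (w + 2) for w in widths) + "+"
    def line(r):
        return "|" + "|".join(" %s " % v.ljust(w) for v, w in zip(r, widths)) + "|"
    return "\n".join([sep, line(table[0]), sep]
                     + [line(r) for r in table[1:]] + [sep])

out = show([(1, 47.8), (1, 31.9)], ["flag", "price"])
```

Printing `out` gives a bordered table with a header row, which avoids both the log interleaving of .foreach(print) and the manual save-to-file round trip.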

RE: Official Docker container for Spark

2015-05-29 Thread Tridib Samanta
Thanks all for your replies. I was evaluating which one fits best for me. I picked epahomov/docker-spark from the docker registry and it suffices my needs. Thanks Tridib Date: Fri, 22 May 2015 14:15:42 +0530 Subject: Re: Official Docker container for Spark From: riteshoneinamill...@gmail.com To:

Security,authorization and governance

2015-05-29 Thread Phani Yadavilli -X (pyadavil)
Hi Team, Is there any open-source framework/tool for providing security, authorization, and data governance for Spark? Regards Phani Kumar

Re: dataframe cumulative sum

2015-05-29 Thread Yin Huai
Hi Cesar, We just added it in Spark 1.4. In Spark 1.4, you can use window functions in HiveContext to do it. Assuming you want to calculate the cumulative sum for every flag, import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions._ df.select($"flag", $"price",
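For readers on versions before 1.4, the computation that window expression performs — a running sum partitioned by flag — looks like this in plain Python (a sketch of the semantics only, not a Spark API):

```python
from collections import defaultdict

def cumulative_sum(rows):
    """rows: list of (flag, price) tuples. Returns (flag, price, cum_price)
    where cum_price is a running sum computed separately per flag,
    in input order (the window's ORDER BY stands in for that order)."""
    running = defaultdict(float)
    out = []
    for flag, price in rows:
        running[flag] += price
        out.append((flag, price, running[flag]))
    return out

result = cumulative_sum([(1, 10.0), (1, 20.0), (2, 5.0), (1, 30.0)])
```

Each flag accumulates independently, which is exactly what `Window.partitionBy("flag")` with an ordered running-sum frame expresses in the 1.4 API.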

Re: SparkR Jobs Hanging in collectPartitions

2015-05-29 Thread Shivaram Venkataraman
For jobs with R UDFs (i.e. when we use the RDD API from SparkR) we use R on both the driver side and on the worker side. So in this case when the `flatMap` operation is run, the data is sent from the JVM to an R process on the worker which in turn executes the `gsub` function. Could you turn on

Re: Python implementation of RDD interface

2015-05-29 Thread Sven Kreiss
I have to admit that I never ran DPark. I think the goals are very different. The purpose of pysparkling is not to reproduce Spark on a cluster, but to have a lightweight implementation with the same interface to run locally or on an API server. I still run PySpark on a cluster to preprocess a

Re: Python implementation of RDD interface

2015-05-29 Thread Davies Liu
There is another implementation of RDD interface in Python, called DPark [1], Could you have a few words to compare these two? [1] https://github.com/douban/dpark/ On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss s...@svenkreiss.com wrote: I wanted to share a Python implementation of RDDs:

Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-05-29 Thread Mohammed Guller
Hi - We have successfully integrated Spark SQL with Cassandra. We have a backend that provides a REST API that allows users to execute SQL queries on data in C*. Now we would like to also support JDBC/ODBC connectivity, so that users can use tools like Tableau to query data in C* through the

Re: Exception writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-05-29 Thread Antonio Giambanco
Sure I can, everything is on localhost . . . . it only happens when I want to write to two or more tables in the same schema A G 2015-05-29 16:10 GMT+02:00 Yana Kadiyska yana.kadiy...@gmail.com: are you able to connect to your cassandra installation via cassandra_home$ ./bin/cqlsh This