Re: Spark Metrics: custom source/sink configurations not getting recognized

2016-09-06 Thread map reduced
Hi, does anyone have any ideas, please? On Mon, Sep 5, 2016 at 8:30 PM, map reduced wrote: > Hi, > > I've written my custom metrics source/sink for my Spark streaming app and > I am trying to initialize it from metrics.properties - but that doesn't > work from executors. I don't

Re: distribute work (files)

2016-09-06 Thread ayan guha
To access a local file, try a file:// URI. On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi wrote: > This is a great question. Basically you don't have to worry about the > details -- just give a wildcard in your call to textFile. See the Programming > Guide
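A minimal sketch of both suggestions (the paths here are hypothetical):

    // Local files addressed with a file:// URI and a wildcard;
    // the files must exist at this path on every worker node
    val lines = sc.textFile("file:///data/input/*.txt")

    // The same wildcard form works for distributed storage
    val hdfsLines = sc.textFile("hdfs:///data/input/part-*")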

Re: Getting memory error when starting spark shell but not often

2016-09-06 Thread Terry Hoo
Maybe there is not enough contiguous memory (10G?) on your host. Regards, - Terry On Wed, Sep 7, 2016 at 10:51 AM, Divya Gehlot wrote: > Hi, > I am using EMR 4.7 with Spark 1.6 > Sometimes when I start the spark shell I get the error below > > OpenJDK 64-Bit Server VM warning: INFO:

Getting memory error when starting spark shell but not often

2016-09-06 Thread Divya Gehlot
Hi, I am using EMR 4.7 with Spark 1.6. Sometimes when I start the spark shell I get the error below: OpenJDK 64-Bit Server VM warning: INFO: > os::commit_memory(0x0005662c, 10632822784, 0) failed; error='Cannot > allocate memory' (errno=12) > # > # There is insufficient memory for the Java

Re: Is it possible to submit Spark Application remotely?

2016-09-06 Thread tosaigan...@gmail.com
You can use Livy to submit Spark jobs remotely: http://gethue.com/how-to-use-the-livy-spark-rest-job-server-api-for-submitting-batch-jar-python-and-streaming-spark-jobs/ Regards, Sai Ganesh On Tue, Sep 6, 2016 at 1:24 PM, neil90 [via Apache Spark User List] <
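For reference, Livy's batch endpoint takes a POST with the jar and main class; a hedged sketch (host, jar path and class name are hypothetical):

    curl -X POST -H "Content-Type: application/json" \
      -d '{"file": "hdfs:///jars/my-app.jar", "className": "com.example.MyApp"}' \
      http://livy-host:8998/batches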

Re: distribute work (files)

2016-09-06 Thread Peter Figliozzi
This is a great question. Basically you don't have to worry about the details-- just give a wildcard in your call to textFile. See the Programming Guide section entitled "External Datasets". The Spark framework will distribute your

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-06 Thread Evan Zamir
I am using the default setting for *fitIntercept*, which *should* be TRUE, right? On Tue, Sep 6, 2016 at 1:38 PM Sean Owen wrote: > Are you not fitting an intercept / regressing through the origin? With > that constraint it's no longer true that R^2 is necessarily >

Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-06 Thread Yong Zhang
This is an interesting point. I tested with the original data on the Spark 2.0 release, and I get the same timing output as in the original email, like the following: 50 1.77695393562 51 0.695149898529 52 0.638142108917 53 0.647341966629 54 0.663456916809 55 0.629166126251 56 0.644149065018 57

Re: Q: Multiple spark streaming app, one kafka topic, same consumer group

2016-09-06 Thread Cody Koeninger
In general, see the material linked from https://github.com/koeninger/kafka-exactly-once if you want a better understanding of the direct stream. For spark-streaming-kafka-0-8, the direct stream doesn't really care about consumer group, since it uses the simple consumer. For the 0.10 version,
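As a rough sketch of where group.id enters the 0.10 integration (broker, topic and group names are hypothetical, and ssc is an existing StreamingContext):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-consumer-group",          // honored by the new consumer in 0.10
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))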

Difference between UDF and Transformer in Spark ML

2016-09-06 Thread janardhan shetty
Apart from creating a new column, what are the other differences between a Transformer and a UDF in Spark ML?

Q: Multiple spark streaming app, one kafka topic, same consumer group

2016-09-06 Thread Mariano Semelman
Hello everybody, I am trying to understand how Kafka Direct Stream works. I'm interested in having a production-ready Spark Streaming application that consumes a Kafka topic. But I need to guarantee there's (almost) no downtime, especially during deploys (and submits) of new versions. What it seems

Re: Spark ML 2.1.0 new features

2016-09-06 Thread janardhan shetty
Thanks Jacek. On Tue, Sep 6, 2016 at 1:44 PM, Jacek Laskowski wrote: > Hi, > > https://issues.apache.org/jira/browse/SPARK-17363?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0%20AND%20component%20%3D%20MLlib > > Pozdrawiam, > Jacek Laskowski > >

Re: Spark ML 2.1.0 new features

2016-09-06 Thread Jacek Laskowski
Hi, https://issues.apache.org/jira/browse/SPARK-17363?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0%20AND%20component%20%3D%20MLlib Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-06 Thread Sean Owen
Are you not fitting an intercept / regressing through the origin? With that constraint it's no longer true that R^2 is necessarily nonnegative. It basically means that the errors are even bigger than what you'd get by predicting the data's mean value as a constant model. On Tue, Sep 6, 2016 at
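For reference, R^2 = 1 - SS_res / SS_tot, where SS_tot = Σ(y_i - ȳ)^2 is measured against the data's mean. A fit constrained through the origin can have SS_res > SS_tot (i.e., worse than the constant-mean model), which makes R^2 negative.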

Re: Spark transformations

2016-09-06 Thread janardhan shetty
I noticed a few things about Spark Transformers and just wanted to be clear. Unary transformer: createTransformFunc: IN => OUT = { *item* => } Here *item* is a single element and *NOT* the entire column. I would like to get the number of elements in that particular column. Since there is *no forward
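For context, a minimal sketch of a UnaryTransformer, whose createTransformFunc is applied per element rather than per column (the class name here is hypothetical):

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{DataType, IntegerType}

    class LengthTransformer(override val uid: String)
        extends UnaryTransformer[String, Int, LengthTransformer] {
      def this() = this(Identifiable.randomUID("length"))
      // Item-level function: sees one value of the input column at a time
      override protected def createTransformFunc: String => Int = _.length
      override protected def outputDataType: DataType = IntegerType
    }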

Getting figures from spark streaming

2016-09-06 Thread Ashok Kumar
=== 3 20160906-212509 80.224686 (null,3,20160906-212509,80.22468448052631637099) (null,1,20160906-212509,60.40695324215582386153) (null,4,20160906-212509,61.95159400693415572125) (null,2,20160906-212509,93.05912099305473237788) (null,5,20160906-212509,81.08637370113427387121) Now it does process the first values 3, 2016

Re: Spark ML 2.1.0 new features

2016-09-06 Thread janardhan shetty
Any links ? On Mon, Sep 5, 2016 at 1:50 PM, janardhan shetty wrote: > Is there any documentation or links on the new features which we can > expect for Spark ML 2.1.0 release ? >

Re: Is it possible to submit Spark Application remotely?

2016-09-06 Thread neil90
You need to pass --deploy-mode cluster to spark-submit; this will run the driver on the cluster rather than locally on your computer.
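For example (master URL, class and jar are hypothetical):

    spark-submit --master spark://master-host:7077 \
      --deploy-mode cluster \
      --class com.example.MyApp my-app.jar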

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-06 Thread Nick Pentreath
That does seem strange. Can you provide an example to reproduce? On Tue, 6 Sep 2016 at 21:49 evanzamir wrote: > Am I misinterpreting what r2() in the LinearRegression model summary means? > By definition, R^2 should never be a negative number!

I noticed LinearRegression sometimes produces negative R^2 values

2016-09-06 Thread evanzamir
Am I misinterpreting what r2() in the LinearRegression model summary means? By definition, R^2 should never be a negative number!

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-06 Thread Cody Koeninger
You don't want auto.offset.reset on executors; you want executors to do what the driver told them to do. Otherwise you're going to get really horrible data inconsistency issues if the executors silently reset. If your retention is so low that it expires in between when the driver

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-06 Thread Srikanth
This isn't a production setup. We kept retention low intentionally. My original question was: why did I get the exception instead of it falling back to auto.offset.reset on restart? On Tue, Sep 6, 2016 at 10:48 AM, Cody Koeninger wrote: > If you leave enable.auto.commit set to true, it

Complex RDD operation as DataFrame UDF ?

2016-09-06 Thread Thunder Stumpges
Hi guys, Spark 1.6.1 here. I am trying to "DataFrame-ize" a complex function I have that currently operates on a DataSet, and returns another DataSet with a new "column" added to it. I'm trying to fit this into the new ML "Model" format where I can receive a DataFrame, ensure the input column

Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-06 Thread Davies Liu
I think the slowness is caused by the generated aggregate method having more than 8K bytecodes; then it's not JIT compiled and becomes much slower. Could you try disabling DontCompileHugeMethods with: -XX:-DontCompileHugeMethods On Mon, Sep 5, 2016 at 4:21 AM, Сергей Романов
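If anyone wants to reproduce this, the JVM flag can be passed through the executor and driver options, e.g. (class and jar are hypothetical):

    spark-submit \
      --conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods" \
      --conf "spark.driver.extraJavaOptions=-XX:-DontCompileHugeMethods" \
      --class com.example.MyApp my-app.jar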

Re: Using spark package XGBoost

2016-09-06 Thread janardhan shetty
Is this merged into Spark ML? If so, which version? On Tue, Sep 6, 2016 at 12:58 AM, Takeshi Yamamuro wrote: > Hi, > > Sorry to bother you, but I'd like to inform you of our activities. > We'll start incubating our product, Hivemall, in Apache and this is a > scalable ML

Datasets and Partitioners

2016-09-06 Thread Darin McBeath
How do you find the partitioner for a Dataset? I have a Dataset (om) which I created and repartitioned using one of the fields (docId). Reading the documentation, I would assume the om Dataset should be hash partitioned. But how can I verify this? When I do om.rdd.partitioner I get
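For what it's worth, a hedged sketch of what can be inspected in Spark 2.0 (om as in the question):

    import org.apache.spark.sql.functions.col

    // Hash-partitions by the docId expression at the query-plan level
    val om2 = om.repartition(col("docId"))

    // The underlying RDD typically reports no Partitioner, because Dataset
    // partitioning lives in the physical plan, not in rdd.partitioner
    println(om2.rdd.partitioner)       // usually None
    println(om2.rdd.getNumPartitions)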

Spark 1.6.0 web console shows a running application in a "waiting" status, but it's actually running. Is this an existing bug?

2016-09-06 Thread sarlindo
I have 2 questions/issues. 1. We had the spark-master shut down (reason unknown). We looked at the spark-master logs and it simply shows this; is there some other log I should be looking at to find out why the master went down? 16/09/05 21:10:00 INFO ClientCnxn: Opening socket connection to

Re: YARN memory overhead settings

2016-09-06 Thread Marcelo Vanzin
It kinda depends on the application. Certain compression libraries, in particular, are kinda lax with their use of off-heap buffers, so if you configure executors to use many cores you might end up with higher usage than the default configuration. Then there are also things like PARQUET-118. In

Spray Client VS PlayWS vs Spring RestTemplate within Spark Job

2016-09-06 Thread prosp4300
Hi, Spark users. As I know, Spray Client depends on an Akka ActorSystem; does this dependency theoretically mean it is not possible to use spray-client in a Spark job run from Spark executor nodes? I believe PlayWS should work as a RESTful client run from Spark executors; how about

YARN memory overhead settings

2016-09-06 Thread Tim Moran
Hi, I'm running a Spark job on YARN, using 6 executors, each with 25 GB of memory and spark.yarn.executor.memoryOverhead set to 5 GB. Despite this, I still seem to see YARN killing my executors for exceeding the memory limit. Reading the docs, it looks like the overhead defaults to around 10% of the
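For reference, the overhead setting takes megabytes in Spark 1.x; a sketch of how it would be passed (the jar is hypothetical):

    spark-submit \
      --num-executors 6 \
      --executor-memory 25g \
      --conf spark.yarn.executor.memoryOverhead=5120 \
      my-app.jar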

anyone know what the status of spark-ec2 is?

2016-09-06 Thread Andy Davidson
spark-ec2 used to be part of the Spark distribution. It now seems to have been split into a separate repo: https://github.com/amplab/spark-ec2 It does not seem to be listed on https://spark-packages.org/ Does anyone know what the status is? There is a readme.md, however I am unable to find any release

spark 1.6.0 web console shows running application in a "waiting" status, but it's actually running

2016-09-06 Thread sarlindo
I have 2 questions/issues. 1. We had the spark-master shut down (reason unknown). We looked at the spark-master logs and it simply shows this; is there some other log I should be looking at to find out why the master went down? 16/09/05 21:10:00 INFO ClientCnxn: Opening socket connection to

distribute work (files)

2016-09-06 Thread Lydia Ickler
Hi, maybe this is a stupid question: I have a list of files. Each file I want to take as an input for an ML algorithm. All files are independent of one another. My question now is how do I distribute the work so that each worker takes a block of files and just runs the algorithm on them one by

Re: Reset auto.offset.reset in Kafka 0.10 integ

2016-09-06 Thread Cody Koeninger
If you leave enable.auto.commit set to true, it will commit offsets to kafka, but you will get undefined delivery semantics. If you just want to restart from a fresh state, the easiest thing to do is use a new consumer group name. But if that keeps happening, you should look into why your
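For context, the documented alternative to enable.auto.commit in the 0.10 integration is committing offsets yourself after processing; a minimal sketch (stream as created by KafkaUtils.createDirectStream):

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch first ...
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }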

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-06 Thread Peter Figliozzi
Hi Yan, I think you'll have to map the features column to a new numerical features column. Here's one way to do the individual transform: scala> val x = "[1, 2, 3, 4, 5]" x: String = [1, 2, 3, 4, 5] scala> val y:Array[Int] = x slice(1, x.length - 1) replace(",", "") split(" ") map(_.toInt) y:
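To apply that per-element transform across the whole column, a sketch using a UDF that produces an ML Vector (assuming Spark 2.0's ml.linalg, a SparkSession named spark, and samples as in the question):

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.functions.udf
    import spark.implicits._

    val toVec = udf { s: String =>
      // "[1, 2, 3]" -> Vectors.dense(1.0, 2.0, 3.0)
      Vectors.dense(s.stripPrefix("[").stripSuffix("]").split(",").map(_.trim.toDouble))
    }
    val withVec = samples.withColumn("featuresVec", toVec($"features"))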

Total memory of workers

2016-09-06 Thread tan shai
Hello, can anyone explain to me the behavior of Spark if the size of the processed file is greater than the total memory available on the workers? Many thanks.

Re: clear steps for installation of spark, cassandra and cassandra connector to run on spyder 2.3.7 using python 3.5 and anaconda 2.4 ipython 4.0

2016-09-06 Thread ayan guha
Spark has pretty extensive documentation; that should be your starting point. I do not use Cassandra much, but the Cassandra connector should be a Spark package, so look on the Spark Packages website. If I may say so, all docs should be one or two Google searches away :) On 6 Sep 2016 20:34, "muhammet

Re: [Spark submit] getting error when use properties file parameter in spark submit

2016-09-06 Thread Divya Gehlot
Yes, I am reading from an s3 bucket. Strangely, the error goes away when I remove the properties file parameter. On Sep 6, 2016 8:35 PM, "Sonal Goyal" wrote: > Looks like a classpath issue - Caused by: java.lang.ClassNotFoundException: > com.amazonaws.services.s3.AmazonS3 >

Re: [Spark submit] getting error when use properties file parameter in spark submit

2016-09-06 Thread Sonal Goyal
Looks like a classpath issue - Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.s3.AmazonS3 Are you using S3 somewhere? Are the required jars in place? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World

LabeledPoint creation

2016-09-06 Thread Madabhattula Rajesh Kumar
Hi, I am new to Spark ML and trying to create a LabeledPoint from a categorical dataset (example code from Spark). For this, I am using the one-hot encoding feature. Below is my code: val df = sparkSession.createDataFrame(Seq( (0, "a"), (1, "b"), (2,
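For context, the usual shape of that example (column names assumed from the Spark docs):

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // First map the string category to a numeric index...
    val indexer = new StringIndexer()
      .setInputCol("category").setOutputCol("categoryIndex")
    val indexed = indexer.fit(df).transform(df)

    // ...then one-hot encode the index into a sparse vector
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex").setOutputCol("categoryVec")
    val encoded = encoder.transform(indexed)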

RE: Spark 2.0.0 Thrift Server problem with Hive metastore

2016-09-06 Thread Campagnola, Francesco
The same error occurs when executing any “explain” command: 0: jdbc:hive2://spark-test:1> explain select 1 as id; java.lang.IllegalStateException: Can't overwrite cause with java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to

Spark Checkpoint for JDBC/ODBC

2016-09-06 Thread Selvam Raman
Hi, I need your input to make a decision. We have a number of databases (i.e. Oracle, MySQL, etc.). I want to read data from these sources, but how is fault tolerance maintained on the source side? If a source-side system goes down, how does Spark read the data? -- Selvam Raman

[Spark submit] getting error when use properties file parameter in spark submit

2016-09-06 Thread Divya Gehlot
Hi, I am getting the error below if I try to use the properties file parameter in spark-submit: Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated at

[Spark-Submit:]Error while reading from s3n

2016-09-06 Thread Divya Gehlot
Hi, I am on EMR 4.7 with Spark 1.6.1. I am trying to read from s3n buckets in Spark. Option 1: If I set up hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem") hadoopConf.set("fs.s3.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY")) hadoopConf.set("fs.s3.awsAccessKeyId",
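For comparison, a hedged sketch of the s3n variant of that configuration (bucket and path are hypothetical):

    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
    val data = sc.textFile("s3n://my-bucket/path/input.csv")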

Re: Scala Vs Python

2016-09-06 Thread 刘虓
Hi, I have been using spark-sql with Python for more than one year, from version 1.5.0 to 2.0.0. It works great so far; the performance is always great, though I have not done benchmarks yet. Also, I have skimmed through the source code of the Python API; most of it only calls the Scala API, nothing heavy is

Re: Scala Vs Python

2016-09-06 Thread Leonard Cohen
Hi Spark users, IMHO, I use the application language that aligns with the language the system was designed in. If working on Spark, I choose Scala. If working on Hadoop, I choose Java. If working on nothing, I use Python. Why? Because it will save my life, just kidding. Best

clear steps for installation of spark, cassandra and cassandra connector to run on spyder 2.3.7 using python 3.5 and anaconda 2.4 ipython 4.0

2016-09-06 Thread muhammet pakyürek
Could you send me documents and links covering all of the above requirements: installation of Spark, Cassandra, and the Cassandra connector, to run on Spyder 2.3.7 using Python 3.5 and Anaconda 2.4 / IPython 4.0?

Re: How to convert String to Vector ?

2016-09-06 Thread Leonard Cohen
Hi, in Scala: map(feature => feature.split(',')). In Python: list(string.split(',')) or eval(string). http://stackoverflow.com/questions/31376574/spark-rddstring-string-into-rddmapstring-string -- Original -- From: "Yan Facai"; Send

How to convert String to Vector ?

2016-09-06 Thread Yan Facai
Hi, I have a csv file like: uid mid features label 1235231 [0, 1, 3, ...] True Both "features" and "label" columns are used for GBTClassifier. However, when I read the file: Dataset<Row> samples = sparkSession.read().csv(file); The type of samples.select("features") is

RE: How to make the result of sortByKey distributed evenly?

2016-09-06 Thread AssafMendelson
I imagine this is a simplified example to explain a bigger concern. In general, when you do a sort by key, it will implicitly shuffle the data by the key. Since you have one key (0) holding nearly all the records and the other with just 1 record, it will simply shuffle it into two very skewed partitions. One way you can

Re: How to make the result of sortByKey distributed evenly?

2016-09-06 Thread Fridtjof Sander
Your data has only two keys, and basically all values are assigned to only one of them. There is no better way to distribute the keys than the one Spark executes. What you have to do is use different keys to sort and range-partition on. Try to invoke sortBy() on a non-pair RDD. This will
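A quick sketch of that suggestion, using n and the output path from the original question below; sortBy sorts by value and range-partitions internally without the skewed key:

    val sorted = sc.parallelize(2 to n).sortBy(identity)
    sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")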

How to make the result of sortByKey distributed evenly?

2016-09-06 Thread Zhang, Liyun
Hi all: I have a question about RDD.sortByKey: val n=2 val sorted=sc.parallelize(2 to n).map(x=>(x/n,x)).sortByKey() sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest") sc.parallelize(2 to n).map(x=>(x/n,x)) will generate pairs like [(0,2),(0,3),...,(0,1),(1,2)], the

Re: Spark 2.0.0 Thrift Server problem with Hive metastore

2016-09-06 Thread Chanh Le
Did anyone use the STS of Spark 2.0 in production? For me, I'm still waiting for compatibility with Parquet files created by Spark 1.6.1. > On Sep 6, 2016, at 2:46 PM, Campagnola, Francesco > wrote: > > I mean I have installed Spark 2.0 in the same environment where

RE: Spark 2.0.0 Thrift Server problem with Hive metastore

2016-09-06 Thread Campagnola, Francesco
I mean I have installed Spark 2.0 in the same environment where the Spark 1.6 thrift server was running, then stopped the Spark 1.6 thrift server and started the Spark 2.0 one. If I'm not mistaken, Spark 2.0 should still be compatible with Hive 1.2.1 and no upgrade procedures are required. The

Consuming parquet files built with version 1.8.1

2016-09-06 Thread Dinesh Narayanan
Hello, I have some parquet files generated with 1.8.1 through an MR job that I need to consume. I see that master is built with parquet 1.8.1, but I get this error (with the master branch): java.lang.NoSuchMethodError:

Dataframe, Java: How to convert String to Vector ?

2016-09-06 Thread Yan Facai
Hi, I have a csv file like: uid mid features label 1235231 [0, 1, 3, ...] True Both "features" and "label" columns are used for GBTClassifier. However, when I read the file: Dataset<Row> samples = sparkSession.read().csv(file); The type of samples.select("features") is

Re: SPARK ML- Feature Selection Techniques

2016-09-06 Thread DB Tsai
You can try LOR (logistic regression) with L1 regularization. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon, Sep 5, 2016 at 5:31 AM, Bahubali Jain wrote: > Hi, > Do we have any feature selection techniques
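For context, a sketch of L1-regularized logistic regression, where coefficients driven to zero act as feature selection (trainingData is an assumed DataFrame with "features" and "label" columns):

    import org.apache.spark.ml.classification.LogisticRegression

    val lor = new LogisticRegression()
      .setElasticNetParam(1.0)   // 1.0 = pure L1 (lasso)
      .setRegParam(0.01)         // regularization strength; tune this

    val model = lor.fit(trainingData)
    println(model.coefficients)  // zero entries suggest droppable features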

Re: Any estimate for a Spark 2.0.1 release date?

2016-09-06 Thread Takeshi Yamamuro
Oh, sorry. I forgot to attach the URL: https://www.mail-archive.com/user@spark.apache.org/msg55723.html // maropu On Tue, Sep 6, 2016 at 2:41 PM, Morten Hornbech wrote: > Sorry. Seen what? I think you forgot a link. > > Morten > > Den 6. sep. 2016 kl. 04.51 skrev Takeshi