RE: Re: Spark assembly in Maven repo?

2015-12-14 Thread Xiaoyong Zhu
Thanks for the info! Xiaoyong From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, December 14, 2015 12:20 AM To: Xiaoyong Zhu Cc: user Subject: Re: Re: Spark assembly in Maven repo? Yes, though I think the Maven Central repository is more

Re: [SparkR] Is rdd in SparkR deprecated ?

2015-12-14 Thread Jeff Zhang
Thanks Felix, I was just curious as I read the code. On Tue, Dec 15, 2015 at 1:32 AM, Felix Cheung wrote: > RDD API in SparkR is not officially supported. You could still access them > with the SparkR::: prefix though. > > May I ask what uses you have for them? Would the

Re: Problems w/YARN Spark Streaming app reading from Kafka

2015-12-14 Thread Robert Towne
Cody, sorry I didn’t get back sooner; I never saw the response go by. I was looking at the Spark UI. I’ll see if I can recreate the issue with version 1.5.2. Thanks. From: Cody Koeninger > Date: Friday, October 16, 2015 at 12:48 To: robert towne

Re: Discover SparkUI port for spark streaming job running in cluster mode

2015-12-14 Thread Ted Yu
w.r.t. getting application Id, please take a look at the following in SparkContext : /** * A unique identifier for the Spark application. * Its format depends on the scheduler implementation. * (i.e. * in case of local spark app something like 'local-1433865536131' * in case of
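A minimal sketch of what Ted is pointing at, assuming an existing SparkContext named sc; logging the port from the conf is my addition, and the actually bound port can differ if 4040 was already taken:

```
// Log the application ID and configured UI port at startup so they can be
// found without digging through container logs; sc is assumed to exist.
val appId = sc.applicationId  // e.g. "local-1433865536131", or "application_..." on YARN
val uiPort = sc.getConf.get("spark.ui.port", "4040")  // configured, not necessarily bound
println(s"Spark application $appId, configured UI port $uiPort")
```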

Database does not exist: (Spark-SQL ===> Hive)

2015-12-14 Thread Gokula Krishnan D
Hello All - I tried to execute a Spark-Scala program in order to create a table in Hive and faced a couple of errors, so I just tried to execute "show tables" and "show databases". I have already created a database named "test_db", but I encountered the error "Database does not exist"

Re: Spark Streaming having trouble writing checkpoint

2015-12-14 Thread Robert Towne
I forgot to include the data node logs for this time period: 2015-12-14 00:14:52,836 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: server51:50010:DataXceiver error processing unknown operation src: /127.0.0.1:39442 dst: /127.0.0.1:50010 java.io.EOFException at

Re: Discover SparkUI port for spark streaming job running in cluster mode

2015-12-14 Thread Jonathan Kelly
Oh, nice, I did not know about that property. Thanks! On Mon, Dec 14, 2015 at 4:28 PM, Ted Yu wrote: > w.r.t. getting application Id, please take a look at the following > in SparkContext : > > /** >* A unique identifier for the Spark application. >* Its format

Spark Streaming having trouble writing checkpoint

2015-12-14 Thread Robert Towne
I have a Spark Streaming app (1.5.2, compiled for Hadoop 2.6) that occasionally has problems writing its checkpoint file. This is a YARN (yarn-cluster) app running as user mapred. What I see in my streaming app logs is: App log 15/12/15 00:14:08 server51:

MLlib Word2Vec vector representations are very high in value

2015-12-14 Thread jxieeducation
Hi, For Word2Vec in MLlib, when I use a large number of partitions (e.g. 256), my vectors turn out to be very large. I am looking for a representation bounded in (-1, 1) like all other Word2Vec implementations (e.g. gensim, Google's word2vec). E.g. scala> var m =
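If bounded components are what's needed, one hedged workaround is to L2-normalise the learned vectors after training; the normalisation step below is my addition, not something MLlib does itself:

```
import org.apache.spark.mllib.feature.Word2VecModel

// L2-normalise each word vector so every component falls within [-1, 1];
// getVectors returns Map[String, Array[Float]] in MLlib 1.x.
def normalize(model: Word2VecModel): Map[String, Array[Float]] =
  model.getVectors.map { case (word, vec) =>
    val norm = math.sqrt(vec.map(v => v.toDouble * v).sum).toFloat
    word -> (if (norm == 0f) vec else vec.map(_ / norm))
  }
```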

Re: Spark streaming driver java process RSS memory constantly increasing using cassandra driver

2015-12-14 Thread Conor Fennell
Just bumping the issue I am having; can anyone provide direction? I have been stuck on this for a while now. Thanks, Conor On Fri, Dec 11, 2015 at 5:10 PM, Conor Fennell wrote: > Hi, > > I have a memory leak in the spark driver which is not in the heap or > the

RuntimeException: Failed to check null bit for primitive int type

2015-12-14 Thread zml张明磊
Hi, My Spark version is spark-1.4.1-bin-hadoop2.6. When I submit a Spark job and read data from a Hive table, I get the following error. Although it's just a WARN, it leads to job failure. Perhaps the following JIRA has already solved it, so I am confused.

[SparkR] Is rdd in SparkR deprecated ?

2015-12-14 Thread Jeff Zhang
From the source code of SparkR, it seems SparkR supports the RDD API, but there's no documentation on that ( http://spark.apache.org/docs/latest/sparkr.html ), so I guess it is deprecated. Is that right? -- Best Regards Jeff Zhang

Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread Deepak Sharma
An approach I can think of is using the Ambari Metrics Service (AMS). Using these metrics, you can decide whether the cluster is low on resources. If yes, call the Ambari management API to add the node to the cluster. Thanks Deepak On Mon, Dec 14, 2015 at 2:48 PM, cs user

manipulate schema inside a repeated column

2015-12-14 Thread Samuel
Hi, I am having issues trying to rename or move subcolumns when they are inside a repeated structure. Given a certain schema, I can create a different layout to provide an alternative view. For example, I can move one column and put it inside a subcolumn, and add an extra literal field, just for

Kryo serialization fails when using SparkSQL and HiveContext

2015-12-14 Thread Linh M. Tran
Hi everyone, I'm using HiveContext and SparkSQL to query a Hive table and doing join operation on it. After changing the default serializer to Kryo with spark.kryo.registrationRequired = true, the Spark application failed with the following error: java.lang.IllegalArgumentException: Class is not

Re: Unsubscribe

2015-12-14 Thread Akhil Das
Send an email to user-unsubscr...@spark.apache.org to unsubscribe from the list. See more over http://spark.apache.org/community.html Thanks Best Regards 2015-12-09 22:18 GMT+05:30 Michael Nolting : > cancel > > -- >

Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread cs user
Hi Mingyu, I'd be interested in hearing about anything else you find which might meet your needs for this. One way perhaps this could be done would be to use Ambari. Ambari comes with a nice API which you can use to add additional nodes into a cluster:

Spark parallelism with mapToPair

2015-12-14 Thread Rabin Banerjee
Hi Team, I am new to Spark Streaming. I am trying to write a Spark Streaming application where the calculation on the incoming data will be performed in R within the micro-batching. But I want to make wordCounts.mapToPair parallel, where wordCounts is the output of groupByKey. How can I ensure

Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
Hi all, What's the best way to run ad-hoc queries against cached RDDs? For example, say I have an RDD that has been processed and persisted memory-only. I want to be able to run a count (actually "countApproxDistinct") after filtering by a property that is unknown at compile time (specified by the query)

Re: Can't filter

2015-12-14 Thread Akhil Das
If you are not using spark-submit to run the job, then you need to add the following line after creating the SparkContext: sc.addJar("target/scala_2.11/spark.jar"), where spark.jar is your project jar. Thanks Best Regards On Thu, Dec 10, 2015 at 5:29 PM, Бобров Виктор wrote:

Re: How to change StreamingContext batch duration after loading from checkpoint

2015-12-14 Thread Akhil Das
Taking the values from a configuration file rather than hard-coding them might help; I haven't tried it, though. Thanks Best Regards On Mon, Dec 7, 2015 at 9:53 PM, yam wrote: > Is there a way to change the streaming context batch interval after > reloading > from
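A sketch of that idea, equally untested, using Typesafe Config; the config key, file, and app name are assumptions, and note the checkpointed graph also records the old interval, which is presumably why this needs trying:

```
import com.typesafe.config.ConfigFactory
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Read the batch interval from an external config file instead of
// hard-coding it; "app.batch.seconds" is an illustrative key.
val sparkConf = new SparkConf().setAppName("my-stream")
val batchSeconds = ConfigFactory.load().getLong("app.batch.seconds")

def createContext(): StreamingContext =
  new StreamingContext(sparkConf, Seconds(batchSeconds))
```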

Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-14 Thread alberskib
Hey all, When my streaming application is restarting from failure (from checkpoint) I am receiving a strange error: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to com.example.sender.MyClassReporter. An instance of class B is created on the driver side

RE: Spark streaming driver java process RSS memory constantly increasing using cassandra driver

2015-12-14 Thread Singh, Abhijeet
Ok, I get it now. You might be right; it is possible the Cassandra driver is leaking some memory. The driver might have some open sockets or stuff like that. However, if it is a native memory leak issue, I would suggest trying an alternative driver or proving that this is indeed the problem

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-14 Thread Ted Yu
Can you show the complete stack trace for the ClassCastException ? Please see the following thread: http://search-hadoop.com/m/q3RTtgEUHVmJA1T1 Cheers On Mon, Dec 14, 2015 at 7:33 AM, alberskib wrote: > Hey all, > > When my streaming application is restarting from failure

Re: Problem using User Defined Predicate pushdown with core RDD and parquet - UDP class not found

2015-12-14 Thread chao chu
+spark user mailing list Hi there, I have exactly the same problem as mentioned below. My current workaround is to add the jar containing my UDP to one of the system classpaths (for example, put it under the same path as

Re: how to make a dataframe of Array[Doubles] ?

2015-12-14 Thread Jeff Zhang
Please use a tuple instead of an array (the element type must implement trait Product if you want to convert the RDD to a DF). val testvec = Array( (1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0)) On Tue, Dec 15, 2015 at 1:12 PM, AlexG wrote: > My attempts to create a dataframe of
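Putting Jeff's suggestion together as a runnable spark-shell (1.x) snippet; the column names are illustrative:

```
// Tuples implement Product, so the implicit toDF conversion applies.
import sqlContext.implicits._

val testvec = Array((1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0))
val df = sc.parallelize(testvec).toDF("c1", "c2", "c3", "c4")
df.show()
```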

how to make a dataframe of Array[Doubles] ?

2015-12-14 Thread AlexG
In my attempts to create a DataFrame of Array[Double], I get an error about RDD[Array[Double]] not having a toDF function: import sqlContext.implicits._ val testvec = Array( Array(1.0, 2.0, 3.0, 4.0), Array(5.0, 6.0, 7.0, 8.0)) val testrdd = sc.parallelize(testvec) testrdd.toDF gives :29: error:

Linear Regression with OLS

2015-12-14 Thread Arunkumar Pillai
Hi, I need an example for Linear Regression using OLS. val data = sc.textFile("data/mllib/ridge-data/lpsa.data") val parsedData = data.map { line => val parts = line.split(',') LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble))) }.cache() // Building the model
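A hedged completion of the snippet above using MLlib 1.x's LinearRegressionWithSGD; note it fits the least-squares objective by gradient descent rather than solving the normal equations, and the iteration count is a guess:

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Train and compute training MSE; 100 iterations is illustrative.
val model = LinearRegressionWithSGD.train(parsedData, 100)
val mse = parsedData.map { p =>
  val err = model.predict(p.features) - p.label
  err * err
}.mean()
println(s"training MSE = $mse")
```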

mapValues Transformation (JavaPairRDD)

2015-12-14 Thread Sushrut Ikhar
Hi, I am finding it difficult to understand the following problem: I count the number of records before and after applying the mapValues transformation to a JavaPairRDD. As expected, the number of records was the same before and after. Now, I counted the number of distinct keys before and after

Re: Database does not exist: (Spark-SQL ===> Hive)

2015-12-14 Thread Jeff Zhang
Did you put hive-site.xml on the classpath? On Tue, Dec 15, 2015 at 11:14 AM, Gokula Krishnan D wrote: > Hello All - > > > I tried to execute a Spark-Scala Program in order to create a table in > HIVE and faced couple of error so I just tried to execute the "show tables" >

what are the cons/drawbacks of a Spark DataFrames

2015-12-14 Thread email2...@gmail.com
Hello All - I've started using Spark DataFrames and it looks like they provide rich column-level operations and functions. At the same time, I would like to understand whether there are any drawbacks / cons of using DataFrames. If so, please share your experience. Thanks, Gokul -- View

Re: HDFS

2015-12-14 Thread Akhil Das
Try setting spark.locality.wait to a higher number and see if things change. You can read more about the configuration properties here: http://spark.apache.org/docs/latest/configuration.html#scheduling Thanks Best Regards On Sat, Dec 12, 2015 at 12:16 AM, shahid ashraf

Unsubscribe

2015-12-14 Thread Roman Garcia
Cancel Unsubscribe

How do I link JavaEsSpark.saveToEs() to a sparkConf?

2015-12-14 Thread Spark Enthusiast
Folks, I have the following program: SparkConf conf = new SparkConf().setMaster("local").setAppName("Indexer").set("spark.driver.maxResultSize", "2g"); conf.set("es.index.auto.create", "true"); conf.set("es.nodes", "localhost"); conf.set("es.port", "9200"); conf.set("es.write.operation",

RE: Spark streaming driver java process RSS memory constantly increasing using cassandra driver

2015-12-14 Thread Singh, Abhijeet
Hi Conor, What do you mean when you say the leak is not in "Heap or non-Heap"? If it is not heap-related then it has to be native memory that is leaking. I can't say for sure, but you do have threads working there and they could be using native memory. We didn't get any pics of JConsole.

why "cache table a as select * from b" will do shuffle,and create 2 stages.

2015-12-14 Thread ant2nebula
why "cache table a as select * from b" will do shuffle,and create 2 stages. example: table "ods_pay_consume" is from "KafkaUtils.createDirectStream" hiveContext.sql("cache table dwd_pay_consume as select * from ods_pay_consume") this code will make 2 statges of DAG

Re: Spark streaming driver java process RSS memory constantly increasing using cassandra driver

2015-12-14 Thread Conor Fennell
Hi Abhijeet, Thanks for pointing out the pics are not showing. I have put all the images in this public google document: https://docs.google.com/document/d/1xEJ0eTtXBlSso6SshLCWZHcRw4aEQMJflzKBsr2FHaw/edit?usp=sharing All the code is in the first email; there is nothing else starting up threads

Re: How do I link JavaEsSpark.saveToEs() to a sparkConf?

2015-12-14 Thread Ali Gouta
You don't need an explicit association between your JavaEsSpark and the SparkConf. Actually, once you have applied your transformations/filters through "sc", you can store the final RDD in your ES. Example: val generateRDD = sc.makeRDD(Seq(SOME_STUFF)) JavaEsSpark.saveToEs(generateRDD, "foo");
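For Scala users the equivalent shape looks roughly like this (the index/type name and documents are illustrative); the es.* settings travel with the SparkConf, which is why no explicit link to JavaEsSpark is needed:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs to RDDs

val conf = new SparkConf().setAppName("Indexer")
  .set("es.index.auto.create", "true")
  .set("es.nodes", "localhost")
  .set("es.port", "9200")
val sc = new SparkContext(conf)

// saveToEs picks up the es.* settings from the SparkConf behind sc.
sc.makeRDD(Seq(Map("title" -> "doc1"), Map("title" -> "doc2")))
  .saveToEs("myindex/mytype")
```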

Re: memory leak when saving Parquet files in Spark

2015-12-14 Thread Matt K
Thanks Cheng! I'm running 1.5. After setting the following, I'm no longer seeing this issue: sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") Thanks, -Matt On Fri, Dec 11, 2015 at 1:58 AM, Cheng Lian wrote: > This is probably caused by schema

Re: memory leak when saving Parquet files in Spark

2015-12-14 Thread Cheng Lian
Thanks for the feedback, Matt! Yes, we've also seen other feedback about slow Parquet summary file generation, especially when appending a small dataset to an existing large dataset. Disabling it is a reasonable workaround since the summary files are no longer important after parquet-mr 1.7.

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
Thanks for the response Jörn. So to elaborate, I have a large dataset with userIds, each tagged with properties, e.g.: user_1: prop1=X; user_2: prop1=Y, prop2=A; user_3: prop2=B. I would like to be able to get the number of distinct users that have a particular property (or combination of
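One possible shape for that query, sketched under the assumption that the cached data is an RDD of (userId, properties) pairs and the predicate arrives at runtime:

```
import org.apache.spark.rdd.RDD

// users is assumed cached elsewhere: userId -> set of "prop=value" tags.
def distinctUsersWith(users: RDD[(String, Set[String])],
                      wanted: Set[String]): Long =
  users.filter { case (_, props) => wanted.subsetOf(props) }
       .map { case (userId, _) => userId }
       .countApproxDistinct(0.01)  // relative standard deviation, illustrative
```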

How to Make Spark Streaming DStream as SQL table?

2015-12-14 Thread MK
Hi, The aim here is as follows: - read data from a socket using Spark Streaming every N seconds - register the received data as a SQL table - more data will be read from HDFS etc. as reference data; it will also be registered as SQL tables - the idea is to perform arbitrary SQL
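A minimal sketch of the usual Spark 1.x pattern: inside foreachRDD, convert each micro-batch to a DataFrame and re-register it, so SQL always sees the latest batch alongside the HDFS-backed reference tables. The host, port, interval, and Record schema are assumptions:

```
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Record(line: String)  // hypothetical schema for the socket data

val ssc = new StreamingContext(sc, Seconds(10))      // N = 10s, illustrative
val lines = ssc.socketTextStream("localhost", 9999)  // host/port are assumptions

lines.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  rdd.map(Record(_)).toDF().registerTempTable("stream_data")
  // arbitrary SQL over the latest batch, joinable with other registered tables:
  sqlContext.sql("SELECT COUNT(*) FROM stream_data").show()
}
ssc.start()
```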

RE: How to save Multilayer Perceptron Classifier model.

2015-12-14 Thread Ulanov, Alexander
Hi Vadim, As Yanbo pointed out, that feature is not yet merged into the main branch. However, there is a hacky workaround: // save model sc.parallelize(Seq(model), 1).saveAsObjectFile("path") // load model val sameModel = sc.objectFile[YourCLASS]("path").first() Best regards, Alexander From:

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Jörn Franke
Can you elaborate a little bit more on the use case? It looks a little bit like an abuse of Spark in general. Interactive queries that are not suitable for in-memory batch processing might be better supported by Ignite, which has in-memory indexes, a concept of hot, warm, and cold data, etc., or by Hive on

Re: Classpath problem trying to use DataFrames

2015-12-14 Thread Christopher Brady
Thanks for the response. I lost access to my cluster over the weekend, so I had to wait until today to check. All of the correct Hive jars are in classpath.txt. Also, this error seems to be happening in the driver rather than the executors. It's running in yarn-client mode, so it should use

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-14 Thread Bartłomiej Alberski
Below is the full stack trace (real names of my classes were changed) with a short description of the entries from my code: rdd.mapPartitions{ partition => // this is the line to which the second stack trace entry points val sender = broadcastedValue.value // this is the main place to which the first

UNSUBSCRIBE

2015-12-14 Thread Tim Barthram
UNSUBSCRIBE Thanks

Discover SparkUI port for spark streaming job running in cluster mode

2015-12-14 Thread Ashish Nigam
Hi, I run a Spark Streaming job in cluster mode. This means that the driver can run on any data node, and the Spark UI can run on any dynamic port. At present, I find out the port by looking at container logs that look something like this - server.AbstractConnector: Started

Saving to JDBC

2015-12-14 Thread Bob Corsaro
Is there any way to map pyspark.sql.Row columns to JDBC table columns, or do I have to just put them in the right order before saving? I'm using code like this: ``` rdd = rdd.map(lambda i: Row(name=i.name, value=i.value)) sqlCtx.createDataFrame(rdd).write.jdbc(dbconn_string, tablename,

worker:java.lang.ClassNotFoundException: ttt.test$$anonfun$1

2015-12-14 Thread Bonsen
package ttt import org.apache.spark.SparkConf import org.apache.spark.SparkContext object test { def main(args: Array[String]) { val conf = new SparkConf() conf.setAppName("mytest") .setMaster("spark://Master:7077") .setSparkHome("/usr/local/spark")

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Integer" for a long time.

2015-12-14 Thread Jakob Odersky
> Sorry, I'm late. I tried again and again; now I use Spark 1.4.0, Hadoop 2.4.1, but I also find something strange like this: > http://apache-spark-user-list.1001560.n3.nabble.com/worker-java-lang-ClassNotFoundException-ttt-test-anonfun-1-td25696.html > (if I use "textFile", it can't run.) In the

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Integer" for a long time.

2015-12-14 Thread Jakob Odersky
sorry typo, I meant *without* the addJar On 14 December 2015 at 11:13, Jakob Odersky wrote: > > Sorry,I'm late.I try again and again ,now I use spark 1.4.0 ,hadoop > 2.4.1.but I also find something strange like this : > > > >

Adding a UI Servlet Filter

2015-12-14 Thread iamknome
Hello all, I am trying to set up a UI filter for the Web UI, adding my custom auth servlet filter to the worker and master processes. I have added the extraClasspath option and pointed it at my custom JAR, but when the worker or master starts it keeps complaining about

Strange Set of errors

2015-12-14 Thread Steve Lewis
I am running on a Spark 1.5.1 cluster managed by Mesos. I have an application that handles a chemistry problem whose size can be increased by increasing the number of atoms, which increases the number of Spark stages. I do a repartition at each stage; Stage 9 is the last stage. At each stage the size and

SparkML algos limitations question.

2015-12-14 Thread Eugene Morozov
Hello! I'm currently working on a POC and trying to use Random Forest (classification and regression). I also have to check SVM and the multiclass perceptron (other algos are less important at the moment). So far I've discovered that Random Forest has a maxDepth limitation for trees, and just out of

Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
Cool. Using Ambari to monitor and scale up/down the cluster sounds promising. Thanks for the pointer! Mingyu From: Deepak Sharma Date: Monday, December 14, 2015 at 1:53 AM To: cs user Cc: Mingyu Kim , "user@spark.apache.org"

Re: [SparkR] Any reason why saveDF's mode is append by default ?

2015-12-14 Thread Shivaram Venkataraman
I think its just a bug -- I think we originally followed the Python API (in the original PR [1]) but the Python API seems to have been changed to match Scala / Java in https://issues.apache.org/jira/browse/SPARK-6366 Feel free to open a JIRA / PR for this. Thanks Shivaram [1]

Re: Re: Spark assembly in Maven repo?

2015-12-14 Thread Sean Owen
Yes, though I think the Maven Central repository is more canonical. http://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.5.2/ On Mon, Dec 14, 2015, 06:35 Xiaoyong Zhu wrote: > Thanks! do you mean something here (for example for 1.5.1 using scala > 2.10)? > >
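For reference, pulling that artifact from Maven Central in an sbt build would look something like this; the provided scope is the usual choice when spark-submit supplies Spark at runtime:

```
// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"
```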

Re: UNSUBSCRIBE

2015-12-14 Thread Mithila Joshi
unsubscribe On Mon, Dec 14, 2015 at 4:49 PM, Tim Barthram wrote: > UNSUBSCRIBE Thanks

Re: Concatenate a string to a Column of type string in DataFrame

2015-12-14 Thread Michael Armbrust
In earlier versions you should be able to use callUdf or callUDF (depending on the version) and call the Hive function "concat". On Sun, Dec 13, 2015 at 3:05 AM, Yanbo Liang wrote: > Sorry, it was added since 1.5.0. > > 2015-12-13 2:07 GMT+08:00 Satish

Re: RuntimeException: Failed to check null bit for primitive int type

2015-12-14 Thread Michael Armbrust
Your code (at com.ctrip.ml.toolimpl.MetadataImpl$$anonfun$1.apply(MetadataImpl.scala:22)) needs to check isNullAt before calling getInt. This is because you cannot return null for a primitive value (Int). On Mon, Dec 14, 2015 at 3:40 AM, zml张明磊 wrote: > Hi, > > > >
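The fix Michael describes, sketched as a small helper; the column ordinal is a stand-in for whatever MetadataImpl.scala:22 actually reads:

```
import org.apache.spark.sql.Row

// Guard the primitive getter: getInt cannot represent SQL NULL.
def safeInt(row: Row, ordinal: Int): Option[Int] =
  if (row.isNullAt(ordinal)) None else Some(row.getInt(ordinal))
```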

troubleshooting "Missing an output location for shuffle"

2015-12-14 Thread Veljko Skarich
Hi, I keep getting some variation of the following error: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 2 Does anyone know what this might indicate? Is it a memory issue? Any general guidance appreciated.

Re: Discover SparkUI port for spark streaming job running in cluster mode

2015-12-14 Thread Jonathan Kelly
Are you running Spark on YARN? If so, you can get to the Spark UI via the YARN ResourceManager. Each running Spark application will have a link on the YARN ResourceManager labeled "ApplicationMaster". If you click that, it will take you to the Spark UI, even if it is running on a slave node in the

Re: Kryo serialization fails when using SparkSQL and HiveContext

2015-12-14 Thread Michael Armbrust
You'll need to either turn off registration (spark.kryo.registrationRequired) or create a custom registrator (spark.kryo.registrator). http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization On Mon, Dec 14, 2015 at 2:17 AM, Linh M. Tran wrote: >
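A minimal custom registrator along the lines Michael suggests; MyClass and the package name stand in for whatever class the IllegalArgumentException names:

```
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyClass  // stand-in for the unregistered class from the error

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyClass])
  }
}
// wired up via: conf.set("spark.kryo.registrator", "com.example.MyRegistrator")
```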

Re: Use of rdd.zipWithUniqueId() in DStream

2015-12-14 Thread Shixiong Zhu
It doesn't guarantee that. E.g., scala> sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0), 2).filter(_ > 2.0).zipWithUniqueId().collect().foreach(println) (3.0,1) (4.0,3) It only guarantees "unique". Best Regards, Shixiong Zhu 2015-12-13 10:18 GMT-08:00 Sourav Mazumder : >

Re: troubleshooting "Missing an output location for shuffle"

2015-12-14 Thread Ross.Cramblit
Hey Veljko, I ran into this error a few days ago and it turned out I was out of disk space on a couple of nodes. I am not sure if this was the direct cause of the error, but it stopped being thrown when I cleared out some unneeded large files. On Dec 14, 2015, at 5:32 PM, Veljko Skarich

Re: [SparkR] Any reason why saveDF's mode is append by default ?

2015-12-14 Thread Jeff Zhang
Thanks Shivaram, created https://issues.apache.org/jira/browse/SPARK-12318 I will work on it. On Mon, Dec 14, 2015 at 4:13 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > I think its just a bug -- I think we originally followed the Python > API (in the original PR [1]) but the

Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
Hi all, Has anyone tried out autoscaling a Spark YARN cluster on a public cloud (e.g. EC2) based on workload? To be clear, I'm interested in scaling the cluster itself up and down by adding and removing YARN nodes based on cluster resource utilization (e.g. # of applications queued, # of

ALS mllib.recommendation vs ml.recommendation

2015-12-14 Thread Roberto Pagliari
Currently, there are two implementations of ALS available: ml.recommendation.ALS and