Re: Spark on YARN multitenancy

2015-12-15 Thread Ben Roling
I'm curious to see the feedback others will provide. My impression is the only way to get Spark to give up resources while it is idle would be to use the preemption feature of the scheduler you're using in YARN. When another user comes along the scheduler would preempt one or more Spark

Re: Spark on YARN multitenancy

2015-12-15 Thread Ashwin Sai Shankar
We run large multi-tenant clusters with spark/hadoop workloads, and we use YARN's preemption together with Spark's dynamic allocation to achieve multitenancy. See the following link on how to enable/configure preemption using the fair scheduler:
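For reference, a minimal sketch of the Spark side of that setup (dynamic allocation on YARN). The property names are the standard Spark 1.5 ones, but the values and the YARN-side preemption settings are illustrative and cluster-specific:

    import org.apache.spark.{SparkConf, SparkContext}

    // Dynamic allocation lets an idle application give executors back to YARN.
    // The external shuffle service must be running on the NodeManagers for this to work.
    val conf = new SparkConf()
      .setAppName("multi-tenant-job")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")          // illustrative values
      .set("spark.dynamicAllocation.maxExecutors", "50")
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
    val sc = new SparkContext(conf)

    // Preemption itself is configured on the YARN side, e.g.
    // yarn.scheduler.fair.preemption=true in yarn-site.xml plus per-queue
    // preemption timeouts in the fair scheduler allocations file.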

Securing objects on the thrift server

2015-12-15 Thread Younes Naguib
Hi all, I get this error when running "show current roles;" : 2015-12-15 15:50:41 WARN org.apache.hive.service.cli.thrift.ThriftCLIService ThriftCLIService:681 - Error fetching results: org.apache.hive.service.cli.HiveSQLException: Couldn't find log associated with operation handle:

Re: how to spark streaming application start working on next batch before completing on previous batch .

2015-12-15 Thread Mukesh Jha
Try setting spark.streaming.concurrentJobs to the number of concurrent jobs you want to run. On 15 Dec 2015 17:35, "ikmal" wrote: > The best practice is to set batch interval less than processing time. I'm > sure your application is suffering from constantly
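A sketch of how that property would be set on a standard SparkConf; note the warning later in this digest that spark.streaming.concurrentJobs is not officially supported:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.streaming.concurrentJobs", "2")   // allow up to 2 batch jobs to run concurrently (unsupported knob)

    val ssc = new StreamingContext(conf, Seconds(10))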

Re: Spark on YARN multitenancy

2015-12-15 Thread Ben Roling
Oops - I meant while it is *busy* when I said while it is *idle*. On Tue, Dec 15, 2015 at 11:35 AM Ben Roling wrote: > I'm curious to see the feedback others will provide. My impression is the > only way to get Spark to give up resources while it is idle would be to use >

RE: Securing objects on the thrift server

2015-12-15 Thread Younes Naguib
The one coming with spark 1.5.2. y From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: December-15-15 1:59 PM To: Younes Naguib Cc: user@spark.apache.org Subject: Re: Securing objects on the thrift server Which Hive release are you using ? Please take a look at HIVE-8529 Cheers On Tue, Dec 15,

Re: mapValues Transformation (JavaPairRDD)

2015-12-15 Thread Sushrut Ikhar
Well, the issue was that I was using some non-thread-safe functions for generating the key. Regards, Sushrut Ikhar https://about.me/sushrutikhar On Tue, Dec 15, 2015 at 2:27 PM, Paweł Szulc wrote: > Hard to

Re: Securing objects on the thrift server

2015-12-15 Thread Ted Yu
Which Hive release are you using ? Please take a look at HIVE-8529 Cheers On Tue, Dec 15, 2015 at 8:25 AM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi all, > > I get this error when running "show current roles;" : > > 2015-12-15 15:50:41 WARN >

UDAF support in PySpark?

2015-12-15 Thread Wei Chen
Hi, I am wondering if there is UDAF support in PySpark with Spark 1.5. If not, is Spark 1.6 going to incorporate that? Thanks, Wei

Re: ALS mllib.recommendation vs ml.recommendation

2015-12-15 Thread Bryan Cutler
Hi Roberto, 1. How do they differ in terms of performance? They both use alternating least squares matrix factorization; the main difference is that ml.recommendation.ALS uses DataFrames as input, which have built-in optimizations and should give better performance. 2. Am I correct to assume

Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Andy Davidson
My understanding is one of the biggest advantages of DFs is that the schema information allows a lot of optimization. For example, assume the frame has many columns but your computation only uses 2 columns. No need to load all the data. Andy From: "email2...@gmail.com" Date:

Re: how to make a dataframe of Array[Doubles] ?

2015-12-15 Thread Michael Armbrust
You don't have to turn your array into a tuple, but you do need to have a product that wraps it (this is how we get names for the columns). case class MyData(data: Array[Double]) val df = Seq(MyData(Array(1.0, 2.0, 3.0, 4.0)), ...).toDF() On Mon, Dec 14, 2015 at 9:35 PM, Jeff Zhang
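A self-contained version of that snippet, assuming a spark-shell style environment with an existing SparkContext sc:

    import org.apache.spark.sql.SQLContext

    case class MyData(data: Array[Double])

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._          // brings toDF() into scope

    val df = Seq(
      MyData(Array(1.0, 2.0, 3.0, 4.0)),
      MyData(Array(5.0, 6.0, 7.0, 8.0))
    ).toDF()

    df.printSchema()   // data: array<double>, column name taken from the case class field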

Can't create UDF through thriftserver, no error reported

2015-12-15 Thread Antonio Piccolboni
Hi, I am trying to create a UDF using the thriftserver. I followed this example, which is originally for hive. My understanding is that the thriftserver creates a hivecontext and Hive UDFs should be supported. I then sent this query to the thriftserver (I

Re: Securing objects on the thrift server

2015-12-15 Thread Todd Nist
see https://issues.apache.org/jira/browse/SPARK-11043, it is resolved in 1.6. On Tue, Dec 15, 2015 at 2:28 PM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > The one coming with spark 1.5.2. > > > > y > > > > *From:* Ted Yu [mailto:yuzhih...@gmail.com] > *Sent:* December-15-15 1:59 PM

Spark big rdd problem

2015-12-15 Thread Eran Witkon
When running val data = sc.wholeTextFiles("someDir/*") data.count() I get numerous warnings from yarn till I get an Akka association exception. Can someone explain what happens when spark loads this rdd and can't fit it all in memory? Based on the exception it looks like the server is disconnecting from

Re: About Spark On Hbase

2015-12-15 Thread Zhan Zhang
If you want dataframe support, you can refer to https://github.com/zhzhan/shc, which I am working on to integrate to HBase upstream with existing support. Thanks. Zhan Zhang On Dec 15, 2015, at 4:34 AM, censj > wrote: hi,fight fate Did I can in

Re: mapValues Transformation (JavaPairRDD)

2015-12-15 Thread Paweł Szulc
Hard to imagine. Can you share a code sample? On Tue, Dec 15, 2015 at 8:06 AM, Sushrut Ikhar wrote: > Hi, > I am finding it difficult to understand the following problem : > I count the number of records before and after applying the mapValues > transformation for a

Mixing Long Run Periodic Update Jobs With Streaming Scoring

2015-12-15 Thread atbrew
Hi, I have a periodic retraining of a long running job (a decision tree trained on a large amount of historical data) that needs to be retrained on a daily/weekly/longer-period basis. These models are used in spark streaming to score incoming data, and I would like to understand what is best practice

Re: Database does not exist: (Spark-SQL ===> Hive)

2015-12-15 Thread amarouni
Can you test with the latest version of spark? I had the same issue with 1.3 and it was resolved in 1.5. On 15/12/2015 04:31, Jeff Zhang wrote: > Do you put hive-site.xml on the classpath ? > > On Tue, Dec 15, 2015 at 11:14 AM, Gokula Krishnan D > >

Re: how to spark streaming application start working on next batch before completing on previous batch .

2015-12-15 Thread ikmal
The best practice is to set the batch interval less than the processing time. I'm sure your application is suffering from a constantly increasing scheduling delay.

Re: default parallelism and mesos executors

2015-12-15 Thread Adrian Bridgett
Thanks Iulian, I'll retest with 1.6.x once it's released (probably won't have enough spare time to test with the RC). On 11/12/2015 15:00, Iulian Dragoș wrote: On Wed, Dec 9, 2015 at 4:29 PM, Adrian Bridgett > wrote: (resending,

Comparison of serialized objects

2015-12-15 Thread Max
Hi All, I’m currently trying to compare the types String (java.lang.String) and Text (org.apache.hadoop.io.Text) in the context of comparing serialized objects in Spark. Either type would be used as a key, so this might be relevant in the following use-cases: a) RDD.saveAsObjectFile

Unable to get json for application jobs in spark 1.5.0

2015-12-15 Thread rakesh rakshit
Hi all, I am trying to get the json for all the jobs running within my application whose UI port is 4040. I am making an HTTP GET request at the following URI: http://172.26.32.143:4040/api/v1/applications/gauravpipe/jobs But getting the following service unavailable exception: Error 503

Cluster mode dependent jars not working

2015-12-15 Thread vimal dinakaran
I am running spark using cluster mode for deployment . Below is the command JARS=$JARS_HOME/amqp-client-3.5.3.jar,$JARS_HOME/nscala-time_2.10-2.0.0.jar,\ $JARS_HOME/kafka_2.10-0.8.2.1.jar,$JARS_HOME/kafka-clients-0.8.2.1.jar,\ $JARS_HOME/spark-streaming-kafka_2.10-1.4.1.jar,\

Re: About Spark On Hbase

2015-12-15 Thread Ted Yu
There is also http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase FYI On Tue, Dec 15, 2015 at 11:51 AM, Zhan Zhang wrote: > If you want dataframe support, you can refer to > https://github.com/zhzhan/shc, which I am working on to integrate to > HBase

How to do map join in Spark SQL

2015-12-15 Thread Alexander Pivovarov
I have a big folder of ORC files. The files have a duration field (e.g. 3, 12, 26, etc.). I also have a small json file (just 8 rows) with range definitions (min, max, name): 0, 10, A 10, 20, B 20, 30, C etc. Because I cannot do an equi-join between duration and range min/max, I need to do a cross join and apply
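One way to express this as a map-side join in Spark SQL 1.5 is the broadcast hint combined with a range predicate. The column names and paths below are illustrative; the small table should end up in a broadcast nested loop join rather than a shuffled cross join:

    import org.apache.spark.sql.functions.broadcast

    // assumes an existing SQLContext/HiveContext named sqlContext
    val events = sqlContext.read.format("orc").load("/path/to/orc_folder")   // has a duration column
    val ranges = sqlContext.read.json("/path/to/ranges.json")                // columns: min, max, name

    val labeled = events.join(
      broadcast(ranges),
      events("duration") >= ranges("min") && events("duration") < ranges("max"))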

Re: ideal number of executors per machine

2015-12-15 Thread Jakob Odersky
Hi Veljko, I would assume keeping the number of executors per machine to a minimum is best for performance (as long as you consider memory requirements as well). Each executor is a process that can run tasks in multiple threads. On a kernel/hardware level, thread switches are much cheaper than

Re: Hive on Spark - Error: Child process exited before connecting back

2015-12-15 Thread Ophir Etzion
Hi, the versions are spark 1.3.0 and hive 1.1.0 as part of cloudera 5.4.3. I find it weird that it would work only on the version you mentioned as there is documentation (not good documentation but still..) on how to do it with cloudera that packages different versions. Thanks for the answer

Re: UDAF support in PySpark?

2015-12-15 Thread Ted Yu
Please watch SPARK-10915 FYI On Tue, Dec 15, 2015 at 11:45 AM, Wei Chen wrote: > Hi, > > I am wondering if there is UDAF support in PySpark with Spark 1.5. If not, > is Spark 1.6 going to incorporate that? > > Thanks, > Wei >

Re: Spark big rdd problem

2015-12-15 Thread Zhan Zhang
You should be able to get the logs from yarn by “yarn logs -applicationId xxx”, where you can possibly find the cause. Thanks. Zhan Zhang On Dec 15, 2015, at 11:50 AM, Eran Witkon wrote: > When running > val data = sc.wholeTextFiles("someDir/*") data.count() > > I get

Hive on Spark - Error: Child process exited before connecting back

2015-12-15 Thread Ophir Etzion
Hi, when trying to do Hive on Spark on CDH5.4.3 I get the following error when trying to run a simple query using spark. I've tried setting everything written here ( https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started) as well as what the cdh recommends. Anyone

ideal number of executors per machine

2015-12-15 Thread Veljko Skarich
Hi, I'm looking for suggestions on the ideal number of executors per machine. I run my jobs on 64G 32 core machines, and at the moment I have one executor running per machine, on the spark standalone cluster. I could not find many guidelines for figuring out the ideal number of executors; the

java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-15 Thread Rachana Srivastava
I have recently upgraded my spark version, but when I try to save a random forest model using the model save command I am getting a NoSuchMethodError. My code works fine with the 1.3.x version. model.save(sc.sc(), "modelsavedir"); ERROR:

Re: how to spark streaming application start working on next batch before completing on previous batch .

2015-12-15 Thread Tathagata Das
Just to be clear, spark.streaming.concurrentJobs is NOT officially supported. There are issues with fault-tolerance and data loss if that is set to more than 1. On Tue, Dec 15, 2015 at 9:19 AM, Mukesh Jha wrote: > Try setting spark.streaming.concurrentJobs to

Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Jakob Odersky
With DataFrames you lose type-safety. Depending on the language you are using this can also be considered a drawback. On 15 December 2015 at 15:08, Jakob Odersky wrote: > By using DataFrames you will not need to specify RDD operations explicitly, > instead the operations are
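A small illustration of the trade-off, with made-up types and assuming spark-shell's sc and sqlContext: the RDD version catches a bad field name at compile time, while the DataFrame version only fails at run time.

    case class Person(name: String, age: Int)
    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))

    val agesRdd = people.map(_.age + 1)    // typed: a typo such as _.agee would not compile

    import sqlContext.implicits._
    val df = people.toDF()
    df.select("agee")                      // compiles fine, but fails at run time with an
                                           // AnalysisException about an unresolved column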

Re: About Spark On Hbase

2015-12-15 Thread Josh Mahonin
And as yet another option, there is https://phoenix.apache.org/phoenix_spark.html It however requires that you are also using Phoenix in conjunction with HBase. On Tue, Dec 15, 2015 at 4:16 PM, Ted Yu wrote: > There is also >

hiveContext: storing lookup of partitions

2015-12-15 Thread Gourav Sengupta
Hi, I have a HIVE table with a few thousand partitions (based on date and time). It takes a long time to run the first time and then subsequently it is fast. Is there a way to store the cache of partition lookups so that every time I start a new SPARK instance (cannot keep my personal

PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-15 Thread Daniel Valdivia
Hello everyone, I have a PairRDD with a set of keys and a list of values; each value in the list is a json document which I already loaded at the beginning of my spark app. How can I iterate over each value of the list in my pair RDD to transform it to a string and then save the whole content of the key to a file?

Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Jakob Odersky
By using DataFrames you will not need to specify RDD operations explicitly; instead the operations are built and optimized by using the information available in the DataFrame's schema. The only drawback I can think of is some loss of generality: given a dataframe containing types A, you will

Re: Mixing Long Run Periodic Update Jobs With Streaming Scoring

2015-12-15 Thread Tathagata Das
One general piece of advice I can provide: if you wish to run the batch jobs concurrently with spark streaming jobs, then you should put them in different fair scheduling pools, and prioritize the streaming pool, to minimize the streaming jobs being impacted by the batch jobs. See the spark docs online
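A rough sketch of the within-application side of that advice; the pool names here are illustrative, and the pools and their weights themselves live in a fairscheduler.xml referenced by spark.scheduler.allocation.file:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("stream-and-retrain")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // The pool is a thread-local property, so jobs submitted from each thread
    // land in whichever pool that thread selected.
    sc.setLocalProperty("spark.scheduler.pool", "streaming")   // high-priority pool for scoring
    // ... start the StreamingContext / scoring jobs from this thread ...

    sc.setLocalProperty("spark.scheduler.pool", "batch")       // lower-priority pool
    // ... kick off the periodic model retraining from another thread ...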

Re: Spark parallelism with mapToPair

2015-12-15 Thread Tathagata Das
Since mapToPair will be called on each record, and the # of records can be tens of millions, you probably do not want to run ALL of them in parallel. So think about your strategy here. In general the parallelism can be controlled by setting the number of partitions in the groupByKey operation. On
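A small Scala sketch of that last point, with made-up key/value shapes; the argument to groupByKey fixes the number of shuffle partitions, and hence the number of tasks, for that stage:

    // assumes an existing SparkContext named sc
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))   // (key, value)
    val grouped = pairs.groupByKey(numPartitions = 48)                // 48 reduce-side tasks
    grouped.count()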

Re: ideal number of executors per machine

2015-12-15 Thread Jerry Lam
Hi Veljko, I usually ask the following questions: “How much memory per task?” and then “How many CPUs per task?” and then I calculate based on the memory and cpu requirements per task. You might be surprised (maybe not you, but at least I am :) ) that many OOM issues are actually because of this. Best

Re: State management in spark-streaming

2015-12-15 Thread Tathagata Das
Well, trackStateByKey has been renamed to mapWithState in the upcoming 1.6. And regarding the use case, you can easily implement this with updateStateByKey. See https://github.com/apache/spark/blob/branch-1.5/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
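For reference, a stripped-down sketch of the updateStateByKey pattern from that example; the host/port, batch interval and checkpoint path are placeholders, and an existing SparkConf named conf is assumed:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoints")     // required for stateful transformations

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val runningCounts = words.map((_, 1)).updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) => Some(state.getOrElse(0) + newValues.sum)
    }
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()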

security testing on spark ?

2015-12-15 Thread Judy Nash
Hi all, Does anyone know of any effort from the community on security testing spark clusters? E.g. static source code analysis to find security flaws, penetration testing to identify ways to compromise a spark cluster, fuzzing to crash spark. Thanks, Judy

Pros and cons -Saving spark data in hive

2015-12-15 Thread Divya Gehlot
Hi, I am new to Spark and I am exploring the options and their pros and cons to see which one will work best in a spark and hive context. My dataset inputs are CSV files; I am using spark to process my data and saving it in hive using hivecontext. 1) Process the CSV file using the spark-csv package and create
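Assuming option 1, a rough sketch of what that load path might look like; the package version, path and table name are placeholders, and the spark-csv package must be on the classpath (e.g. via --packages com.databricks:spark-csv_2.10:1.3.0):

    // assumes an existing HiveContext named hiveContext
    val df = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/data/incoming/*.csv")

    // Persist into Hive; see the reply below about converting to ORC/Parquet first.
    df.write.mode("overwrite").saveAsTable("csv_staging")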

YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2015-12-15 Thread zml张明磊
Last night, I ran the jar in my pseudo-distributed mode without any WARN or ERROR. However, today I am getting the WARN below, directly leading to the ERROR. My computer memory is 8GB and I think that is not the issue the WARN log describes. What's wrong? The code hasn't changed yet. And the

Re: YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2015-12-15 Thread Jeff Zhang
>>> *15/12/16 10:22:01 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources* That means you don't have resources for your application, please check your hadoop web ui. On Wed, Dec 16,

Re: How to keep long running spark-shell but avoid hitting Java Out of Memory Exception: PermGen Space

2015-12-15 Thread Ted Yu
Can you post the class names in your graph? After zooming in on the picture, I can only see the package of the first class: scala.collection.immutable BTW which release of Spark are you using? Cheers On Tue, Dec 15, 2015 at 6:13 PM, yunshan wrote: > Hi, > > Recently, I

Benchmarking with multiple users in Spark

2015-12-15 Thread Rajesh Balamohan
Hi, I am currently using spark 1.5.2 and I have been able to run benchmarks in spark (SQL specifically) in single-user mode. For benchmarking with multiple users, I have tried some of the following approaches, but each has its own disadvantages: 1. Start the thrift server in Spark. - Execute

Re: Spark big rdd problem

2015-12-15 Thread Eran Witkon
If the problem is containers trying to use more memory than they are allowed, how do I limit them? I already have executor-memory 5G. Eran On Tue, 15 Dec 2015 at 23:10 Zhan Zhang wrote: > You should be able to get the logs from yarn by “yarn logs -applicationId > xxx”, where

about spark on hbase

2015-12-15 Thread censj
hi, all: how could I, through a spark function, get a value from hbase, then update this value and put it back to hbase?

Re: About Spark On Hbase

2015-12-15 Thread censj
hi, fight fate: Can I use a Get to read a value first inside the bulkPut() function, and then put that value to Hbase? > On Dec 9, 2015, at 16:02, censj wrote: > > Thank you! I know >> On Dec 9, 2015, at 15:59, fightf...@163.com wrote: >> >> If you are using maven , you

Re: SparkML algos limitations question.

2015-12-15 Thread Joseph Bradley
Hi Eugene, The maxDepth parameter exists because the implementation uses Integer node IDs which correspond to positions in the binary tree. This simplified the implementation. I'd like to eventually modify it to avoid depending on tree node IDs, but that is not yet on the roadmap. There is not

Re: Can't create UDF through thriftserver, no error reported

2015-12-15 Thread Jeff Zhang
It should be resolved by this ticket https://issues.apache.org/jira/browse/SPARK-11191 On Wed, Dec 16, 2015 at 3:14 AM, Antonio Piccolboni wrote: > Hi, > I am trying to create a UDF using the thiftserver. I followed this example >

looking for Spark streaming unit example written in Java

2015-12-15 Thread Andy Davidson
I am having a heck of a time writing a simple JUnit test for my spark streaming code. The best code example I have been able to find is http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ but unfortunately it is written in Spock and Scala. I am having trouble figuring out how to get it to

Re: how to spark streaming application start working on next batch before completing on previous batch .

2015-12-15 Thread Mukesh Jha
Are the issues related to WAL-based reliable Kafka receivers, or to any receiver in general? Any insights will be helpful. On 16 Dec 2015 05:44, "Tathagata Das" wrote: > Just to be clear, spark.streaming.concurrentJobs is NOT officially > supported. There are issues with

looking for a easier way to count the number of items in a JavaDStream

2015-12-15 Thread Andy Davidson
I am writing a JUnit test for some simple streaming code. I want to make assertions about how many things are in a given JavaDStream. I wonder if there is an easier way in Java to get the count? I think there are two points of friction. 1. is it easy to create an accumulator of type double or

Re: looking for Spark streaming unit example written in Java

2015-12-15 Thread Ted Yu
Have you taken a look at streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java ? JavaDStream stream = ssc.queueStream(rdds); JavaTestUtils.attachTestOutputStream(stream); FYI On Tue, Dec 15, 2015 at 6:36 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > I am

Re: hiveContext: storing lookup of partitions

2015-12-15 Thread Jeff Zhang
>>> Currently it takes around 1.5 hours for me just to cache in the partition information and after that I can see that the job gets queued in the SPARK UI. I guess you mean the stage of getting the split info. I suspect it might be your cluster issue (or metadata store); usually it won't take

Re: Pros and cons -Saving spark data in hive

2015-12-15 Thread Sabarish Sasidharan
If all you want to do is to load data into Hive, you don't need to use Spark. For subsequent query performance you would want to convert to ORC or Parquet when loading into Hive. Regards Sab On 16-Dec-2015 7:34 am, "Divya Gehlot" wrote: > Hi, > I am new bee to Spark
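A sketch of the conversion step being suggested; the table names are illustrative and a HiveContext is assumed, with df being the already-processed DataFrame:

    // Write the processed data as ORC so Hive queries benefit from
    // columnar storage and predicate pushdown.
    df.write
      .format("orc")
      .mode("append")
      .saveAsTable("events_orc")

    // or, targeting Parquet instead:
    df.write.format("parquet").mode("append").saveAsTable("events_parquet")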

Intercept in Linear Regression

2015-12-15 Thread Arunkumar Pillai
How to get intercept in Linear Regression Model? LinearRegressionWithSGD.train(parsedData, numIterations) -- Thanks and Regards Arun

Spark on YARN multitenancy

2015-12-15 Thread David Fox
Hello Spark experts, We are currently evaluating Spark on our cluster that already supports MRv2 over YARN. We have noticed a problem with running jobs concurrently, in particular that a running Spark job will not release its resources until the job is finished. Ideally, if two people run any

Re: Cluster mode dependent jars not working

2015-12-15 Thread Ted Yu
Please use --conf spark.executor.extraClassPath=XXX to specify dependent jars. On Tue, Dec 15, 2015 at 3:57 AM, vimal dinakaran wrote: > I am running spark using cluster mode for deployment . Below is the command > > >

Re: [Spark 1.5]: Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space -- Work in 1.4, but 1.5 doesn't

2015-12-15 Thread Deenar Toraskar
On 16 December 2015 at 06:19, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > Hi > > I had the same problem. There is a query with a lot of small tables (5x) > all below the broadcast threshold and Spark is broadcasting all these > tables together without checking if there is

Re: Spark big rdd problem

2015-12-15 Thread Eran Witkon
But what if I don't have more memory? On Wed, 16 Dec 2015 at 08:13 Zhan Zhang wrote: > There are two cases here. If the container is killed by yarn, you can > increase jvm overhead. Otherwise, you have to increase the executor-memory > if there is no memory leak

Need clarifications in Regression

2015-12-15 Thread Arunkumar Pillai
Hi The regression algorithm in MLlib uses a loss function to calculate the regression estimates, while R uses the matrix method to calculate the estimates. I see some difference between the results of both Spark and R. I was using the following class: LinearRegressionWithSGD.train(parsedData,

Re: Intercept in Linear Regression

2015-12-15 Thread Denny Lee
If you're using model = LinearRegressionWithSGD.train(parseddata, iterations=100, step=0.01, intercept=True) then to get the intercept, you would use model.intercept More information can be found at:
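The call above is the PySpark form; a sketch of the Scala/MLlib equivalent, since the static train() used in the original question has no intercept flag and the intercept is instead enabled on an algorithm instance:

    import org.apache.spark.mllib.regression.LinearRegressionWithSGD

    val algo = new LinearRegressionWithSGD()
    algo.setIntercept(true)
    algo.optimizer.setNumIterations(100).setStepSize(0.01)

    val model = algo.run(parsedData)          // parsedData: RDD[LabeledPoint], as in the question
    println(s"intercept = ${model.intercept}")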

Re: ideal number of executors per machine

2015-12-15 Thread Sean Owen
1 per machine is the right number. If you are running very large heaps (>64GB) you may consider multiple per machine just to make sure each one's GC pauses aren't excessive, but even this might be better mitigated with GC tuning. On Tue, Dec 15, 2015 at 9:07 PM, Veljko Skarich

NPE in using AvroKeyValueInputFormat for newAPIHadoopFile

2015-12-15 Thread Jinyuan Zhou
Hi, I tried to load avro files in hdfs but keep getting NPE. I am using AvroKeyValueInputFormat inside newAPIHadoopFile method. Anyone have any clue? Here is stack trace Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 4

Re: Spark big rdd problem

2015-12-15 Thread Zhan Zhang
There are two cases here. If the container is killed by yarn, you can increase jvm overhead. Otherwise, you have to increase the executor-memory if there is no memory leak happening. Thanks. Zhan Zhang On Dec 15, 2015, at 9:58 PM, Eran Witkon
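Following up on the question above about limiting memory, a sketch of the relevant knobs (values are illustrative); on YARN the container limit is the executor memory plus the off-heap overhead:

    // assumes these are set before the SparkContext is created
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "5g")
      // Extra off-heap room YARN accounts for per executor container;
      // the Spark 1.5 default is max(384 MB, 10% of executor memory).
      .set("spark.yarn.executor.memoryOverhead", "1024")   // in MB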

Re: [Spark 1.5]: Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space -- Work in 1.4, but 1.5 doesn't

2015-12-15 Thread Deenar Toraskar
Hi I have created an issue for this https://issues.apache.org/jira/browse/SPARK-12358 Regards Deenar On 16 December 2015 at 06:21, Deenar Toraskar wrote: > > > On 16 December 2015 at 06:19, Deenar Toraskar < > deenar.toras...@thinkreactive.co.uk> wrote: > >> Hi >>

which aws instance type for shuffle performance

2015-12-15 Thread Rastan Boroujerdi
I'm trying to determine whether I should be using 10 r3.8xlarge or 40 r3.2xlarge. I'm mostly concerned with shuffle performance of the application. If I go with r3.8xlarge I will need to configure 4 worker instances per machine to keep the JVM size down. The worker instances will likely contend