Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04

2015-03-24 Thread Steve Loughran
On 24 Mar 2015, at 02:10, Marcelo Vanzin van...@cloudera.com wrote: This happens most probably because the Spark 1.3 you have downloaded is built against an older version of the Hadoop libraries than those used by CDH, and those libraries cannot parse the container IDs generated by CDH.

Spark as a service

2015-03-24 Thread Ashish Mukherjee
Hello, as of now, if I have to execute a Spark job, I need to create a jar and deploy it. If I need to run dynamically formed SQL from a Web application, is there any way of using SparkSQL in this manner? Perhaps through a Web Service or something similar. Regards, Ashish

Standalone Scheduler VS YARN Performance

2015-03-24 Thread Harut Martirosyan
What is the performance overhead caused by YARN, or what configurations are changed when the app is run through YARN? The following example: sqlContext.sql("SELECT dayStamp(date), count(distinct deviceId) AS c FROM full GROUP BY dayStamp(date) ORDER BY c DESC LIMIT 10").collect() runs on shell

Re: issue while creating spark context

2015-03-24 Thread Sean Owen
That's probably the problem; the intended path is on HDFS but the configuration specifies a local path. See the exception message. On Tue, Mar 24, 2015 at 1:08 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It's in your local file system, not in HDFS. Thanks Best Regards On Tue, Mar 24,

Re: issue while creating spark context

2015-03-24 Thread Sachin Singh
Thanks Sean. Can you please suggest in which file or configuration I need to set the proper path? Please elaborate, as that may help. Thanks, Regards Sachin On Tue, Mar 24, 2015 at 7:15 PM, Sean Owen so...@cloudera.com wrote: That's probably the problem; the intended path is on HDFS but the

Re: Spark streaming alerting

2015-03-24 Thread Helena Edelson
Streaming _from_ Cassandra, CassandraInputDStream, is coming BTW: https://issues.apache.org/jira/browse/SPARK-6283 I am working on it now. Helena @helenaedelson On Mar 23, 2015, at 5:22 AM, Khanderao Kand Gmail khanderao.k...@gmail.com wrote:

Re: issue while creating spark context

2015-03-24 Thread Sachin Singh
Hi Akhil, thanks for your quick reply. Could you please elaborate, i.e. what kind of permission is required? Thanks in advance, Regards Sachin On Tue, Mar 24, 2015 at 5:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It's an IOException; just make sure you have the correct

Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Spico Florin
Hello! I would like to know: what is the optimal solution for getting the header from a CSV file with Spark? My approach was: def getHeader(data: RDD[String]): String = { data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString() } Thanks.

EC2 Having script run at startup

2015-03-24 Thread Theodore Vasiloudis
Hello, in the context of SPARK-2394 (Make it easier to read LZO-compressed files from EC2 clusters, https://issues.apache.org/jira/browse/SPARK-2394), I was wondering: is there an easy way to make a user-provided script run on every machine in a cluster launched on EC2? Regards, Theodore

Re: issue while creating spark context

2015-03-24 Thread Sachin Singh
Hi, I can see the required permission is granted for this directory, as below: hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it. Found 1 items drwxrwxrwt - spark spark 0 2015-03-20 01:04

Re: issue while creating spark context

2015-03-24 Thread Akhil Das
It's in your local file system, not in HDFS. Thanks Best Regards On Tue, Mar 24, 2015 at 6:25 PM, Sachin Singh sachin.sha...@gmail.com wrote: Hi, I can see the required permission is granted for this directory, as below: hadoop dfs -ls /user/spark DEPRECATED: Use of this script to execute hdfs

Re: issue while creating spark context

2015-03-24 Thread Akhil Das
Write permission, as it's clearly saying: java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory/application_1427194309307_0005 Thanks Best Regards On Tue, Mar 24, 2015 at 6:08 PM, Sachin Singh sachin.sha...@gmail.com wrote: Hi Akhil, thanks for your

issue while creating spark context

2015-03-24 Thread sachin Singh
Hi all, all of a sudden I am getting the below error when I submit a Spark job using master as yarn; it's not able to create the Spark context. It was previously working fine. I am using CDH 5.3.1 and creating a JavaHiveContext. spark-submit --jars

Re: difference in PCA of MLlib vs H2O in R

2015-03-24 Thread Sean Owen
Those implementations are computing an SVD of the input matrix directly, and while you generally need the columns to have mean 0, you can turn that off with the options you cite. I don't think this is possible in the MLlib implementation, since it is computing the principal components by

Re: Spark as a service

2015-03-24 Thread Jeffrey Jedele
Hi Ashish, this might be what you're looking for: https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server Regards, Jeff 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com: Hello, As of now, if I have to execute a Spark job, I
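
In outline, the linked guide amounts to starting the Thrift server and pointing any JDBC client at it; a minimal sketch, where the master URL is illustrative and 10000 is the default port:

    # start the JDBC/ODBC (Thrift) server bundled with Spark
    ./sbin/start-thriftserver.sh --master spark://servername:7077
    # connect from any JDBC client, e.g. the bundled beeline shell
    ./bin/beeline -u jdbc:hive2://localhost:10000

From there, the web application can submit dynamically formed SQL over a plain JDBC connection instead of deploying jars.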

Re: Question about Data Sources API

2015-03-24 Thread Ashish Mukherjee
Hello Michael, thanks for your quick reply. My question wrt Java/Scala was related to extending the classes to support new custom data sources, so I was wondering if those could be written in Java, since our company is a Java shop. The additional push-downs I am looking for are aggregations with

Re: Measure Bytes Read and Peak Memory Usage for Query

2015-03-24 Thread anamika gupta
Yeah thanks, I can now see the memory usage. Please also verify: is bytes read == combined size of all RDDs? Actually, all my RDDs are completely cached in memory, so combined size of my RDDs = mem used (verified from the web UI). On Fri, Mar 20, 2015 at 12:07 PM, Akhil Das

Re: issue while creating spark context

2015-03-24 Thread Akhil Das
It's an IOException; just make sure you have the correct permissions on the /user/spark directory. Thanks Best Regards On Tue, Mar 24, 2015 at 5:21 PM, sachin Singh sachin.sha...@gmail.com wrote: Hi all, all of a sudden I am getting the below error when I submit a Spark job using master as

How to deploy binary dependencies to workers?

2015-03-24 Thread Xi Shen
Hi, I am doing ML using Spark MLlib. However, I do not have full control of the cluster; I am using Microsoft Azure HDInsight. I want to deploy BLAS or whatever dependencies are required to accelerate the computation, but I don't know how to deploy those DLLs when I submit my JAR to the cluster.

Re: Spark as a service

2015-03-24 Thread Jeffrey Jedele
I don't think there's a general approach to that - the use cases are just too different. If you really need it, you will probably have to implement it yourself in the driver of your application. PS: Make sure to use the reply-to-all button so that the mailing list is included in your reply.

Re: Spark as a service

2015-03-24 Thread Todd Nist
Perhaps this project, https://github.com/calrissian/spark-jetty-server, could help with your requirements. On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: I don't think there's a general approach to that - the use cases are just too different. If you really need

1.3 Hadoop File System problem

2015-03-24 Thread Jim Carroll
I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to find the s3 Hadoop file system. I get java.lang.IllegalArgumentException: Wrong FS: s3://[path to my file], expected: file:/// when I try to save a parquet file. This worked in 1.2.1. Has anyone else seen this? I'm

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04

2015-03-24 Thread Manoj Samel
Thanks Marcelo - I was using the SBT-built Spark per the earlier thread. I switched now to the distro (with the conf changes for the CDH path in front) and the guava issue is gone. Thanks, On Tue, Mar 24, 2015 at 1:50 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi there, On Tue, Mar 24, 2015 at 1:40

Re: difference in PCA of MLlib vs H2O in R

2015-03-24 Thread roni
Reza, that SVD.V matches the H2O and R prcomp (non-centered) output. Thanks -R On Tue, Mar 24, 2015 at 11:38 AM, Sean Owen so...@cloudera.com wrote: (Oh sorry, I've only been thinking of TallSkinnySVD) On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote: If you want to do a

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Matt Silvey
My memory is hazy on this but aren't there hidden limitations to Linux-based threads? I ran into some issues a couple of years ago where, and here is the fuzzy part, the kernel wants to reserve virtual memory per thread equal to the stack size. When the total amount of reserved memory (not

Re: SparkSQL UDTs with Ordering

2015-03-24 Thread Patrick Woody
Awesome. Yep - I have seen the warnings on UDTs; happy to keep up with the API changes :). Would this be a reasonable PR to toss up despite the API instability, or would you prefer it to wait? Thanks -Pat On Tue, Mar 24, 2015 at 7:44 PM, Michael Armbrust mich...@databricks.com wrote: I'll

Re: difference in PCA of MLlib vs H2O in R

2015-03-24 Thread Reza Zadeh
Great! On Tue, Mar 24, 2015 at 2:53 PM, roni roni.epi...@gmail.com wrote: Reza, That SVD.v matches the H2o and R prComp (non-centered) Thanks -R On Tue, Mar 24, 2015 at 11:38 AM, Sean Owen so...@cloudera.com wrote: (Oh sorry, I've only been thinking of TallSkinnySVD) On Tue, Mar 24,

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-24 Thread David Holiday
Hi all, I've got a vagrant image with Spark Notebook, Spark, Accumulo, and Hadoop all running. From the notebook I can manually create a scanner and pull test data from a table I created using one of the Accumulo examples: val instanceNameS = "accumulo" val zooServersS = "localhost:2181" val instance:

Re: Spark as a service

2015-03-24 Thread Irfan Ahmad
Also look at the spark-kernel and spark job server projects. Irfan On Mar 24, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote: Perhaps this project, https://github.com/calrissian/spark-jetty-server, could help with your requirements. On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele

Re: How to avoid being killed by YARN node manager ?

2015-03-24 Thread Sandy Ryza
Hi Yuichiro, The way to avoid this is to boost spark.yarn.executor.memoryOverhead until the executors have enough off-heap memory to avoid going over their limits. -Sandy On Tue, Mar 24, 2015 at 11:49 AM, Yuichiro Sakamoto ks...@muc.biglobe.ne.jp wrote: Hello. We use ALS(Collaborative
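
As a sketch, the setting from this reply can be passed at submit time; the value and the class/jar names below are illustrative:

    # 1024 MB of off-heap headroom per executor; tune upward until the kills stop
    spark-submit --master yarn-cluster \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      --class com.example.ALSJob \
      als-job.jar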

Re: spark disk-to-disk

2015-03-24 Thread Koert Kuipers
Imran, great, I will take a look at the pull request. Seems we are interested in similar things. On Tue, Mar 24, 2015 at 11:00 AM, Imran Rashid iras...@cloudera.com wrote: I think writing to hdfs and reading it back again is totally reasonable. In fact, in my experience, writing to hdfs and reading

Re: How to deploy binary dependencies to workers?

2015-03-24 Thread Dean Wampler
Both spark-submit and spark-shell have a --jars option for passing additional jars to the cluster. They will be added to the appropriate classpaths. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com
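
A hedged example of the --jars flag mentioned here (jar and class names are hypothetical; note that --jars ships jars, so native libraries would need to be packaged inside one):

    # the listed jars are added to driver and executor classpaths
    spark-submit --class com.example.MLJob \
      --jars netlib-native.jar,extra-deps.jar \
      my-ml-job.jar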

How to avoid being killed by YARN node manager ?

2015-03-24 Thread Yuichiro Sakamoto
Hello. We use ALS (collaborative filtering) from Spark MLlib on YARN. The Spark version is 1.2.0, included in CDH 5.3.1. 1,000,000,000 records (5,000,000 users' data and 5,000,000 items' data) are used for machine learning with ALS. These large quantities of data increase virtual memory usage, and the node manager

Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-24 Thread Emre Sevinc
Hello Sandy, Thank you for your explanation. Then I would at least expect that to be consistent across local, yarn-client, and yarn-cluster modes. (And not lead to the case where it somehow works in two of them, and not for the third). Kind regards, Emre Sevinç http://www.bigindustries.be/ On

Re: spark disk-to-disk

2015-03-24 Thread Imran Rashid
I think writing to HDFS and reading it back again is totally reasonable. In fact, in my experience, writing to HDFS and reading back in actually gives you a good opportunity to handle some other issues as well: a) instead of just writing as an object file, I've found it's helpful to write in a
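
A minimal sketch of the write-then-read-back pattern, assuming an existing SparkContext sc; the path, the expensiveTransform function, and the MyRecord type are illustrative:

    // materialize an intermediate result to HDFS, then let later jobs start from it
    val processed = input.map(expensiveTransform)
    processed.saveAsObjectFile("hdfs:///tmp/stage1")
    // in a later job (or after a failure), pick up where the write left off
    val reloaded = sc.objectFile[MyRecord]("hdfs:///tmp/stage1")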

Re: Spark streaming alerting

2015-03-24 Thread Anwar Rizal
Helena, the CassandraInputDStream sounds interesting. I don't find much in the JIRA though. Do you have more details on what it tries to achieve? Thanks, Anwar. On Tue, Mar 24, 2015 at 2:39 PM, Helena Edelson helena.edel...@datastax.com wrote: Streaming _from_ cassandra,

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-24 Thread Doug Balog
I found the problem. In mapred-site.xml, mapreduce.application.classpath has references to "${hdp.version}" which is not getting replaced when launch_container.sh is created. The executor fails with a substitution error at line 27 in launch_container.sh because bash can't deal with

Re: Why doesn't the --conf parameter work in yarn-cluster mode (but works in yarn-client and local)?

2015-03-24 Thread Sandy Ryza
Ah, yes, I believe this is because only properties prefixed with "spark." get passed on. The purpose of the --conf option is to allow passing Spark properties to the SparkConf, not to add general key-value pairs to the JVM system properties. -Sandy On Tue, Mar 24, 2015 at 4:25 AM, Emre Sevinc

Re: Spark streaming alerting

2015-03-24 Thread Helena Edelson
I created a JIRA ticket for my work in both the Spark and spark-cassandra-connector JIRAs; I don't know why you cannot see them. Users can stream from any Cassandra table, just as one can stream from a Kafka topic; same principle. Helena @helenaedelson On Mar 24, 2015, at 11:29 AM, Anwar

Does HiveContext connect to HiveServer2?

2015-03-24 Thread nitinkak001
I am wondering if HiveContext connects to HiveServer2 or whether it works through the Hive CLI. The reason I am asking is that Cloudera has deprecated the Hive CLI. If the connection is through HiveServer2, is there a way to specify user credentials?

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04

2015-03-24 Thread Sandy Ryza
Steve, that's correct, but the problem only shows up when different versions of the YARN jars are included on the classpath. -Sandy On Tue, Mar 24, 2015 at 6:29 AM, Steve Loughran ste...@hortonworks.com wrote: On 24 Mar 2015, at 02:10, Marcelo Vanzin van...@cloudera.com wrote: This

filter expression in API document for DataFrame

2015-03-24 Thread SK
The following statement appears in the Scala API example at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame: people.filter("age > 30"). I tried this example and it gave a compilation error. I think this needs to be changed to people.filter(people("age") > 30)

GraphX gets slower as the iteration number increases

2015-03-24 Thread orangepri...@foxmail.com
I'm working with GraphX to calculate the PageRank of an extremely large social network with billions of vertices. As the iteration number increases, the speed of each iteration becomes slower and unacceptable. Is there any reason for this? How can I accelerate the iteration process?

Re: 1.3 Hadoop File System problem

2015-03-24 Thread Patrick Wendell
Hey Jim, Thanks for reporting this. Can you give a small end-to-end code example that reproduces it? If so, we can definitely fix it. - Patrick On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote: I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to

Re: Graphx gets slower as the iteration number increases

2015-03-24 Thread Ankur Dave
This might be because partitions are getting dropped from memory and needing to be recomputed. How much memory is in the cluster, and how large are the partitions? This information should be in the Executors and Storage pages in the web UI. Ankur http://www.ankurdave.com/ On Tue, Mar 24, 2015 at

Re: 1.3 Hadoop File System problem

2015-03-24 Thread Michael Armbrust
You are probably hitting SPARK-6351 https://issues.apache.org/jira/browse/SPARK-6351, which will be fixed in 1.3.1 (hopefully cutting an RC this week). On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote: I have code that works under 1.2.1 but when I upgraded to 1.3.0 it

column expression in left outer join for DataFrame

2015-03-24 Thread SK
Hi, I am trying to port some code that was working in Spark 1.2.0 on the latest version, Spark 1.3.0. This code involves a left outer join between two SchemaRDDs which I am now trying to change to a left outer join between 2 DataFrames. I followed the example for left outer join of DataFrame at

Re: column expression in left outer join for DataFrame

2015-03-24 Thread Michael Armbrust
You need to use `===` so that you are constructing a column expression instead of evaluating the standard Scala equality method. Calling methods to access columns (i.e. df.country) is only supported in Python. val join_df = df1.join(df2, df1("country") === df2("country"), "left_outer") On Tue, Mar

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
So: 1. I reduced my -XX:ThreadStackSize to 5m (instead of 10m; the default is 1m), which is still OK for my needs. 2. I reduced the executor memory to 44GB for a 60GB machine (instead of 49GB). This seems to have helped. Thanks to Matthew and Sean. Thomas On Tue, Mar 24, 2015 at 3:49 PM, Matt
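
For reference, a sketch of applying both changes from this thread at submit time; -Xss5m is the conventional way to request a 5 MB thread stack (the same intent as the ThreadStackSize setting above), and the jar name is hypothetical:

    spark-submit \
      --conf spark.executor.extraJavaOptions=-Xss5m \
      --conf spark.executor.memory=44g \
      my-job.jar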

Re: issue while creating spark context

2015-03-24 Thread Sachin Singh
Thanks Sean and Akhil, I changed the permission of /user/spark/applicationHistory, and now it works. On Tue, Mar 24, 2015 at 7:35 PM, Sachin Singh sachin.sha...@gmail.com wrote: Thanks Sean. Can you please suggest in which file or configuration I need to set the proper path? Please

Re: Errors in SPARK

2015-03-24 Thread Denny Lee
The error you're seeing typically means that you cannot connect to the Hive metastore itself. Some quick thoughts: - If you were to run "show tables" (instead of the CREATE TABLE statement), are you still getting the same error? - To confirm, the Hive metastore (MySQL database) is up and running

Re: Errors in SPARK

2015-03-24 Thread sandeep vura
Hi Denny, still facing the same issue. Please find the following errors: scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4e4f880c scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS

Re: Errors in SPARK

2015-03-24 Thread sandeep vura
No, I am just running the ./spark-shell command in a terminal. I will try with the above command. On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee denny.g@gmail.com wrote: Did you include the connection to a MySQL connector jar so that spark-shell / hive can connect to the metastore? For example, when

Re: Errors in SPARK

2015-03-24 Thread Denny Lee
Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/mysql-connector-java-5.1.27.jar

Hadoop 2.5 not listed in Spark 1.4 build page

2015-03-24 Thread Manoj Samel
http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn does not list Hadoop 2.5 in the Hadoop version table etc. I assume it is still OK to compile with -Pyarn -Phadoop-2.5 for use with Hadoop 2.5 (CDH 5.3.2). Thanks,

Re: Weird exception in Spark job

2015-03-24 Thread nitinkak001
Any ideas on this?

Re: Does HiveContext connect to HiveServer2?

2015-03-24 Thread Marcelo Vanzin
spark-submit --files /path/to/hive-site.xml On Tue, Mar 24, 2015 at 10:31 AM, Udit Mehta ume...@groupon.com wrote: Another question related to this, how can we propagate the hive-site.xml to all workers when running in the yarn cluster mode? On Tue, Mar 24, 2015 at 10:09 AM, Marcelo Vanzin

Dataframe groupby custom functions (python)

2015-03-24 Thread jamborta
Hi all, I have been trying out the new DataFrame API in 1.3, which looks great by the way. I have found an example to define UDFs and add them to select operations, like this: slen = F.udf(lambda s: len(s), IntegerType()) df.select(df.age, slen(df.name).alias('slen')).collect() Is it possible

Re: Dataframe groupby custom functions (python)

2015-03-24 Thread Michael Armbrust
The only UDAFs that we support today are those defined using the Hive UDAF API. Otherwise you'll have to drop into Spark operations. I'd suggest opening a JIRA. On Tue, Mar 24, 2015 at 10:49 AM, jamborta jambo...@gmail.com wrote: Hi all, I have been trying out the new dataframe api in 1.3,

Re: difference in PCA of MLlib vs H2O in R

2015-03-24 Thread Reza Zadeh
If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote: Those implementations are computing

Re: difference in PCA of MLlib vs H2O in R

2015-03-24 Thread Sean Owen
(Oh sorry, I've only been thinking of TallSkinnySVD) On Tue, Mar 24, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote: If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the

Re: How to deploy binary dependencies to workers?

2015-03-24 Thread DB Tsai
I would recommend uploading those jars to HDFS, and using the --jars option in spark-submit with a URI from HDFS instead of a URI from the local filesystem. This avoids the problem of fetching jars from the driver, which can be a bottleneck. Sincerely, DB Tsai
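
A sketch of that suggestion, with hypothetical jar names and paths:

    # upload the dependency jars to HDFS once...
    hdfs dfs -put extra-deps.jar /libs/extra-deps.jar
    # ...then reference them by HDFS URI so executors fetch them directly
    spark-submit --jars hdfs:///libs/extra-deps.jar my-ml-job.jar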

Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Sean Owen
I think this works in practice, but I don't know that the first block of the file is guaranteed to be in the first partition? Certainly later down the pipeline that won't be true, but presumably this is happening right after reading the file. I've always just written some filter that would only

Re: Does HiveContext connect to HiveServer2?

2015-03-24 Thread Marcelo Vanzin
It does neither. If you provide a Hive configuration to Spark, HiveContext will connect to your metastore server, otherwise it will create its own metastore in the working directory (IIRC). On Tue, Mar 24, 2015 at 8:58 AM, nitinkak001 nitinkak...@gmail.com wrote: I am wondering if HiveContext

Re: Is yarn-standalone mode deprecated?

2015-03-24 Thread Sandy Ryza
I checked and apparently it hasn't been released yet. It will be available in the upcoming CDH 5.4 release. -Sandy On Mon, Mar 23, 2015 at 1:32 PM, Nitin kak nitinkak...@gmail.com wrote: I know there was an effort for this; do you know in which version of the Cloudera distribution we could find it?

akka.version error

2015-03-24 Thread Mohit Anchlia
I am facing the same issue as listed here: http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a-spark-job-using-maven-td5615.html The solution mentioned is here: https://gist.github.com/prb/d776a47bd164f704eecb However, I think I don't understand a few things: 1) Why are the jars being split

What his the ideal method to interact with Spark Cluster from a Cloud App?

2015-03-24 Thread Noorul Islam K M
Hi all, we have a cloud application to which we are adding a reporting service. For this we have narrowed down to using Cassandra + Spark for data store and processing respectively. Since the cloud application is separate from the Cassandra + Spark deployment, what is the ideal method to interact with Spark

Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Dean Wampler
Good point. There's no guarantee that you'll get the actual first partition - one reason why I wouldn't allow a CSV header line in a real data file, if I could avoid it. Back to Spark: a safer approach is RDD.foreachPartition, which takes a function expecting an iterator. You'll only need to grab
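
One common variant of this partition-aware idea, sketched here with mapPartitionsWithIndex rather than foreachPartition, and still subject to Sean's caveat that partition 0 may not hold the header:

    // drop the first line of partition 0 only; all other partitions pass through untouched
    val noHeader = data.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter
    }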

Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-24 Thread Marcelo Vanzin
From the exception it seems like your app is also repackaging Scala classes somehow. Can you double-check that and remove the Scala classes from your app if they're there? On Mon, Mar 23, 2015 at 10:07 PM, Alexey Zinoviev alexey.zinov...@gmail.com wrote: Thanks Marcelo, this option solved the

CombineByKey - Please explain its working

2015-03-24 Thread ashish.usoni
I am reading about combineByKey and going through the below example from a blog post, but I can't understand how it works step by step. Can someone please explain? case class Fruit(kind: String, weight: Int) { def makeJuice: Juice = Juice(weight * 100) } case
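
For readers of the archive, a minimal sketch of how combineByKey's three functions fit together, loosely following the Fruit example (assumes a SparkContext sc; the data values are made up):

    case class Fruit(kind: String, weight: Int)
    val fruits = sc.parallelize(Seq(
      ("apple", Fruit("apple", 2)), ("apple", Fruit("apple", 3)), ("pear", Fruit("pear", 1))))
    val juiceByKind = fruits.combineByKey(
      (f: Fruit) => f.weight * 100,                  // createCombiner: first fruit seen for a key
      (acc: Int, f: Fruit) => acc + f.weight * 100,  // mergeValue: fold another fruit into the partition-local total
      (a: Int, b: Int) => a + b)                     // mergeCombiners: combine totals across partitions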

Re: Question about Data Sources API

2015-03-24 Thread Michael Armbrust
My question wrt Java/Scala was related to extending the classes to support new custom data sources, so was wondering if those could be written in Java, since our company is a Java shop. Yes, you should be able to extend the required interfaces using Java. The additional push downs I am

Spark GraphX In Action on documentation page?

2015-03-24 Thread Michael Malak
Can my new book, Spark GraphX In Action, which is currently in MEAP http://manning.com/malak/, be added to https://spark.apache.org/documentation.html and, if appropriate, to https://spark.apache.org/graphx/ ? Michael Malak -

Re: Spark-thriftserver Issue

2015-03-24 Thread Anubhav Agarwal
Zhan, specifying the port fixed the port issue. Is it possible to specify the log directory while starting the Spark thriftserver? I'm still getting this error even though the folder exists and everyone has permission to use that directory. drwxr-xr-x 2 root root 4096 Mar 24 19:04

FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-24 Thread , Roy
I get the following message each time I run a Spark job: 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use The full trace is here: http://pastebin.com/xSvRN01f How do I fix this? I am on CDH 5.3.1

Re: SparkSQL UDTs with Ordering

2015-03-24 Thread Michael Armbrust
I'll caution that the UDTs are not a stable public interface yet. We'd like to do this someday, but currently this feature is mostly for MLlib, as we have not finalized the API. Having an ordering could be useful, but I'll add that currently UDTs actually exist in serialized form, so the ordering

Spark Application Hung

2015-03-24 Thread Ashish Rawat
Hi, we are observing a hung Spark application when one of the YARN datanodes (running multiple Spark executors) goes down. Setup details: Spark: 1.2.1; Hadoop: 2.4.0; Spark application mode: yarn-client; 2 datanodes (DN1, DN2); 6 Spark executors (initially 3 executors on

Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath
You can indeed override the Hadoop configuration at a per-RDD level - though it is a little more verbose, as in the below example, and you need to effectively make a copy of the hadoop Configuration: val thisRDDConf = new Configuration(sc.hadoopConfiguration)
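
The preview cuts the example off; a hedged reconstruction of the pattern it describes, where the split-size key and input path are illustrative:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // copy the SparkContext's Hadoop conf so the override applies to this RDD only
    val thisRDDConf = new Configuration(sc.hadoopConfiguration)
    thisRDDConf.set("mapreduce.input.fileinputformat.split.maxsize", "33554432")
    val records = sc.newAPIHadoopFile("hdfs:///data/input",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text], thisRDDConf)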

SparkSQL UDTs with Ordering

2015-03-24 Thread Patrick Woody
Hey all, Currently looking into UDTs and I was wondering if it is reasonable to add the ability to define an Ordering (or if this is possible, then how)? Currently it will throw an error when non-Native types are used. Thanks! -Pat

java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
Hello, I am seeing various crashes in spark on large jobs which all share a similar exception: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) I increased nproc (i.e. ulimit -u) 10 fold, but it

Re: Spark-thriftserver Issue

2015-03-24 Thread Zhan Zhang
You can try to set it in spark-env.sh. # - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs) # - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp) Thanks. Zhan Zhang On Mar 24, 2015, at 12:10 PM, Anubhav Agarwal

difference in PCA of MLlib vs H2O in R

2015-03-24 Thread roni
I am trying to compute PCA using computePrincipalComponents. I also computed PCA using H2O in R and R's prcomp. The answers I get from H2O and R's prcomp (non-H2O) are the same when I set the options for H2O as standardized=FALSE and for R's prcomp as center=FALSE. How do I make sure that the

Re: Hive context datanucleus error

2015-03-24 Thread Udit Mehta
Has this issue been fixed in Spark 1.2: https://issues.apache.org/jira/browse/SPARK-2624 ? On Mon, Mar 23, 2015 at 9:19 PM, Udit Mehta ume...@groupon.com wrote: I am trying to run a simple query to view tables in my Hive metastore using a Hive context. I am getting this error: spark Persistence

Re: Standalone Scheduler VS YARN Performance

2015-03-24 Thread Denny Lee
By any chance does this thread address look similar: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html ? On Tue, Mar 24, 2015 at 5:23 AM Harut Martirosyan harut.martiros...@gmail.com wrote: What is performance overhead caused by YARN,

Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Dean Wampler
Instead of data.zipWithIndex().filter(_._2 == 0), which will cause Spark to read the whole file, use data.take(1), which is simpler. From the RDD.take documentation: it works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions
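
That approach as a two-line sketch:

    val header = data.take(1)(0)         // scans only as many partitions as needed
    val rows = data.filter(_ != header)  // drops every line equal to the header, wherever it appears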

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04

2015-03-24 Thread Manoj Samel
Thanks all - perhaps I misread the earlier posts as dependencies on the Hadoop version, but the key is also CDH 5.3.2 (not just Hadoop 2.5 vs 2.4) etc. After adding the classpath as Marcelo/Harsh suggested (loading the CDH libs in front), I am able to get spark-shell started without the invalid container

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04

2015-03-24 Thread Marcelo Vanzin
Hi there, On Tue, Mar 24, 2015 at 1:40 PM, Manoj Samel manojsamelt...@gmail.com wrote: When I run any query, it gives java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; Are you running a custom-compiled Spark by any chance?

updateStateByKey - Seq[V] order

2015-03-24 Thread Adrian Mocanu
Hi, does updateStateByKey pass elements to updateFunc (in Seq[V]) in the order in which they appear in the RDD? My guess is no, which means updateFunc needs to be commutative. Am I correct? I've asked this question before but there were no takers. Here are the Scala docs for updateStateByKey: /** *
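
A sketch of a commutative updateFunc, for which the ordering of Seq[V] is irrelevant; assumes a DStream[(String, Int)] named pairs and a configured checkpoint directory:

    // summing is order-insensitive, so the Seq[V] ordering question goes away
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + values.sum)
    val runningTotals = pairs.updateStateByKey[Int](updateFunc)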

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Sean Owen
I doubt you're hitting the limit of threads you can spawn, but as you say, you may be running out of memory that the JVM process is allowed to allocate, since your threads are grabbing stacks 10x bigger than usual. The thread stacks are 4GB by themselves. I suppose you could avoid upping the stack size so much?

Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how?

2015-03-24 Thread Jon Chase
Shahab - this should do the trick until Hao's changes are out: sqlContext.sql("create temporary function foobar as 'com.myco.FoobarUDAF'"); sqlContext.sql("select foobar(some_column) from some_table"); This works without requiring you to 'deploy' a JAR with the UDAF in it - just make sure the UDAF

Re: Question about Data Sources API

2015-03-24 Thread Michael Armbrust
On Tue, Mar 24, 2015 at 12:57 AM, Ashish Mukherjee ashish.mukher...@gmail.com wrote: 1. Is the Data Source API stable as of Spark 1.3.0? It is marked DeveloperApi, but in general we do not plan to change even these APIs unless there is a very compelling reason to. 2. The Data Source API

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, I am seeing various crashes

Re: Hadoop 2.5 not listed in Spark 1.4 build page

2015-03-24 Thread Denny Lee
Hadoop 2.5 would be referenced via -Dhadoop-2.5 using the profile -Phadoop-2.4. Please note earlier in the link the section: # Apache Hadoop 2.4.X or 2.5.X mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Versions of Hadoop after 2.5.X may or may not work with the

Re: Hadoop 2.5 not listed in Spark 1.4 build page

2015-03-24 Thread Sean Owen
The right invocation is still a bit different: ... -Phadoop-2.4 -Dhadoop.version=2.5.0 (hadoop-2.4 == Hadoop 2.4+) On Tue, Mar 24, 2015 at 5:44 PM, Denny Lee denny.g@gmail.com wrote: Hadoop 2.5 would be referenced via -Dhadoop-2.5 using the profile -Phadoop-2.4. Please note earlier in
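
Putting this correction together with the command quoted from the docs, the full invocation would be:

    # hadoop-2.4 is the profile for Hadoop 2.4 and later; the exact version goes in hadoop.version
    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package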

Re: Does HiveContext connect to HiveServer2?

2015-03-24 Thread Udit Mehta
Another question related to this, how can we propagate the hive-site.xml to all workers when running in the yarn cluster mode? On Tue, Mar 24, 2015 at 10:09 AM, Marcelo Vanzin van...@cloudera.com wrote: It does neither. If you provide a Hive configuration to Spark, HiveContext will connect to

spark worker on mesos slave | possible networking config issue

2015-03-24 Thread Anirudha Jadhav
Is there some setting I am missing? This is my spark-env.sh: export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz export SPARK_LOCAL_IP=127.0.0.1 Here is what I see on the slave node. less

Spark SQL: Day of month from Timestamp

2015-03-24 Thread Harut Martirosyan
Hi guys. Basically, we had to define a UDF that does that; is there a built-in function that we can use for it? -- RGRDZ Harut

Re: Spark SQL: Day of month from Timestamp

2015-03-24 Thread Arush Kharbanda
Hi, you can use functions like year(date), month(date). Thanks Arush On Tue, Mar 24, 2015 at 12:46 PM, Harut Martirosyan harut.martiros...@gmail.com wrote: Hi guys. Basically, we had to define a UDF that does that; is there a built-in function that we can use for it? -- RGRDZ Harut --
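
These are Hive built-ins, so with a HiveContext they can be used directly in SQL; a sketch assuming a registered table events with a timestamp column ts:

    // dayofmonth() and day() come from Hive's function registry; no custom UDF needed
    sqlContext.sql("SELECT dayofmonth(ts) AS dom, count(*) FROM events GROUP BY dayofmonth(ts)")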

Question about Data Sources API

2015-03-24 Thread Ashish Mukherjee
Hello, I have some questions related to the Data Sources API - 1. Is the Data Source API stable as of Spark 1.3.0? 2. The Data Source API seems to be available only in Scala. Is there any plan to make it available for Java too? 3. Are only filters and projections pushed down to the data

Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-24 Thread Marcelo Vanzin
Does your application actually fail? That message just means there's another application listening on that port. Spark should try to bind to a different one after that and keep going. On Tue, Mar 24, 2015 at 12:43 PM, , Roy rp...@njit.edu wrote: I get following message for each time I run spark