Re: How to insert complex types like map<string,map<string,int>> in spark sql

2014-11-25 Thread critikaled
https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala Doesn't look like Spark SQL supports nested complex types right now

Spark SQL Join returns less rows than expected

2014-11-25 Thread david
Hi, I have 2 files which come from CSV imports of 2 Oracle tables. F1 has 46730613 rows, F2 has 3386740 rows. I build 2 tables with Spark and join table F1 with table F2 on c1=d1. All keys in F2.d1 exist in F1.c1, so I expect to retrieve 46730613 rows, but it returns only 3437 rows.

Understanding stages in WebUI

2014-11-25 Thread Tsai Li Ming
Hi, I have the classic word count example: file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect() From the Job UI, I can only see 2 stages: 0-collect and 1-map. What happened to the ShuffledRDD in reduceByKey? And both flatMap and map operations are collapsed into a
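A minimal spark-shell-style sketch of the same pipeline ("input.txt" is a placeholder path, `sc` comes from the shell): the narrow flatMap and map transformations get pipelined into a single stage, and the ShuffledRDD created by reduceByKey only shows up as the boundary between the two stages, which you can see in the lineage.

    // Word count as in the question above.
    val counts = sc.textFile("input.txt")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The lineage names the ShuffledRDD even though the UI shows only two stages:
    // the shuffle is the boundary between the "map" stage and the "collect" stage.
    println(counts.toDebugString)
    counts.collect()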

RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-25 Thread Judy Nash
Made progress but still blocked. After recompiling the code on cmd instead of PowerShell, now I can see all 5 classes as you mentioned. However I am still seeing the same error as before. Anything else I can check for? From: Judy Nash [mailto:judyn...@exchange.microsoft.com] Sent: Monday,

Re: How to insert complex types like map<string,map<string,int>> in spark sql

2014-11-25 Thread critikaled
Exactly, that seems to be the problem; will have to wait for the next release.

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-25 Thread shahab
Thanks a lot, both solutions work. best, /Shahab On Tue, Nov 18, 2014 at 5:28 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to increment them like so: val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)

Re: Ideas on how to use Spark for anomaly detection on a stream of data

2014-11-25 Thread Sean Owen
Yes, and I prepared a basic talk on this exact topic. Slides here: http://www.slideshare.net/srowen/anomaly-detection-with-apache-spark-41975155 This is elaborated in a chapter of an upcoming book that's available in early release; you can look at the accompanying source code to get some ideas

Re: Spark SQL - Any time line to move beyond Alpha version ?

2014-11-25 Thread Matei Zaharia
The main reason for the alpha tag is actually that APIs might still be evolving, but we'd like to freeze the API as soon as possible. Hopefully it will happen in one of 1.3 or 1.4. In Spark 1.2, we're adding an external data source API that we'd like to get experience with before freezing it.

Re: streaming linear regression is not building the model

2014-11-25 Thread Yanbo Liang
Computation is triggered by new files added to the directory. If you place new files in the directory, it will start training the model. 2014-11-11 5:03 GMT+08:00 Bui, Tri tri@verizonwireless.com.invalid: Hi, The model weight is not updating for streaming linear regression. The

Re: How to insert complex types like map<string,map<string,int>> in spark sql

2014-11-25 Thread Cheng Lian
Spark SQL supports complex types, but casting doesn't work for complex types right now. On 11/25/14 4:04 PM, critikaled wrote: https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala Doesn't

Re: Spark SQL Join returns less rows than expected

2014-11-25 Thread Cheng Lian
Which version are you using? Or if you are using the most recent master or branch-1.2, which commit are you using? On 11/25/14 4:08 PM, david wrote: Hi, I have 2 files which come from csv import of 2 Oracle tables. F1 has 46730613 rows F2 has 3386740 rows I build 2 tables with

Re: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-11-25 Thread Yanbo Liang
The case runs correctly in my environment. 14/11/25 17:48:20 INFO regression.StreamingLinearRegressionWithSGD: Model updated at time 141690890 ms 14/11/25 17:48:20 INFO regression.StreamingLinearRegressionWithSGD: Current model: weights, [0.8588] Can you provide more detail

Re: Unable to use Kryo

2014-11-25 Thread Daniel Haviv
The problem was I didn't use the correct class name, it should be org.apache.spark.*serializer*.KryoSerializer On Mon, Nov 24, 2014 at 11:12 PM, Daniel Haviv danielru...@gmail.com wrote: Hi, I want to test Kryo serialization but when starting spark-shell I'm hitting the following error:
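A minimal sketch of the corrected configuration, assuming a standalone application that builds its own SparkConf (the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // The serializer must be referenced by its fully qualified name under the
    // org.apache.spark.serializer package.
    val conf = new SparkConf()
      .setAppName("kryo-test")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)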

K-means clustering

2014-11-25 Thread amin mohebbi
I have generated a sparse matrix in Python, which has the size of 4000*174000 (.pkl); the following is a small part of this matrix: (0, 45) 1  (0, 413) 1  (0, 445) 1  (0, 107) 4  (0, 80) 2  (0, 352) 1  (0, 157) 1  (0, 191) 1  (0, 315) 1  (0, 395) 4  (0, 282) 3  (0, 184) 1  (0, 403) 1  (0,

restructure key-value pair with lambda in java

2014-11-25 Thread robertrichter
Hello, I have a key value pair whose value is an ArrayList, and I would like to move one value of the ArrayList to the key position and the key back into the ArrayList. Is it possible to do this with a Java lambda expression? This works in Python: newMap = sourceMap.map(lambda (key,((value1,

Lifecycle of RDD in spark-streaming

2014-11-25 Thread Mukesh Jha
Hey Experts, I wanted to understand in detail the lifecycle of RDDs in a streaming app. From my current understanding: - RDDs get created out of the realtime input stream. - Transformation functions are applied in a lazy fashion on the RDD to transform it into other RDDs. - Actions are

RE: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread Naveen Kumar Pokala
Hi, While submitting your Spark job, mention --executor-cores 2 --num-executors 24; it will divide the dataset into 24*2 parquet files. Or set a spark.default.parallelism value like 50 on the SparkConf object; it will divide the dataset into 50 files in your HDFS. -Naveen -Original

Re: advantages of SparkSQL?

2014-11-25 Thread mrm
Thank you for answering, this is all very helpful!

Spark cluster with Java 8 using ./spark-ec2

2014-11-25 Thread Jon Chase
I'm trying to use the spark-ec2 command to launch a Spark cluster that runs Java 8, but so far I haven't been able to get the Spark processes to use the right JVM at start up. Here's the command I use for launching the cluster. Note I'm using the user-data feature to install Java 8: ./spark-ec2

ALS train error

2014-11-25 Thread Saurabh Agrawal
Hi, I am getting the following error val model = ALS.train(ratings, rank, numIterations, 0.01) org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 103.0 failed 1 times, most recent failure: Lost task 1.0 in stage 103.0 (TID 3, localhost): scala.MatchError:

Remapping columns from a schemaRDD

2014-11-25 Thread Daniel Haviv
Hi, I'm selecting columns from a json file, transforming some of them, and would like to store the result as a parquet file, but I'm failing. This is what I'm doing: val jsonFiles = sqlContext.jsonFile("/requests.loading") jsonFiles.registerTempTable("jRequests") val clean_jRequests = sqlContext.sql("select

RE: Spark Streaming with Python

2014-11-25 Thread Venkat, Ankam
Any idea how to resolve this? Regards, Venkat From: Venkat, Ankam Sent: Sunday, November 23, 2014 12:05 PM To: 'user@spark.apache.org' Subject: Spark Streaming with Python I am trying to run network_wordcount.py example mentioned at

RE: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-11-25 Thread Bui, Tri
Thanks Liang! It was my bad; I fat-fingered one of the data points. After correcting it, the result matches yours. I am still not able to get the intercept. I am getting [error] /data/project/LinearRegression/src/main/scala/StreamingLinearRegression.scala:47: value setIntercept is not a member of

Re: Spark and Stanford CoreNLP

2014-11-25 Thread Christopher Manning
I'm not (yet!) an active Spark user, but saw this thread on Twitter, and am involved with Stanford CoreNLP. Could someone explain how things would need to be structured to work better with Spark, since that would be a useful goal? That is, while Stanford CoreNLP is not quite uniform (being developed by

Spark yarn cluster Application Master not running yarn container

2014-11-25 Thread firemonk9
I am running a 3-node (32 cores, 60 GB) YARN cluster for Spark jobs. 1) Below are my YARN memory settings: yarn.nodemanager.resource.memory-mb = 52224 yarn.scheduler.minimum-allocation-mb = 40960 yarn.scheduler.maximum-allocation-mb = 52224 Apache Spark memory settings: export

why MatrixFactorizationModel private?

2014-11-25 Thread jamborta
Hi all, it seems that all the MLlib models are declared accessible in the package, except MatrixFactorizationModel, which is declared private to mllib. Any reason why? Thanks,

Re: Spark and Stanford CoreNLP

2014-11-25 Thread Evan R. Sparks
Chris, Thanks for stopping by! Here's a simple example. Imagine I've got a corpus of data, which is an RDD[String], and I want to do some POS tagging on it. In naive spark, that might look like this: val props = new Properties.setAnnotators(pos) val proc = new StanfordCoreNLP(props) val data =
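A hedged sketch of one common way to make the naive version cheaper (not necessarily what this thread settled on): since the pipeline is expensive to construct and not serializable, build it once per partition with mapPartitions. `data` is assumed to be the RDD[String] corpus from the example above.

    import java.util.Properties
    import edu.stanford.nlp.pipeline.StanfordCoreNLP

    val tagged = data.mapPartitions { docs =>
      // One pipeline per partition instead of one per record.
      val props = new Properties()
      props.setProperty("annotators", "tokenize, ssplit, pos")
      val pipeline = new StanfordCoreNLP(props)
      docs.map { text =>
        val annotation = pipeline.process(text)
        annotation.toString // placeholder: pull the POS tags out of the Annotation as needed
      }
    }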

Re: How to keep a local variable in each cluster?

2014-11-25 Thread zh8788
Any comments?

Spark shell running on mesos

2014-11-25 Thread José Guilherme Vanz
Hi! I started play with Spark some days ago and now I'm configuring a little cluster to play during my development. For this task, I'm using Apache Mesos running in Linux container managed by Docker. The mesos master and slave are running. I can see the webui and everything looks fine. I am

Re: Remapping columns from a schemaRDD

2014-11-25 Thread Michael Armbrust
Probably the easiest/closest way to do this would be with a UDF, something like: registerFunction("makeString", (s: Seq[String]) => s.mkString(",")) sql("SELECT *, makeString(c8) AS newC8 FROM jRequests") Although this does not modify a column, but instead appends a new column. Another more

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread Michael Armbrust
repartition and coalesce should both allow you to achieve what you describe. Can you maybe share the code that is not working? On Mon, Nov 24, 2014 at 8:24 PM, tridib tridib.sama...@live.com wrote: Hello, I am reading around 1000 input files from disk in an RDD and generating parquet. It

Re: Merging Parquet Files

2014-11-25 Thread Michael Armbrust
You'll need to be running a very recent version of Spark SQL as this feature was just added. On Tue, Nov 25, 2014 at 1:01 AM, Daniel Haviv danielru...@gmail.com wrote: Hi, Thanks for your reply.. I'm trying to do what you suggested but I get: scala sqlContext.sql(CREATE TEMPORARY TABLE data

RDD Cache Cleanup

2014-11-25 Thread sranga
Hi I am noticing that the RDDs that are persisted get cleaned up very quickly. This usually happens in a matter of a few minutes. I tried setting a value of 20 hours for the /spark.cleaner.ttl/ property and still get the same behavior. In my use-case, I have to persist about 20 RDDs each of size

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
I am experimenting with two files and trying to generate 1 parquet file. public class CompactParquetGenerator implements Serializable { public void generateParquet(JavaSparkContext sc, String jsonFilePath, String parquetPath) { //int MB_128 = 128*1024*1024;

Re: Is spark streaming +MlLib for online learning?

2014-11-25 Thread Xiangrui Meng
In 1.2, we added streaming k-means: https://github.com/apache/spark/pull/2942 . -Xiangrui On Mon, Nov 24, 2014 at 5:25 PM, Joanne Contact joannenetw...@gmail.com wrote: Thank you Tobias! On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Tue, Nov 25, 2014 at

Why is this operation so expensive

2014-11-25 Thread Steve Lewis
I have a JavaPairRDD<KeyType, Tuple2<Type1, Type2>> originalPairs. There are on the order of 100 million elements. I call a function to rearrange the tuples: JavaPairRDD<String, Tuple2<Type1, Type2>> newPairs = originalPairs.values().mapToPair(new PairFunction<Tuple2<Type1, Type2>, String, Tuple2<Type1, Type2

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
public void generateParquet(JavaSparkContext sc, String jsonFilePath, String parquetPath) { //int MB_128 = 128*1024*1024; //sc.hadoopConfiguration().setInt(dfs.blocksize, MB_128); //sc.hadoopConfiguration().setInt(parquet.block.size, MB_128); JavaSQLContext

Re: K-means clustering

2014-11-25 Thread Xiangrui Meng
There is a simple example here: https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py . You can take advantage of sparsity by computing the distance via inner products: http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2 -Xiangrui On Tue, Nov 25, 2014 at 2:39
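A hedged Scala sketch of the MLlib side (the question's matrix is a Python .pkl, so the ingestion step is omitted): sparse vectors are built with the full dimensionality, 174000, and fed straight to k-means. Assumes an existing SparkContext `sc`.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val dim = 174000
    val points = sc.parallelize(Seq(
      Vectors.sparse(dim, Array(45, 107, 413), Array(1.0, 4.0, 1.0)),
      Vectors.sparse(dim, Array(80, 157, 352), Array(2.0, 1.0, 1.0))
    ))
    val model = KMeans.train(points, 2, 10) // k = 2, maxIterations = 10
    println(model.clusterCenters.length)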

Spark sql UDF for array aggregation

2014-11-25 Thread Barua, Seemanto
Hi, I am looking for some resources/tutorials that will help me achieve this: My JavaSchemaRDD is from JSON objects like the ones below. How do I go about writing a UDF aggregate function, let's say 'vectorAgg', which I can call from SQL and which returns one result array that is a positional aggregate across

Re: MlLib Colaborative filtering factors

2014-11-25 Thread Xiangrui Meng
It is data-dependent, and hence needs hyper-parameter tuning, e.g., grid search. The first batch is certainly expensive. But after you figure out a small range for each parameter that fits your data, following batches should be not that expensive. There is an example from AMPCamp:

Re: Remapping columns from a schemaRDD

2014-11-25 Thread Daniel Haviv
Thank you. How can I address more complex columns like maps and structs? Thanks again! Daniel On 25 Nov 2014, at 19:43, Michael Armbrust mich...@databricks.com wrote: Probably the easiest/closest way to do this would be with a UDF, something like: registerFunction("makeString", (s:

Re: why MatrixFactorizationModel private?

2014-11-25 Thread Xiangrui Meng
Besides API stability concerns, models constructed directly by users rather than returned by ALS may not work well. The userFeatures and productFeatures both come with partitioners so we can perform quick lookups for prediction. If you save userFeatures and productFeatures and load them back, it

Re: why MatrixFactorizationModel private?

2014-11-25 Thread jamborta
hi Xiangrui, thanks. that is a very useful feature. any suggestion on saving/loading the model in the meantime?
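A heavily hedged sketch of one interim workaround (not an official API, and per Xiangrui's caveat earlier in this thread, reloaded factors lose their partitioners, so per-key lookups are slow): persist the factor RDDs themselves and score with a dot product. Paths and ids are placeholders; `model` is a trained MatrixFactorizationModel.

    import org.apache.spark.SparkContext._

    model.userFeatures.saveAsObjectFile("hdfs:///models/als/user")
    model.productFeatures.saveAsObjectFile("hdfs:///models/als/product")

    // Later, in another job:
    val users    = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/user")
    val products = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/product")
    val uVec = users.lookup(42).head    // placeholder user id
    val pVec = products.lookup(7).head  // placeholder product id
    val predictedRating = uVec.zip(pVec).map { case (a, b) => a * b }.sum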

Re: Remapping columns from a schemaRDD

2014-11-25 Thread Michael Armbrust
Maps should just be Scala maps; structs are rows inside of rows. If you want to return a struct from a UDF you can do that with a case class. On Tue, Nov 25, 2014 at 10:25 AM, Daniel Haviv danielru...@gmail.com wrote: Thank you. How can I address more complex columns like maps and structs?

Re: Spark sql UDF for array aggregation

2014-11-25 Thread Michael Armbrust
We don't support native UDAs at the moment in Spark SQL. You can write a UDA using Hive's API and use that within Spark SQL On Tue, Nov 25, 2014 at 10:10 AM, Barua, Seemanto seemanto.ba...@jpmchase.com.invalid wrote: Hi, I am looking for some resources/tutorials that will help me achive

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread Michael Armbrust
RDDs are immutable, so calling coalesce doesn't actually change the RDD but instead returns a new RDD that has fewer partitions. You need to save that to a variable and call saveAsParquetFile on the new RDD. On Tue, Nov 25, 2014 at 10:07 AM, tridib tridib.sama...@live.com wrote: public
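A Scala sketch of the fix (the snippet in the question is Java, but the principle is the same; jsonFilePath and parquetPath are the variables from the question, and an existing SQLContext is assumed):

    val schemaRdd = sqlContext.jsonFile(jsonFilePath)
    val compacted = schemaRdd.coalesce(1)     // coalesce returns a NEW RDD with fewer partitions
    compacted.saveAsParquetFile(parquetPath)  // save the coalesced RDD, not the original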

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
Ohh... how could I miss that. :( Thanks!

using MultipleOutputFormat to ensure one output file per key

2014-11-25 Thread Arpan Ghosh
Hi, How can I implement a custom MultipleOutputFormat and specify it as the output of my Spark job so that I can ensure that there is a unique output file per key (instead of a unique output file per reducer)? Thanks Arpan

Re: Ideas on how to use Spark for anomaly detection on a stream of data

2014-11-25 Thread Natu Lauchande
Fantastic!!! Exactly what I was looking for. Thanks, Natu On Tue, Nov 25, 2014 at 10:46 AM, Sean Owen so...@cloudera.com wrote: Yes, and I prepared a basic talk on this exact topic. Slides here: http://www.slideshare.net/srowen/anomaly-detection-with-apache-spark-41975155 This is

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
Thanks Michael, it worked like a charm! I have a few more queries: 1. Is there a way to control the size of the parquet files? 2. Which method do you recommend: coalesce(n, true), coalesce(n, false) or repartition(n)? Thanks & Regards, Tridib

Re: rack-topology.sh no such file or directory

2014-11-25 Thread Arun Luthra
Problem was solved by having the admins put this file on the edge nodes. Thanks, Arun On Wed, Nov 19, 2014 at 12:27 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Your Hadoop configuration is set to look for this file to determine racks. Is the file present on cluster nodes? If not, look at

RE: Spark SQL parser bug?

2014-11-25 Thread Leon
Hello, I just stumbled on exactly the same issue as you are discussing in this thread. Here are my dependencies: <dependencies> <dependency> <groupId>com.datastax.spark</groupId> <artifactId>spark-cassandra-connector_2.10</artifactId> <version>1.1.0</version>

Re: using MultipleOutputFormat to ensure one output file per key

2014-11-25 Thread Rafal Kwasny
Hi, Arpan Ghosh wrote: Hi, How can I implement a custom MultipleOutputFormat and specify it as the output of my Spark job so that I can ensure that there is a unique output file per key (instead of a a unique output file per reducer)? I use something like this: class KeyBasedOutput[T :
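The class above is cut off, so here is a hedged sketch of the general shape such an output format usually takes (a guess at the pattern, not Rafal's exact code): each record is routed to a file named after its key, and the key itself is dropped from the written line.

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.HashPartitioner

    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      // File name is derived from the key, so each key gets its own output file.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString
      // Drop the key from the written record, keeping only the value.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    // Usage, assuming pairs: RDD[(String, String)]: partition by key first so each
    // key's records land together, then write with the custom output format.
    pairs.partitionBy(new HashPartitioner(16))
      .saveAsHadoopFile("/out/path", classOf[String], classOf[String], classOf[KeyBasedOutput])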

How to execute a custom python library on spark

2014-11-25 Thread Chengi Liu
Hi, I have written a few data structures as classes like the following. So, here is my code structure: project/foo/foo.py , __init__.py /bar/bar.py, __init__.py bar.py imports foo as: from foo.foo import * /execute/execute.py imports bar as: from bar.bar import * Ultimately I am

Re: Spark SQL Join returns less rows that expected

2014-11-25 Thread Yin Huai
I guess you want to use split("\\|") instead of split("|"). On Tue, Nov 25, 2014 at 4:51 AM, Cheng Lian lian.cs@gmail.com wrote: Which version are you using? Or if you are using the most recent master or branch-1.2, which commit are you using? On 11/25/14 4:08 PM, david wrote: Hi, I
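For context, String.split takes a regular expression, and a bare "|" is the alternation operator, which matches the empty string:

    val line = "46730613|foo|bar"
    line.split("\\|")  // Array(46730613, foo, bar) -- splits on the literal pipe
    line.split("|")    // splits between every character -- almost never what you want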

Kryo NPE with Array

2014-11-25 Thread Simone Franzini
I am running into the following NullPointerException: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: underlying (scala.collection.convert.Wrappers$JListWrapper) myArrayField (MyCaseClass) at

Re: How to execute a custom python library on spark

2014-11-25 Thread jay vyas
a quick thought on this: I think this is distro dependent also, right? We ran into a similar issue in https://issues.apache.org/jira/browse/BIGTOP-1546 where it looked like the python libraries might be overwritten on launch. On Tue, Nov 25, 2014 at 3:09 PM, Chengi Liu chengi.liu...@gmail.com

Re: Why is this operation so expensive

2014-11-25 Thread Andrew Ash
Hi Steve, You changed the first value in a Tuple2, which is the one that Spark uses to hash and determine where in the cluster to place the value. By changing the first part of the PairRDD, you've implicitly asked Spark to reshuffle the data according to the new keys. I'd guess that you would
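A Scala sketch of the distinction (the original code is Java; makeNewKey and transform are hypothetical stand-ins for the real logic):

    // Changing the key discards the existing partitioner, so the next key-based
    // operation (combineByKey, reduceByKey, join, ...) must shuffle the data by the new key.
    val rekeyed  = originalPairs.map { case (k, v) => (makeNewKey(k, v), v) }

    // If only the values change, mapValues keeps the partitioner and that shuffle can be avoided.
    val sameKeys = originalPairs.mapValues(transform)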

Re: Configuring custom input format

2014-11-25 Thread Harihar Nahak
Hi, I'm trying to make a custom input format for CSV files. If you can share a little bit more about what you read as input and what you have implemented, I'll try to replicate the same thing. If I find something interesting at my end I'll let you know. Thanks, Harihar - --Harihar --

Re: Configuring custom input format

2014-11-25 Thread Matei Zaharia
How are you creating the object in your Scala shell? Maybe you can write a function that directly returns the RDD, without assigning the object to a temporary variable. Matei On Nov 5, 2014, at 2:54 PM, Corey Nolet cjno...@gmail.com wrote: The closer I look @ the stack trace in the Scala

Data Source for Spark SQL

2014-11-25 Thread ken
I am using Spark SQL from a Hive table with the Parquet SerDe. Most queries are executed from Spark's JDBC Thrift server. Is there a more efficient way to access/query the data? For example, using saveAsParquetFile() and parquetFile() to save/load Parquet data and run queries directly? Thanks, Ken

Re: Configuring custom input format

2014-11-25 Thread Corey Nolet
I was wiring up my job in the shell while I was learning Spark/Scala. I'm getting more comfortable with them both now, so I've been mostly testing through IntelliJ with mock data as inputs. I think the problem lies more on Hadoop than Spark, as the Job object seems to check its state and throw an

RE: Spark SQL parser bug?

2014-11-25 Thread Mohammed Guller
Leon, I solved the problem by creating a work around for it, so didn't have a need to upgrade to 1.1.2-SNAPSHOT. Mohammed -Original Message- From: Leon [mailto:pachku...@gmail.com] Sent: Tuesday, November 25, 2014 11:36 AM To: u...@spark.incubator.apache.org Subject: RE: Spark SQL

Submitting job from local to EC2 cluster

2014-11-25 Thread Yingkai Hu
Hi All, I have Spark deployed to an EC2 cluster and was able to run jobs successfully when the driver resides within the cluster. However, the job was killed when I tried to submit it from local. My guess is the Spark cluster can't open a connection back to the driver since it is on my machine. I'm

Classpath issue: Custom authentication with sparkSQL/Spark 1.2

2014-11-25 Thread arin.g
Hi, I am trying to launch a Spark 1.2 cluster with SparkSQL and custom authentication. After launching the cluster using the ec2 scripts, I copied the following hive-site.xml file into the spark/conf dir: <configuration> <property> <name>hive.server2.authentication</name> <value>CUSTOM</value> </property>

RE: Creating a front-end for output from Spark/PySpark

2014-11-25 Thread Mohammed Guller
Two options that I can think of: 1) Use the Spark SQL Thrift/JDBC server. 2) Develop a web app using some framework such as Play and expose a set of REST APIs for sending queries. Inside your web app backend, you initialize the Spark SQL context only once when your app initializes.

RE: querying data from Cassandra through the Spark SQL Thrift JDBC server

2014-11-25 Thread Mohammed Guller
Thanks, Cheng. As an FYI for others trying to integrate Spark SQL JDBC server with Cassandra - I ended up using CalliopeServer2, which extends the Thrift Server and it was really straightforward. Mohammed From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Saturday, November 22, 2014 3:54

RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-25 Thread Judy Nash
I traced the code and used the following to call: Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal --hiveconf hive.server2.thrift.port=1 The issue ended up being much more fundamental, however. Spark doesn't

Re: Why is this operation so expensive

2014-11-25 Thread Steve Lewis
If I combineByKey in the next step, I suppose I am paying for a shuffle I need anyway - right? Also, if I supply a custom partitioner rather than hash, can I control where and how data is shuffled? Overriding equals and hashCode could be a bad thing, but a custom partitioner is less dangerous. On

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-25 Thread Denny Lee
To determine if this is a Windows vs. other configuration, can you just try to call the Spark-class.cmd SparkSubmit without actually referencing the Hadoop or Thrift server classes? On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com wrote: I traced the code and used

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread Michael Armbrust
I believe coalesce(..., true) and repartition are the same. If the input files are of similar sizes, then coalesce will be cheaper as it introduces a narrow dependency https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf, meaning there won't be a shuffle. However, if there

Re: Configuring custom input format

2014-11-25 Thread Matei Zaharia
Yeah, unfortunately that will be up to them to fix, though it wouldn't hurt to send them a JIRA mentioning this. Matei On Nov 25, 2014, at 2:58 PM, Corey Nolet cjno...@gmail.com wrote: I was wiring up my job in the shell while i was learning Spark/Scala. I'm getting more comfortable with

IDF model error

2014-11-25 Thread Shivani Rao
Hello Spark fans, I am trying to use the IDF model available in Spark MLlib to create a tf-idf representation of an RDD[Vectors]. Below I have attached my MWE. I get the following error: java.lang.IndexOutOfBoundsException: 7 not in [-4,4) at

Issue with Spark latest 1.2.0 build - ClassCastException from [B to SerializableWritable

2014-11-25 Thread lokeshkumar
Hello forum, We are using spark distro built from the source of latest 1.2.0 tag. And we are facing the below issue, while trying to act upon the JavaRDD instance, the stacktrace is given below. Can anyone please let me know, what can be wrong here? java.lang.ClassCastException: [B cannot be

Spark on YARN - master role

2014-11-25 Thread Praveen Sripati
Hi, In Spark on YARN, the AM (driver) will ask the RM for resources. Once the resources are allocated by the RM, the AM will start the executors through the NM. This is my understanding. But, according to the Spark documentation (1), the `spark.yarn.applicationMaster.waitTries` property

Re: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-11-25 Thread Yanbo Liang
Hi Tri, setIntercept() is not a member function of StreamingLinearRegressionWithSGD; it's a member function of LinearRegressionWithSGD (a GeneralizedLinearAlgorithm), which is a member variable (named algorithm) of StreamingLinearRegressionWithSGD. So you need to change your code to: val model = new
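The reply is cut off here; a hedged sketch of the fix it appears to describe (assuming the `algorithm` member is accessible in your Spark version, as the reply implies):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    val model = new StreamingLinearRegressionWithSGD()
      .setStepSize(0.1)
      .setInitialWeights(Vectors.dense(0.0))
    // setIntercept lives on the underlying LinearRegressionWithSGD, not on the streaming wrapper.
    model.algorithm.setIntercept(true)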

Re: IDF model error

2014-11-25 Thread Yanbo Liang
Hi Shivani, You misunderstand the parameters of SparseVector: class SparseVector(override val size: Int, val indices: Array[Int], val values: Array[Double]) extends Vector { } The first parameter is the total length of the Vector rather than the number of non-zero elements. So it
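A tiny illustration of the constructor arguments:

    import org.apache.spark.mllib.linalg.Vectors

    // size is the TOTAL vector length, not the number of non-zeros:
    // a length-10 vector with three non-zero entries.
    val v = Vectors.sparse(10, Array(1, 4, 7), Array(2.0, 5.0, 8.0))
    // Dense equivalent: [0.0, 2.0, 0.0, 0.0, 5.0, 0.0, 0.0, 8.0, 0.0, 0.0]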

do not assemble the spark example jar

2014-11-25 Thread lihu
Hi, The Spark assembly is time-costly. I only need spark-assembly-1.1.0-hadoop2.3.0.jar, not spark-examples-1.1.0-hadoop2.3.0.jar. How do I configure Spark to avoid assembling the examples jar? I know the *export SPARK_PREPEND_CLASSES=true* method can reduce the assembly time, but

Re: do not assemble the spark example jar

2014-11-25 Thread Matei Zaharia
You can do sbt/sbt assembly/assembly to assemble only the main package. Matei On Nov 25, 2014, at 7:50 PM, lihu lihu...@gmail.com wrote: Hi, The spark assembly is time costly. If I only need the spark-assembly-1.1.0-hadoop2.3.0.jar, do not need the

Re: do not assemble the spark example jar

2014-11-25 Thread Matei Zaharia
BTW as another tip, it helps to keep the SBT console open as you make source changes (by just running sbt/sbt with no args). It's a lot faster the second time it builds something. Matei On Nov 25, 2014, at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You can do sbt/sbt

Re: Determine number of running executors

2014-11-25 Thread Tobias Pfeiffer
Hi, Thanks for your help! Sandy, I had a bit of trouble finding the spark.executor.cores property. (It wasn't there although its value should have been 2.) I ended up throwing regular expressions at scala.util.Properties.propOrElse("sun.java.command", ""), which worked surprisingly well ;-) Thanks

RE: beeline via spark thrift doesn't retain cache

2014-11-25 Thread Judy Nash
Thanks Yanbo. My issue was 1). I had the Spark thrift server set up, but it was running against Hive instead of Spark SQL due to a local change. After I fixed this, beeline automatically caches rerun queries + accepts cache table. From: Yanbo Liang [mailto:yanboha...@gmail.com] Sent: Friday, November

Spark 1.1.0 and HBase: Snappy UnsatisfiedLinkError

2014-11-25 Thread Pietro Gentile
Hi everyone, I deployed Spark 1.1.0 and I'm trying to use it with spark-job-server 0.4.0 (https://github.com/ooyala/spark-jobserver). I previously used Spark 1.0.2 and had no problems with it. I want to use the newer version of Spark (and Spark SQL) to create the SchemaRDD programmatically.

Accessing posterior probability of Naive Bayes prediction

2014-11-25 Thread jatinpreet
Hi, I am trying to access the posterior probability of a Naive Bayes prediction with MLlib using Java. As the member variables brzPi and brzTheta are private, I applied a hack to access the values through reflection. I am using Java and couldn't find a way to use the breeze library with Java. If

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
Hi, Looks like the latest SparkSQL with Hive 0.12 has a bug in Parquet support. I got the following exceptions: org.apache.hadoop.hive.ql.parse.SemanticException: Output Format must implement HiveOutputFormat, otherwise it should be either IgnoreKeyTextOutputFormat or SequenceFileOutputFormat

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
Oh, I found a explanation from http://cmenguy.github.io/blog/2013/10/30/using-hive-with-parquet-format-in-cdh-4-dot-3/ The error here is a bit misleading, what it really means is that the class parquet.hive.DeprecatedParquetOutputFormat isn’t in the classpath for Hive. Sure enough, doing a ls

Re: k-means clustering

2014-11-25 Thread Yanbo Liang
Pre-processing is a major workload before training a model. MLlib provides TF-IDF calculation, StandardScaler and Normalizer, which are essential for preprocessing and a great help for model training. Take a look at this: http://spark.apache.org/docs/latest/mllib-feature-extraction.html
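A minimal sketch of those utilities chained together, assuming docs: RDD[Seq[String]] holds the tokenized documents:

    import org.apache.spark.mllib.feature.{HashingTF, IDF, Normalizer}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val tf: RDD[Vector] = new HashingTF().transform(docs)
    tf.cache() // IDF makes two passes over the term-frequency vectors
    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)
    val normalized: RDD[Vector] = new Normalizer().transform(tfidf)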

RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-25 Thread Judy Nash
Looks like a config issue. I ran the SparkPi job and it is still failing with the same Guava error. Command ran: .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class org.apache.spark.examples.SparkPi --master spark://headnodehost:7077 --executor-memory 1G --num-executors 1

Spark setup on local windows machine

2014-11-25 Thread Sunita Arvind
Hi All, I just installed Spark on my laptop and am trying to get spark-shell to work. Here is the error I see: C:\spark\bin>spark-shell Exception in thread main java.util.NoSuchElementException: key not found: CLASSPATH at scala.collection.MapLike$class.default(MapLike.scala:228)

configure to run multiple tasks on a core

2014-11-25 Thread yotto
I'm running a spark-ec2 cluster. I have a map task that calls a specialized C++ external app. The app doesn't fully utilize the core as it needs to download/upload data as part of the task. Looking at the worker nodes, it appears that there is one task with my app running per core. I'd like to

Re: Lifecycle of RDD in spark-streaming

2014-11-25 Thread Mukesh Jha
Any pointers guys? On Tue, Nov 25, 2014 at 5:32 PM, Mukesh Jha me.mukesh@gmail.com wrote: Hey Experts, I wanted to understand in detail about the lifecycle of rdd(s) in a streaming app. From my current understanding - rdd gets created out of the realtime input stream. - Transform(s)

Re: Submitting job from local to EC2 cluster

2014-11-25 Thread Akhil Das
Yes, it is possible to submit jobs to a remote Spark cluster. Just make sure you follow the steps below. 1. Set spark.driver.host to your local IP (where you run your code; it should be accessible from the cluster). 2. Make sure no firewall/router configurations are blocking/filtering the
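A sketch of step 1 (all hostnames and ports are placeholders): the driver advertises an address the cluster can reach, and the driver port is pinned so it can be opened in the security group or firewall mentioned in step 2.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("remote-submit")
      .setMaster("spark://ec2-master-host:7077")       // the cluster's master URL
      .set("spark.driver.host", "my-public-hostname")  // must be reachable FROM the cluster
      .set("spark.driver.port", "7001")                // fixed so it can be whitelisted
    val sc = new SparkContext(conf)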

Re: Spark setup on local windows machine

2014-11-25 Thread Akhil Das
You could try following this guidelines http://docs.sigmoidanalytics.com/index.php/How_to_build_SPARK_on_Windows Thanks Best Regards On Wed, Nov 26, 2014 at 12:24 PM, Sunita Arvind sunitarv...@gmail.com wrote: Hi All, I just installed a spark on my laptop and trying to get spark-shell to

Re: do not assemble the spark example jar

2014-11-25 Thread lihu
Mater, thank you very much! After taking your advice, the assembly time went from about 20 min down to 6 min on my computer. That's a very big improvement. On Wed, Nov 26, 2014 at 12:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW as another tip, it helps to keep the SBT console open as you

Re: do not assemble the spark example jar

2014-11-25 Thread lihu
Matei, sorry for my last typo. And the tip saves about another 30s on my computer. On Wed, Nov 26, 2014 at 3:34 PM, lihu lihu...@gmail.com wrote: Mater, thank you very much! After take your advice, the time for assembly from about 20min down to 6min in my computer. that's a very big

Re: Spark setup on local windows machine

2014-11-25 Thread Sameer Farooqui
Hi Sunita, This gitbook may also be useful for you to get Spark running in local mode on your Windows machine: http://blueplastic.gitbooks.io/how-to-light-your-spark-on-a-stick/content/ On Tue, Nov 25, 2014 at 11:09 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You could try following this