spark table to hive table

2014-05-27 Thread 정재부
Hi all, I'm trying to compare functions available in Spark 1.0 hql to the original HiveQL. But when I tested functions such as 'rank', Spark didn't support some HiveQL functions. In case of Shark, it supports functions as well as Hive, so I want to convert

Re: maprfs and spark libraries

2014-05-27 Thread nelson
As simple as that. Indeed, the Spark jar I was linking to wasn't the MapR version. I just added spark-assembly-0.9.1-hadoop1.0.3-mapr-3.0.3.jar to the lib directory of my project as an unmanaged dependency for sbt. Thank you Cafe au Lait and to all of you guys. Regards, Nelson. -- View this

Map failed [duplicate 1] error

2014-05-27 Thread Joe L
Hi, I am getting the following error but I don't understand what the problem is. 14/05/27 17:44:29 INFO TaskSetManager: Loss was due to java.io.IOException: Map failed [duplicate 15] 14/05/27 17:44:30 INFO TaskSetManager: Starting task 47.0:43 as TID 60281 on executor 0: cm07 (PROCESS_LOCAL)

Re: how to set task number?

2014-05-27 Thread qingyang li
When I use create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001 limit 40; , there will be 4 files created on Tachyon. But when I use create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001 ; , there will be 35

Re: how to control task number?

2014-05-27 Thread qingyang li
When I use create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001 limit 40; , there will be 4 files created on Tachyon. But when I use create table bigtable002 tblproperties('shark.cache'='tachyon') as select * from bigtable001 ; , there will be 35

too many temporary app files left after app finished

2014-05-27 Thread Cheney Sun
Hi, We use Spark 0.9.1 in standalone mode. We found that lots of temporary app files didn't get removed from each worker's local file system even after the job finished. These folders have names such as app-20140516120842-0203. These files occupy so much disk space that we have to run a daemon

Re: spark table to hive table

2014-05-27 Thread John Omernik
Did you try the Hive Context? Look under Hive Support here: http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html On Tue, May 27, 2014 at 2:09 AM, 정재부 itsjb.j...@samsung.com wrote: Hi all, I'm trying to compare functions available in Spark1.0 hql to original
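A minimal sketch of what this suggests, assuming Spark 1.0's HiveContext and its hql method (table and app names are illustrative):

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext("local", "hive-test")  // or reuse an existing SparkContext
  val hiveContext = new HiveContext(sc)

  // hql goes through Hive's parser, so more of HiveQL is available
  // than with the plain SQLContext
  val rows = hiveContext.hql("SELECT key, value FROM src LIMIT 10")
  rows.collect().foreach(println)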

Re: K-nearest neighbors search in Spark

2014-05-27 Thread Carter
Any suggestion is very much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark streaming issue

2014-05-27 Thread Sourav Chandra
Hi, I am facing a weird issue. I am using Spark 0.9 and running a streaming application. In the UI, the duration shows on the order of seconds, but if I dig into that particular stage's details, the total time taken across all tasks for the stage is much, much less (in milliseconds). I am using Fair

Re: Computing cosine similiarity using pyspark

2014-05-27 Thread Jeremy Freeman
Hi Jamal, One nice feature of PySpark is that you can easily use existing functions from NumPy and SciPy inside your Spark code. For a simple example, the following uses Spark's cartesian operation (which combines pairs of vectors into tuples), followed by NumPy's corrcoef to compute the Pearson
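The thread itself is about PySpark and NumPy, but the same cartesian-based idea can be sketched in Scala (illustrative only; sc is an existing SparkContext and the ids/vectors are made up):

  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    dot / norms
  }

  // (id, feature vector) pairs
  val vectors = sc.parallelize(Seq(
    ("a", Array(1.0, 0.0, 2.0)),
    ("b", Array(0.5, 1.0, 0.0)),
    ("c", Array(2.0, 0.0, 4.0))))

  // cartesian pairs every vector with every other; score each distinct pair once
  val similarities = vectors.cartesian(vectors)
    .filter { case ((id1, _), (id2, _)) => id1 < id2 }
    .map { case ((id1, v1), (id2, v2)) => ((id1, id2), cosine(v1, v2)) }
  similarities.collect().foreach(println)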

Re: Spark Summit 2014 (Hotel suggestions)

2014-05-27 Thread Pierre B
Hi everyone! Any recommendation anyone? Pierre -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Summit-2014-Hotel-suggestions-tp5457p6424.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark On Mesos

2014-05-27 Thread Gileny
Hello, I've installed a Spark cluster (spark-0.9.0-incubating-bin-hadoop1), which works fine. Also, on the same cluster I've installed a Mesos cluster, using mesos_0.18.2_x86_64.rpm, which works fine as well. Now, I was trying to follow the instructions from

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-27 Thread Jeremy Lewi
I was able to work around this by switching to the SpecificDatum interface and following this example: https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SerializableAminoAcid.java As in the example, I defined a subclass of my Avro type which implemented the

Re: KryoSerializer Exception

2014-05-27 Thread jaranda
I am experiencing the same issue (I tried both using Kryo as serializer and increasing the buffer size up to 256M, my objects are much smaller though). I share my registrator class just in case: https://gist.github.com/JordiAranda/5cc16cf102290c413c82 Any hints would be highly appreciated.
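For anyone hitting the same thing, the usual knobs look roughly like this (a sketch using the 0.9/1.0-era property names; the registrator class name is a placeholder):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("kryo-test")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // serialization buffer size in MB (property name as of 0.9/1.0)
    .set("spark.kryoserializer.buffer.mb", "256")
    // your own registrator, e.g. the class shared in the gist above
    .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")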

Re: Spark Summit 2014 (Hotel suggestions)

2014-05-27 Thread Gary Malouf
Go to Expedia/Orbitz and look for hotels in the Union Square neighborhood. In my humble opinion, having visited San Francisco, it is worth any extra cost to be as close as possible to the conference rather than having to travel from other parts of the city. On Tue, May 27, 2014 at 9:36 AM, Gerard Maas

Re: Spark Summit 2014 (Hotel suggestions)

2014-05-27 Thread Jerry Lam
Hi guys, I ended up reserving a room at the Phoenix (Hotel: http://www.jdvhotels.com/hotels/california/san-francisco-hotels/phoenix-hotel) recommended by my friend who has been in SF. According to Google, it takes 11min to walk to the conference which is not too bad. Hope this helps! Jerry

Re: Running a spark-submit compatible app in spark-shell

2014-05-27 Thread Roger Hoover
Thanks, Andrew. I'll give it a try. On Mon, May 26, 2014 at 2:22 PM, Andrew Or and...@databricks.com wrote: Hi Roger, This was due to a bug in the Spark shell code, and is fixed in the latest master (and RC11). Here is the commit that fixed it:

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-27 Thread Jeremy Lewi
Thanks that's super helpful. J On Tue, May 27, 2014 at 8:01 AM, Matt Massie mas...@berkeley.edu wrote: I really should update that blog post. I created a gist (see https://gist.github.com/massie/7224868) which explains a cleaner, more efficient approach. -- Matt

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-27 Thread Andrew Ash
Also see this context from February. We started working with Chill to get Avro records automatically registered with Kryo. I'm not sure of the final status, but from Chill PR #172 it looks like this might involve much less friction than before. Issue we filed:

Re: K-nearest neighbors search in Spark

2014-05-27 Thread Andrew Ash
Hi Carter, In Spark 1.0 there will be an implementation of k-means available as part of MLlib. You can see the documentation for that below (until 1.0 is fully released). https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/mllib-clustering.html Maybe diving into the source here will help

Persist and unpersist

2014-05-27 Thread Daniel Darabos
I keep bumping into a problem with persisting RDDs. Consider this (silly) example: def everySecondFromBehind(input: RDD[Int]): RDD[Int] = { val count = input.count if (count % 2 == 0) { return input.filter(_ % 2 == 1) } else { return input.filter(_ % 2 == 0) } } The situation is

Re: file not found

2014-05-27 Thread jaranda
Thanks for the heads up, I also experienced this issue. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/file-not-found-tp1854p6438.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Persist and unpersist

2014-05-27 Thread Nicholas Chammas
Daniel, Is SPARK-1103 https://issues.apache.org/jira/browse/SPARK-1103 related to your example? Automatic unpersist()-ing of unreferenced RDDs would be nice. Nick On Tue, May 27, 2014 at 12:28 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: I keep bumping into a problem with

Re: Akka disassociation on Java SE Embedded

2014-05-27 Thread Aaron Davidson
Sorry, to clarify: Spark *does* effectively turn Akka's failure detector off. On Tue, May 27, 2014 at 10:47 AM, Aaron Davidson ilike...@gmail.com wrote: Spark should effectively turn Akka's failure detector off, because we historically had problems with GCs and other issues causing

Re: Akka disassociation on Java SE Embedded

2014-05-27 Thread Aaron Davidson
Spark should effectively turn Akka's failure detector off, because we historically had problems with GCs and other issues causing disassociations. The only thing that should cause these messages nowadays is if the TCP connection (which Akka sustains between Actor Systems on different machines)

proximity of events within the next group of events instead of time

2014-05-27 Thread Navarro, John
Hi, Spark newbie here with a general question. In a stream consisting of several types of events, how can I detect whether event X happened within Z transactions of event Y? Is it just a matter of iterating through all the RDDs, and when event type Y is found, taking the next Z transactions and checking if

Re: Akka disassociation on Java SE Embedded

2014-05-27 Thread Chanwit Kaewkasi
May be that's explaining mine too. Thank you very much, Aaron !! Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson ilike...@gmail.com wrote: Spark should effectively turn Akka's failure detector off, because we historically

Running Jars on Spark, program just hanging there

2014-05-27 Thread Min Li
Hi all, I have a single machine with 8 cores and 8g mem. I've deployed standalone Spark on the machine and successfully run the examples. Now I'm trying to write some simple Java code. I just read a local file (23M) into a string list and use JavaRDD<String> rdds = sparkContext.parallelize()

Re: Broadcast Variables

2014-05-27 Thread Puneet Lakhina
To answer my own question, that does seem to be the right way. I was concerned about whether the data in a broadcast variable would end up getting serialized if I used it as an instance variable of the function. I realized that doesn't happen because the broadcast variable's value is marked as
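A minimal sketch of the pattern under discussion: keep the broadcast handle in the closure and read .value inside it, rather than capturing the underlying data directly (names are illustrative; sc is an existing SparkContext):

  // the lookup table we want available on every executor
  val lookup = Map(1 -> "a", 2 -> "b")
  val lookupBc = sc.broadcast(lookup)

  val ids = sc.parallelize(1 to 10)
  // only the small broadcast handle is serialized with the closure;
  // the actual map is fetched once per executor via .value
  val labelled = ids.map(id => (id, lookupBc.value.getOrElse(id, "unknown")))
  labelled.collect().foreach(println)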

Re: Invalid Class Exception

2014-05-27 Thread Suman Somasundar
I am running this on a Solaris machine with logical partitions. All the partitions (workers) access the same Spark folder. Thanks, Suman. On 5/23/2014 9:44 PM, Andrew Or wrote: That means not all of your driver and executors have the same version of Spark. Are you on a standalone EC2

Re: Invalid Class Exception

2014-05-27 Thread Marcelo Vanzin
On Tue, May 27, 2014 at 1:05 PM, Suman Somasundar suman.somasun...@oracle.com wrote: I am running this on a Solaris machine with logical partitions. All the partitions (workers) access the same Spark folder. Can you check whether you have multiple versions of the offending class

Re: Running Jars on Spark, program just hanging there

2014-05-27 Thread Yana Kadiyska
Does the spark UI show your program running? (http://spark-masterIP:8118). If the program is listed as running you should be able to see details via the UI. In my experience there are 3 sets of logs -- the log where you're running your program (the driver), the log on the master node, and the log

Spark 1.0: slf4j version conflicts with pig

2014-05-27 Thread Ryan Compton
I use both Pig and Spark. All my code is built with Maven into a giant *-jar-with-dependencies.jar. I recently upgraded to Spark 1.0 and now all my pig scripts fail with: Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job:

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-27 Thread Sean Owen
Spark uses 1.7.5, and you should probably see 1.7.{4,5} in use through Hadoop. But those are compatible. That method appears to have been around since 1.3. What version does Pig want? I usually do mvn -Dverbose dependency:tree to see both what the final dependencies are, and what got

Re: Persist and unpersist

2014-05-27 Thread Ankur Dave
I think what's desired here is for input to be unpersisted automatically as soon as result is materialized. I don't think there's currently a way to do this, but the usual workaround is to force result to be materialized immediately and then unpersist input: input.cache(); val count =
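Spelled out against the earlier example, that workaround looks roughly like this (a sketch; calling count() is just one way to force materialization):

  input.cache()
  val count = input.count()        // materializes and caches input
  val result =
    if (count % 2 == 0) input.filter(_ % 2 == 1)
    else input.filter(_ % 2 == 0)
  result.cache()
  result.count()                   // force result so it no longer depends on a cached input
  input.unpersist()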

Java RDD structure for Matrix predict?

2014-05-27 Thread Sandeep Parikh
I've got a trained MatrixFactorizationModel via ALS.train(...) and now I'm trying to use it to predict some ratings like so: JavaRDD<Rating> predictions = model.predict(usersProducts.rdd()) Where usersProducts is built from an existing Ratings dataset like so: JavaPairRDD<Integer, Integer>

Re: Java RDD structure for Matrix predict?

2014-05-27 Thread giive chen
Hi Sandeep I think you should use testRatings.mapToPair instead of testRatings.map. So the code should be JavaPairRDD<Integer, Integer> usersProducts = training.mapToPair( new PairFunction<Rating, Integer, Integer>() { public Tuple2<Integer, Integer>
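For comparison, the same call in Scala, where the (user, product) tuple structure is explicit (a sketch; assumes model is a trained MatrixFactorizationModel and ratings is an RDD[Rating] held out for testing):

  // model: MatrixFactorizationModel from ALS.train(...); ratings: RDD[Rating]
  val usersProducts = ratings.map(r => (r.user, r.product))
  val predictions = model.predict(usersProducts)  // returns RDD[Rating]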

Re: K-nearest neighbors search in Spark

2014-05-27 Thread Krishna Sankar
Carter, Just as a quick, simple starting point for Spark (caveats: lots of improvements required for scaling, graceful and efficient handling of RDDs, et al.): import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import scala.collection.immutable.ListMap import
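In the same brute-force spirit, a tiny k-nearest-neighbour sketch: all pairs via cartesian, then keep the k smallest distances per point (fine for small data, not a scalable approach; points is an assumed RDD[(Long, Array[Double])] of id/coordinate pairs):

  import org.apache.spark.SparkContext._

  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  val k = 3
  val neighbours = points.cartesian(points)
    .filter { case ((id1, _), (id2, _)) => id1 != id2 }
    .map { case ((id1, p1), (id2, p2)) => (id1, (id2, dist(p1, p2))) }
    .groupByKey()
    .mapValues(_.toSeq.sortBy(_._2).take(k))  // k closest neighbours per point
  neighbours.collect().foreach(println)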

Spark Memory Bounds

2014-05-27 Thread Keith Simmons
I'm trying to determine how to bound my memory use in a job working with more data than can simultaneously fit in RAM. From reading the tuning guide, my impression is that Spark's memory usage is roughly the following: (A) In-Memory RDD use + (B) In memory Shuffle use + (C) Transient memory used
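Two of the knobs that roughly govern (A) and (B) in that breakdown, at least under the 0.9/1.0-era property names (a sketch; exact names and defaults vary by version):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // fraction of each executor's heap reserved for cached RDDs, roughly (A)
    .set("spark.storage.memoryFraction", "0.5")
    // fraction usable by shuffle-side aggregation before spilling, roughly (B)
    .set("spark.shuffle.memoryFraction", "0.2")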

Re: Re: spark table to hive table

2014-05-27 Thread JaeBoo Jung
I already tried HiveContext as well as SqlContext, but it seems that Spark's HiveContext is not completely the same as Apache Hive. For example, SQL like 'SELECT RANK() OVER(ORDER BY VAL1 ASC) FROM TEST LIMIT 10' works fine in Apache Hive, but Spark's Hive

AMPCamp Training materials are broken due to overwritten AMIs?

2014-05-27 Thread Toshinari Kureha
Hi, Has anyone had luck going through previous archives of the AMPCamp exercises? Many of the archived bootcamps seem to be broken because they reference the same AMIs, which are constantly being updated, meaning they are no longer compatible with the old bootcamp instructions or

Re: Spark Memory Bounds

2014-05-27 Thread Christopher Nguyen
Keith, do you mean bound as in (a) strictly control to some quantifiable limit, or (b) try to minimize the amount used by each task? If a, then that is outside the scope of Spark's memory management, which you should think of as an application-level (that is, above JVM) mechanism. In this scope,

Re: Spark Memory Bounds

2014-05-27 Thread Keith Simmons
A dash of both. I want to know enough that I can reason about, rather than strictly control, the amount of memory Spark will use. If I have a big data set, I want to understand how I can design it so that Spark's memory consumption falls below my available resources. Or alternatively, if it's