Re: spark mesos deployment : starting workers based on attributes

2015-04-04 Thread Ankur Chauhan
Hi, Created issue: https://issues.apache.org/jira/browse/SPARK-6707 I would really appreciate ideas/views/opinions on this feature. -- Ankur Chauhan On 03/04/2015 13:23, Tim Chen wrote: Hi Ankur, There isn't a way to do that yet, but it's

Re: conversion from java collection type to scala JavaRDD<Object>

2015-04-04 Thread Jeetendra Gangele
Hi, I have tried with parallelize but I got the exception below: java.io.NotSerializableException: pacific.dr.VendorRecord Here is my code: List<VendorRecord> vendorRecords = blockingKeys.getMatchingRecordsWithscan(matchKeysOutput); JavaRDD<VendorRecord> lines = sc.parallelize(vendorRecords) On 2
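A minimal sketch of the usual fix, assuming pacific.dr.VendorRecord is a class under your control: every element placed into an RDD must be serializable, either via java.io.Serializable or by registering the class with Kryo.

    // Option 1: make the class Java-serializable (assumed definition).
    class VendorRecord extends Serializable {
      // fields ...
    }

    // Option 2: switch to Kryo serialization and register the class.
    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[VendorRecord]))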

Spark Vs MR

2015-04-04 Thread SamyaMaiti
How is Spark faster than MR when the data is on disk in both cases?

Re: Migrating from Spark 0.8.0 to Spark 1.3.0

2015-04-04 Thread Nick Pentreath
It shouldn't be too bad. The pertinent migration notes are here: http://spark.apache.org/docs/1.0.0/programming-guide.html#migrating-from-pre-10-versions-of-spark for pre-1.0, and here: http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13 for

Issue of sqlContext.createExternalTable with parquet partition discovery after changing folder structure

2015-04-04 Thread Rex Xiong
Hi Spark Users, I'm testing the new Parquet partition discovery feature in 1.3. I have 2 sub-folders, each with 800 rows: /data/table1/key=1 /data/table1/key=2 In spark-shell, run this command: val t = sqlContext.createExternalTable("table1", "hdfs:///data/table1", "parquet") t.count It shows 1600

Re: 4 seconds to count 13M lines. Does it make sense?

2015-04-04 Thread SamyaMaiti
Reduce *spark.sql.shuffle.partitions* from the default of 200 to the total number of cores.
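For example (a hedged sketch; 8 stands in for your actual total core count):

    // Lower the number of shuffle partitions for a small cluster or job.
    sqlContext.setConf("spark.sql.shuffle.partitions", "8")
    // or equivalently via SQL:
    sqlContext.sql("SET spark.sql.shuffle.partitions=8")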

Re: Spark Vs MR

2015-04-04 Thread Sean Owen
If data is on HDFS, it is not read any more or less quickly by either framework. Both are in fact using the same logic to exploit locality, and read and deserialize data anyway. I don't think this is what anyone claims though. Spark can be faster in a multi-stage operation, which would require

Need help with ALS Recommendation code

2015-04-04 Thread Phani Yadavilli -X (pyadavil)
Hi, I am trying to run the following command in the Movie Recommendation example provided by the AMP Camp tutorial. Command: sbt package run /movielens/medium Exception: sbt.TrapExitSecurityException thrown from the UncaughtExceptionHandler in thread run-main-0 java.lang.RuntimeException:

Re: newAPIHadoopRDD Multiple scan results returned from HBase

2015-04-04 Thread Jeetendra Gangele
Here is my conf object, passed as the first parameter of the API. I want to pass multiple scans: I have 4 criteria for START ROW and STOP ROW in the same table. Using the code below I can only get results for one START ROW and STOP ROW. Configuration conf = DBConfiguration.getConf(); // int scannerTimeout =
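One workaround (a sketch, not from the thread): call newAPIHadoopRDD once per row-key range and union the resulting RDDs. The table name and the four ranges below are made up.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    // Hypothetical row-key ranges standing in for the 4 criteria.
    val ranges = Seq(("a", "d"), ("f", "h"), ("k", "m"), ("p", "r"))

    val perRange = ranges.map { case (start, stop) =>
      val conf = HBaseConfiguration.create()
      conf.set(TableInputFormat.INPUT_TABLE, "vendor_table") // assumed table name
      conf.set(TableInputFormat.SCAN_ROW_START, start)
      conf.set(TableInputFormat.SCAN_ROW_STOP, stop)
      sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
        classOf[ImmutableBytesWritable], classOf[Result])
    }

    // One logical RDD over all four ranges, at the cost of four scans.
    val allRows = sc.union(perRange)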

Re: Parquet timestamp support for Hive?

2015-04-04 Thread Cheng Lian
Avoiding maintaining a separate Hive version was one of the initial purposes of Spark SQL (we had once done this for Shark). The org.spark-project.hive:hive-0.13.1a artifact only cleans up some 3rd-party dependencies to avoid dependency hell in Spark. This artifact is exactly the same as Hive 0.13.1

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-04 Thread Cheng Lian
I think this is a bug in Spark SQL dating back to at least 1.1.0. The json_tuple function is implemented as org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple. The ClassNotFoundException should complain with the class name rather than the UDTF function name. The problematic line
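For reference, a minimal sketch of the code path that triggers the error; the events table, json_col column, and hiveContext are made up for illustration:

    // Invoke json_tuple through HiveQL to extract two fields from a JSON string column.
    val result = hiveContext.sql("""
      SELECT t.a, t.b
      FROM events
      LATERAL VIEW json_tuple(json_col, 'a', 'b') t AS a, b
    """)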

Re: Spark Streaming FileStream Nested File Support

2015-04-04 Thread Akhil Das
We have a custom version/build of Spark Streaming that does the nested S3 lookups faster (it uses the native S3 APIs). You can find the source code over here: https://github.com/sigmoidanalytics/spark-modified, in particular the changes from here

newAPIHadoopRDD Multiple scan results returned from HBase

2015-04-04 Thread Jeetendra Gangele
Hi All, Can we get the results of multiple scans from JavaSparkContext.newAPIHadoopRDD over HBase? This method's first parameter takes a configuration object to which I have added a filter, but how can I query multiple scans of the same table while calling this API only once? regards jeetendra

Re: Parquet Hive table become very slow on 1.3?

2015-04-04 Thread Cheng Lian
Hey Xudong, We have been digging into this issue for a while, and believe PR 5339 http://github.com/apache/spark/pull/5339 and PR 5334 http://github.com/apache/spark/pull/5334 should fix it. There are two problems: 1. Normally we cache Parquet table metadata for better performance, but when

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-04 Thread Cheng Lian
Filed https://issues.apache.org/jira/browse/SPARK-6708 to track this. Cheng On 4/4/15 10:21 PM, Cheng Lian wrote: I think this is a bug in Spark SQL dating back to at least 1.1.0. The json_tuple function is implemented as org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple. The

Re: conversion from java collection type to scala JavaRDD<Object>

2015-04-04 Thread Dean Wampler
Without the rest of your code, it's hard to know what might be unserializable. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler

Re: Issue of sqlContext.createExternalTable with parquet partition discovery after changing folder structure

2015-04-04 Thread Cheng Lian
You need to refresh the external table manually after updating the data source outside Spark SQL: - via the Scala API: sqlContext.refreshTable("table1") - via SQL: REFRESH TABLE table1; Cheng On 4/4/15 5:24 PM, Rex Xiong wrote: Hi Spark Users, I'm testing the new Parquet partition discovery feature in 1.3
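Putting the two pieces together, a sketch of the full flow (paths as in the question; the added key=3 folder is hypothetical):

    // Create the external table and read the initial partitions.
    val t = sqlContext.createExternalTable("table1", "hdfs:///data/table1", "parquet")
    t.count() // 1600

    // ... a new folder /data/table1/key=3 is written outside Spark SQL ...

    // Refresh so the new partition folder is discovered.
    sqlContext.refreshTable("table1")
    sqlContext.table("table1").count() // now includes key=3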

DataFrame groupBy MapType

2015-04-04 Thread Justin Yip
Hello, I have a case class like this: case class A(m: Map[Long, Long], ...) and constructed a DataFrame from a Seq[A]. I would like to perform a groupBy on A.m(SomeKey). I can implement a UDF, create a new Column, then invoke groupBy on the new Column. But is that the idiomatic way of doing
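A minimal sketch of the UDF approach described above (Spark 1.3 API); the key 1L, the extra value field, and the sample data are made up for illustration:

    import org.apache.spark.sql.functions._

    case class A(m: Map[Long, Long], value: Long) // `value` is a made-up measure column
    val df = sqlContext.createDataFrame(Seq(
      A(Map(1L -> 10L), 5L), A(Map(1L -> 10L), 7L), A(Map(1L -> 20L), 1L)))

    // UDF that pulls m(SomeKey) out into its own column (key 1L assumed).
    val mKey = udf((m: scala.collection.Map[Long, Long]) => m.getOrElse(1L, -1L))
    df.groupBy(mKey(col("m")).as("mKey")).sum("value").show()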

Spark Streaming program questions

2015-04-04 Thread nickos168
I have two questions: 1) In a Spark Streaming program, after the various DStream transformations have been set up, the ssc.start() method is called to start the computation. Can the underlying DAG change (i.e. add another map or maybe a join) after ssc.start() has been called (and maybe

Re: Spark Streaming program questions

2015-04-04 Thread Aj K
UNSUBSCRIBE On Sun, Apr 5, 2015 at 6:43 AM, nickos168 nickos...@yahoo.com.invalid wrote: I have two questions: 1) In a Spark Streaming program, after the various DStream transformations have been set up, the ssc.start() method is called to start the computation. Can the underlying DAG

Re: Spark SQL Self join with agreegate

2015-04-04 Thread SachinJanani
I am not sure whether this is possible, but I have tried something like SELECT time, src, dst, sum(val1), sum(val2) FROM table GROUP BY src, dst; and it works. I think it will give the same answer as you are expecting.

Re: Spark + Kinesis

2015-04-04 Thread Vadim Bichutskiy
Hi all, More good news! I was able to use mergeStrategy to assemble my Kinesis consumer into an uber jar. Here's what I added to build.sbt: mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) => { case PathList("com", "esotericsoftware", "minlog", xs @ _*) =>
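For context, the complete block in the sbt-assembly README of that era has this shape; the minlog case is from the message above, the rest is illustrative, not necessarily Vadim's full list:

    // build.sbt, assuming sbt-assembly 0.11.x
    import sbtassembly.Plugin._
    import AssemblyKeys._

    mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
      {
        // Take the first copy of the conflicting minlog classes.
        case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first
        // Everything else keeps the default behavior.
        case x => old(x)
      }
    }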

Processing Time Spikes (Spark Streaming)

2015-04-04 Thread t1ny
Hi all, I am running some benchmarks on a simple Spark application which consists of: - textFileStream() to extract text records from HDFS files - map() to parse records into JSON objects - updateStateByKey() to calculate and store an in-memory state for each key. The processing time per batch
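A minimal sketch of that pipeline; the batch interval, paths, parsing logic, and state function are all placeholder assumptions, and sc is an existing SparkContext:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))          // assumed batch interval
    ssc.checkpoint("hdfs:///tmp/checkpoints")                // required by updateStateByKey

    val lines = ssc.textFileStream("hdfs:///data/incoming")  // assumed input directory
    val keyed = lines.map { line =>
      val fields = line.split(",", 2)                        // placeholder for JSON parsing
      (fields(0), fields.lift(1).getOrElse(""))
    }
    // Keep a running per-key record count as the in-memory state.
    val state = keyed.updateStateByKey[Long] { (values: Seq[String], current: Option[Long]) =>
      Some(current.getOrElse(0L) + values.size)
    }
    state.print()

    ssc.start()
    ssc.awaitTermination()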

UNRESOLVED DEPENDENCIES while building Spark 1.3.0

2015-04-04 Thread mas
Hi All, I am trying to build Spark 1.3.0 on a standalone Ubuntu 14.04 machine. I am using the sbt/sbt assembly command to build it. This command works fine with Spark 1.1.0, but for Spark 1.3 it gives the following error. Any help or suggestions to resolve this problem will highly

UNRESOLVED DEPENDENCIES while building Spark 1.3.0

2015-04-04 Thread mas
Hi All, I am trying to build Spark 1.3.0 on standalone Ubuntu 14.04. I am using the sbt command, i.e. sbt/sbt assembly, to build it. This command works fine with Spark 1.1; however, it gives the following error with Spark 1.3.0. Any help or suggestions to resolve this would highly be

Re: UNRESOLVED DEPENDENCIES while building Spark 1.3.0

2015-04-04 Thread Dean Wampler
Use the Maven build instead. From the README in the git repo (https://github.com/apache/spark): mvn -DskipTests clean package Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler

CPU Usage for Spark Local Mode

2015-04-04 Thread Wenlei Xie
Hi, I am currently testing my application with Spark in local mode, and I set the master to local[4]. One thing I notice is that when a groupBy/reduceBy operation is involved, the CPU usage can sometimes be around 600% to 800%. I am wondering if this is expected? (As only 4 worker threads